AACR2 lists four uses for uniform titles, but the most common is to group items that appear with multiple titles under a single heading. Works such as Don Quixote that are published in multiple languages and under hundreds of different titles benefit from this. Unfortunately, when trying to group manifestations into works, uniform titles do not always correspond to what anyone would consider a work.
We have been aware of this since we started trying to group bibliographic records into works (something we've dabbled in for nearly 20 years here at OCLC, and worked on seriously for half that time). My last post about controlling names was an unpleasant reminder of this, since the most popular 'work' presented under our newly controlled J.S. Bach records is actually there because of a MARC21 240 (uniform title) field containing Selections. Our current work clustering always uses the 240 in preference to the title proper reflected in the 245 (title statement) field. Music has its own highly developed approach to uniform titles, but similar groupings occur in other areas.
In Bach's case we found 1,429 different titles collocated under Selections. For some of these, Selections might be the best place to put them, but others, such as Switched-On Bach (by Carlos), have multiple manifestation records and a life of their own beyond simply Selections. Another case we've long known about is Treaties, etc., which groups treaties (e.g. Great Britain, Treaties, etc.). Although different treaties are obviously different works, that clustering somehow seems less surprising than hiding Switched-On Bach under Selections.
Some would probably argue that manifestations collected under Selections are really just themselves collections of works by Bach and some other mechanism is needed to get access to those works. I don't think there are any easy answers to this problem, but we are going to try out (here in OCLC Research first) a fairly simple approach. There are uniform titles that occur so many times that we consider them 'noise' titles for doing things like matching names. For FRBR processing we are going to try ignoring the top 25 uniform titles. Here they are, along with a count of how many times we see them in WorldCat:
3,125 SPEECHES
3,404 CANTATAS
3,873 QUARTETS\STRINGS
4,377 CHORAL MUSIC
4,662 CONSTITUTION
4,761 CHAMBER MUSIC
5,263 ESSAYS
5,428 OPERAS
5,535 SONATAS\PIANO
5,585 SYMPHONIES
7,016 ANNUAL REPORT
7,361 ORGAN MUSIC
8,333 VOCAL MUSIC
8,929 PLAYS
11,483 ORCHESTRA MUSIC
12,899 CORRESPONDENCE
13,191 INSTRUMENTAL MUSIC
14,811 SHORT STORIES
23,098 PIANO MUSIC
24,406 TREATIES ETC
26,234 SONGS
46,877 POEMS
58,303 LAWS ETC
59,210 WORKS
91,940 SELECTIONS
There are a number of other generic uniform titles beyond the top 25, but at that point we start to see uniform titles for works (e.g. The Book of Common Prayer is #26).
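The approach above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the record layout (a dict keyed by "1xx", "240" and "245"), the function names, and the simplified normalization are all hypothetical, and the noise set shown is just a handful of the 25 titles listed above.

```python
import re

# A few of the 'noise' uniform titles from the list above (illustrative subset).
NOISE_TITLES = {"SELECTIONS", "WORKS", "LAWS ETC", "POEMS", "SONGS", "TREATIES ETC"}


def normalize(text):
    """Rough stand-in for real normalization (assumption, not PCC/NACO):
    uppercase, turn punctuation into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w]+", " ", text.upper()).split())


def work_key(record, noise_titles=NOISE_TITLES):
    """Build an author/title key, preferring the 240 unless it is a noise title."""
    author = normalize(record.get("1xx", ""))
    uniform = normalize(record.get("240", ""))
    if uniform and uniform not in noise_titles:
        title = uniform          # a usable uniform title wins over the 245
    else:
        title = normalize(record.get("245", ""))  # fall back to the title proper
    return (author, title)


# Hypothetical records for illustration only.
generic = {"1xx": "Bach, Johann Sebastian", "240": "Selections", "245": "Switched-On Bach"}
specific = {
    "1xx": "Beethoven, Ludwig van",
    "240": "Symphonies, no. 3, op. 55, Eb major",
    "245": "Eroica",
}

print(work_key(generic))
print(work_key(specific))
```

Because the comparison is against the whole normalized 240, only a bare generic title is skipped; a uniform title that merely begins with a generic word still identifies its own work.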
This isn't our first abandonment of the 240 field. WorldCat Identities originally preferred the 240 to the 245 for the work display. Unfortunately relatively few people benefited from seeing Prestuplenie i nakazanie instead of Crime and Punishment, so we switched to using the most common form of the 245 for display.
Note: The list of common uniform titles is in upper case because of normalization. In the past we normalized to lower case for ease of reading, but the latest version of PCC/NACO normalization uses Unicode mappings to normalize case, and since some of these mappings are only available in the uppercase direction, we are following their guidelines and switching to upper case.
--Th
Update (3 December 2008): We couldn't stand the uppercase, so after we've done the normalization we now 'lower' the characters that have a lower case character associated with them.
The list of uniform titles to ignore hasn't changed much, except that 'quartets\strings', 'sonatas\piano' and 'symphonies' have been removed. For non-240 titles we have a longer list generated algorithmically. (For VIAF name matching we have similar lists of titles we don't trust to bring names together, one for each authority file we are processing.)
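The upper-then-lower step mentioned in this update can be illustrated with Python's built-in Unicode case mappings. This is a minimal sketch, not the PCC/NACO rules themselves, and the function name is hypothetical.

```python
def upper_then_lower(text):
    """Uppercase first, then lower every character that has a lowercase form.

    Characters without a lowercase mapping are left as they are.
    """
    upper = text.upper()
    return "".join(ch.lower() for ch in upper)


sample = "Straße"
print(sample.lower())            # lowercasing alone keeps the sharp s
print(upper_then_lower(sample))  # uppercasing first expands it to 'ss'
```

The German sharp s shows why the order matters: its uppercase mapping is "SS", so going through upper case first brings "Straße" and "Strasse" to the same string, which plain lowercasing does not.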
Re: your comment "Our current work clustering always uses the 240 in preference to the title proper reflected in the 245 (title statement) field." I ran into a problem with that logic while trying to search for anthologies of comics in Worldcat.org. Search "au:henley marian" brings up 10 records. "Maxine" is one of the titles I wanted. When I click on that title, I see a note "2 editions". Clicking on that I see another title (Laughing Gas), which is not another edition at all--it's a different compilation of the Maxine cartoons, but both records have "Maxine! |k Selections" in the 240 field, and so Worldcat.org groups them together. In this instance, that 240-over-245 logic should not be used.
Response:
Yes, what we are proposing would avoid that match.
--Th
Posted by: Patt Leonard | April 29, 2008 at 19:57
Doesn't this problem indicate that the use of MARC tag 243 (bibliographic), which hasn't been implemented in systems following LC practice, would be worthwhile? Would it be feasible to implement it retrospectively, using the technique described in your blog about automated linking of bib headings to authorities? There's no comparable provision for name-title tags.
Unfortunately, in the bib format 1XX $u is already taken (though it's unassigned in the authority format).
Response:
Yes, I suppose this would be possible, but you would probably have to get LC to change its practice.
--Th
Posted by: Hal Cain | April 29, 2008 at 21:18
The conventional title 240 is definitely a different animal, and not a FRBR "work" title. A thornier question: what does WC Identities mean by "most widely held works"? In Dostoyevsky's list, "The Gambler" appears three times. Is it three works? Crime and Punishment appears both at the top and at the bottom of the expanded list. So what does "work" mean in WC Identities? In this list, it appears that the FRBR "work" doesn't really "count."
Response:
We group the manifestations into 'works' as best we can based on author/title. In this case, 'The Gambler' was combined with other works, confusing us. In the future we hope to recognize that a manifestation might contain multiple works, but our software isn't up to that challenge yet (they can get very messy with our current cataloging).
--Th
Posted by: Stephen Hearn | April 30, 2008 at 11:32
As part of our Variations3 project here at IU, we are investigating ways to derive work records from MARC records. Not surprisingly, we have spent a great deal of effort dealing with these "collective titles". Happily, in most cases, when the 240 has one of these values, there are analytical 700 added-entry fields from which to derive works. Beyond that, the other source for work-record data is the 505 contents field, which is significantly less bound to a specific format (and thus all but impossible for a machine to parse meaningfully). The upshot of this is: we also ignore these "generic" 240 values. In some cases, they are simply place-holders in my opinion, and useful only for filing in catalog displays. In a FRBR-ized environment, however, they would be even less necessary. Nonetheless, I do not agree with the current effort by the Joint Steering Committee to eliminate the use of "selections" altogether.
Casey Mullin
Metadata Assistant -- Variations3 Digital Music Library
Indiana University
Posted by: Casey Mullin | May 01, 2008 at 16:09
As you note, "Music has its own highly developed approach to uniform titles." So, what is the impact of this decision that considers more than a dozen music-related collective uniform titles "noise"? Isn't the real problem, as Hal Cain has already noted, the fact that these are not coded as collective uniform titles? When WorldCat attempts to group bibliographic records into works, when is it appropriate to use some of these collective uniform titles? For example, do you want to bring together *all* sound recording expressions containing Beethoven's nine symphonies, or should this be restricted to the same performances (e.g., on LP, cassette, CD)?
Response: Actually we are taking another look at some of these. In particular, manifestations that use 'Symphonies' don't seem to have any place better to go. I'll blog about this after we get a little more data.
--Th
Posted by: Kathy Glennan | May 01, 2008 at 16:15
A couple of uniform titles on your list are in fact intended to be used for single works, but only make sense in conjunction with the 1XX field that accompanies them. These are "Constitution" and "Annual report".
The MARC field "240 Constitution" does not refer in itself to a single work, but the combination of "110 United States" with "240 Constitution" does refer to a single work. Similarly, "240 Annual report" by itself is not a single work, but in conjunction with "110 OCLC" it does refer to a single (serial) work.
Others on your list can refer to identifiable collections of works that are useful to treat as a single entity. For example, "240 Symphonies" in conjunction with "100 Beethoven, Ludwig van, 1770-1827" refers to a collection of 9 works which are often published together, and so are useful to refer to together. So that collection would usefully have its own identity.
I don't think you can discuss 240 headings without looking at their relationship with 1XX headings at all, just as you can't use 245 fields by themselves (since it's quite common for different works to have the same 245 title).
Response:
Yes, we always combine the 240 with a 1XX.
--Th
Posted by: Giles Martin | May 01, 2008 at 16:26
Out of curiosity, are you planning on only ignoring the 25 common uniform titles when they are used for more than one specific work?
For example, would you be ignoring "Symphonies", but NOT ignoring "Symphonies, no. 3, op. 55, Eb major"?
Based on the numbers, it looks like you are planning on ignoring the all-encompassing titles, rather than the titles for individual works, but I just wanted to be clear!
Response:
Yes, it would only be for those with just 'Symphonies'. As I mentioned in another comment, though, Symphonies is probably one that we won't use, along with a few of the others. We haven't quite finished analyzing these.
--Th
Posted by: Michelle Hahn | May 01, 2008 at 16:27
How are the computed clusters being persisted?
Do you store a control number in the $0 subfield?
Response:
OCLC bibliographic records are stored in an internal XML schema (we call it CDF for Common Data Format), and the work ID is just another field in that. In the Research tests I'm describing, we don't have a persistent identifier other than the author/title key we finally end up with after quite a bit of processing.
--Th
Posted by: Conal | May 01, 2008 at 18:08
Thom, I was really wondering whether this authority work was going to leak back out into the MARC world, or remain internal to OCLC. I realise the work is still experimental, but do you have plans to publish the clusters in MARC form?
Response:
I am not aware of any plans for publishing them (beyond WorldCat), but it is an interesting suggestion, and I'll give it some thought.
--Th
Posted by: Conal | May 06, 2008 at 17:46