FRBR and uniform titles

AACR2 lists four uses for uniform titles, but the most common is to group items that appear with multiple titles under a single heading.  Works such as Don Quixote, which are published in multiple languages and under hundreds of different titles, benefit from this.  Unfortunately, when trying to group manifestations into works, uniform titles do not always correspond to what anyone would consider a work.

We have been aware of this since we started trying to group bibliographic records into works (something we've dabbled in for nearly 20 years here at OCLC, and worked on seriously for half that time).  My last post about controlling names was an unpleasant reminder of this, since the most popular 'work' presented under our newly controlled J.S. Bach records is actually there because of a MARC21 240 (uniform title) field containing Selections.  Our current work clustering always uses the 240 in preference to the title proper reflected in the 245 (title statement) field.  Music has its own highly developed approach to uniform titles, but similar groupings occur in other areas.

In Bach's case we found 1,429 different titles collocated under Selections.  For some of these, Selections might be the best place to put them, but others, such as Switched-On Bach (by Carlos), have multiple manifestation records and a life of their own beyond simply Selections.  Another case we've long known about is Treaties, etc., which groups treaties (e.g. Great Britain, Treaties, etc.).  Although different treaties are obviously different works, that clustering somehow seems less surprising than hiding Switched-On Bach under Selections.

Some would probably argue that manifestations collected under Selections are really just themselves collections of works by Bach and some other mechanism is needed to get access to those works.  I don't think there are any easy answers to this problem, but we are going to try out (here in OCLC Research first) a fairly simple approach.  There are uniform titles that occur so many times that we consider them 'noise' titles for doing things like matching names.  For FRBR processing we are going to try ignoring the top 25 uniform titles.  Here they are, along with a count of how many times we see them in WorldCat:

3,125    SPEECHES
3,404    CANTATAS
3,873    QUARTETS\STRINGS
4,377    CHORAL MUSIC
4,662    CONSTITUTION
4,761    CHAMBER MUSIC
5,263    ESSAYS
5,428    OPERAS
5,535    SONATAS\PIANO
5,585    SYMPHONIES
7,016    ANNUAL REPORT
7,361    ORGAN MUSIC
8,333    VOCAL MUSIC
8,929    PLAYS
11,483   ORCHESTRA MUSIC
12,899   CORRESPONDENCE
13,191   INSTRUMENTAL MUSIC
14,811   SHORT STORIES
23,098   PIANO MUSIC
24,406   TREATIES ETC
26,234   SONGS
46,877   POEMS
58,303   LAWS ETC
59,210   WORKS
91,940   SELECTIONS

There are a number of other generic uniform titles beyond the top 25, but at that point we start to see uniform titles for works (e.g. The Book of Common Prayer is #26).
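
The approach described above can be sketched roughly as follows. This is only an illustration, not our production clustering code: the function names, the crude `normalize` stand-in, and the shortened noise list are all assumptions made for the example.

```python
# Hypothetical, shortened noise list taken from the table above.
NOISE_UNIFORM_TITLES = {
    "SELECTIONS", "WORKS", "LAWS ETC", "POEMS", "SONGS", "TREATIES ETC",
}


def normalize(title):
    """Crude stand-in for real normalization (assumption, not PCC/NACO):
    replace punctuation with spaces, collapse whitespace, uppercase."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in title)
    return " ".join(cleaned.upper().split())


def work_title(title_245, title_240=None):
    """Pick the title used to cluster a record into a work.

    Prefer the 240 (uniform title) unless it is missing or is one of the
    generic 'noise' titles; in that case fall back to the 245.
    """
    if title_240:
        key = normalize(title_240)
        if key and key not in NOISE_UNIFORM_TITLES:
            return key
    return normalize(title_245)


if __name__ == "__main__":
    print(work_title("Switched-on Bach", "Selections"))
    print(work_title("The ingenious gentleman Don Quixote", "Don Quijote"))
```

With a list like this, a record whose 240 is Selections clusters on its own 245, while a record with a real uniform title still clusters on the 240.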

This isn't our first abandonment of the 240 field.  WorldCat Identities originally preferred the 240 to the 245 for the work display. Unfortunately relatively few people benefited from seeing Prestuplenie i nakazanie instead of Crime and Punishment, so we switched to using the most common form of the 245 for display.
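
Choosing the most common form of the 245 is simple to sketch. The function name and the input (a plain list of 245 title strings for one work) are assumptions for the example; the real Identities code is not shown here.

```python
from collections import Counter


def display_title(titles_245):
    """Return the most frequent 245 title string for a work, or None.

    Ties go to whichever form was seen first, because Counter.most_common
    keeps first-encountered order among equal counts.
    """
    counts = Counter(t.strip() for t in titles_245 if t.strip())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    titles = [
        "Crime and punishment",
        "Prestuplenie i nakazanie",
        "Crime and punishment",
    ]
    print(display_title(titles))
```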

Note: The list of common uniform titles is in upper case because of normalization.  In the past we normalized to lower case for ease of reading, but the latest version of PCC/NACO normalization uses Unicode mappings to normalize case, and since some of these mappings are only available into uppercase, we are following their guidelines and switching to it.

--Th

Update (3 December 2008): We couldn't stand the uppercase, so after we've done the normalization we now 'lower' the characters that have a lower case character associated with them.
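
As a rough sketch of just that case step (not the full PCC/NACO normalization, and using Python's built-in Unicode case mappings as an assumed stand-in for the mappings we actually use):

```python
def fold_case(text):
    """Uppercase first, then lowercase every character that has a lowercase form.

    Going through uppercase matters for characters whose mapping only exists
    in that direction, e.g. the German sharp s, which uppercases to 'SS'.
    Characters with no lowercase form (digits, punctuation) pass through.
    """
    upper = text.upper()
    return "".join(ch.lower() for ch in upper)


if __name__ == "__main__":
    print(fold_case("SELECTIONS"))
    print(fold_case("Straße"))
```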

The list of uniform titles to ignore hasn't changed much, except that 'quartets\strings', 'sonatas\piano' and 'symphonies' have been removed. For non-240 titles we have a longer list generated algorithmically. (For VIAF name matching we have similar lists of titles we don't trust to bring names together, one for each authority file we are processing.)

Controlling names in WorldCat

Last night we controlled 63,479 Johann Sebastian Bachs in WorldCat, as part of a run that linked nearly 500,000 personal name headings to their associated LC-NACO authority records.

WorldCat has had the capability for years to link headings to authority records, but for various reasons we have never done a systematic linking of all the existing headings.  I started a project a couple of months ago to attack at least the easier parts of the problem.

As part of the processing we do for WorldCat Identities we try to come up with LCCNs for as many of our names as possible, and when that fails we use identifiers from the Virtual International Authority File.  Those identifiers let us link to J. S. Bach with URLs like http://worldcat.org/identities/lccn-n79-21425, instead of a text string based on the name.  Most people (at least the ones that care at all about this) seem to prefer using the LCCN rather than the text string in the URLs.  They look more permanent, and in general they are, but systems like Identities need to accommodate deletions, merges, and splits of authority records, no matter what the form of the URL, so the difference seems a matter of degree, not fundamental.

Getting more of WorldCat's headings linked to authority records has a number of benefits.  It gives us a chance to merge some variant forms of headings and makes it easier to update the database when names change.  This has become a substantial problem for us since LC changed their policy on adding death dates to headings.  Right now we are working our way through a set of 26 million fairly easy headings: personal names that match an authority record on multiple subfields.  If this works, we will look at controlling names that are harder to match.

Unfortunately we do not have any authority files ready to link to other than the LC-NACO file, so names in records with non-English cataloging are ignored, as are subject headings that are not LCSH.  We also are not controlling name-title headings.  Connexion likes to control these as a unit, linking to a name-title authority record.  For Identities we link the name part of the heading to the LCCN for that name.  I'd be interested to know how other systems handle name-title linking.

Some might be interested in how the names are being linked.  From the controlled copy of WorldCat that we use to generate Identities, Jenny Toves extracted the 26 million easily matched name headings and generated 128 files of approximately 200,000 headings each.  To update WorldCat we wrote a small map-reduce job that starts up several Python programs that pretend to be Connexion clients.  Each of these reads in a heading, retrieves the bibliographic record from Connexion, modifies it to control the heading, and then does a replace on the record.  The replace locks the master record and updates it in the database.  If we run into a problem with a record, we just skip it and continue.  The 16 clients that were running last night were linking about 8 headings/second in total.
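
The shape of that job can be sketched as below. This is a minimal illustration, not the actual program: `fetch`, `control` and `replace` are hypothetical callables standing in for the Connexion operations, and the map-reduce plumbing is left out entirely.

```python
def chunk(headings, size):
    """Split a list of headings into consecutive batches of at most `size`."""
    return [headings[i:i + size] for i in range(0, len(headings), size)]


def run_client(batch, fetch, control, replace):
    """Process one batch, skipping any record that causes a problem.

    fetch(heading)           -> bibliographic record   (hypothetical)
    control(record, heading) -> record with the heading linked (hypothetical)
    replace(record)          -> writes the record back  (hypothetical)
    """
    linked, skipped = 0, 0
    for heading in batch:
        try:
            record = fetch(heading)
            replace(control(record, heading))
            linked += 1
        except Exception:
            # Mirror the "just skip it and continue" policy.
            skipped += 1
    return linked, skipped


def days_to_finish(total_headings, headings_per_second):
    """Days of continuous running needed at a constant overall rate."""
    return total_headings / headings_per_second / 86400


if __name__ == "__main__":
    print(round(days_to_finish(26_000_000, 8), 1))
```

If the overall rate stayed at about 8 headings/second, the 26 million headings would need something like 38 days of continuous running, so the number of clients is the obvious thing to tune.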

--Th
