Uniform titles 2

We are changing our processing of 240's (uniform titles) in another way, beyond the skipping of generic ones described in the last post. In this case we are paying more attention to them.

We try to respect the LC/NACO Name Authority File (NAF). If we want to compare two titles and they both have name/title entries in the NAF, then we don't merge them into a single workset no matter how similar they are. We also use the uniform title for title comparisons (except possibly for some of the more generic ones). This works fine for titles that are in the authority file. However, many uniform titles are in the records as 240's but never get their own name/title record in the NAF. The practice seems to be that unless additional information needs to be attached to the title, the 240 in the record is sufficient.

The example we found was Dostoyevsky's Crime and Punishment versus The Notebooks for Crime and Punishment (actually Prestuplenie i nakazanie vs. Prestuplenie i nakazanie; neizdannye materialy). Our latest experiments in Research brought those together (because only Prestuplenie i nakazanie has a name/title record). The cataloging (with two distinct 240's) indicates that merging them is an error.

So, now we plan to treat 240's in LC records as authoritative.  That will clear up the Crime and Punishment problem, but we are still struggling to separate Mrs. Piggle Wiggle from Mrs. Piggle Wiggle's Farm without breaking too many other good matches.

Update:  We are seeing just under 86,000 new name/titles derived from the 1XX and 240 in LC records.  Some of them are in records that total thousands of library holdings.

 

--Th

FRBR and uniform titles

AACR2 lists four uses for uniform titles, but the most common is to group items that appear with multiple titles under a single heading. Works such as Don Quixote that are published in multiple languages and under hundreds of different titles benefit from this. Unfortunately, when trying to group manifestations into works, uniform titles do not always correspond to what anyone would consider a work.

We have been aware of this since we started trying to group bibliographic records into works (something we've dabbled in for nearly 20 years here at OCLC, and worked on seriously for half that time). My last post about controlling names was an unpleasant reminder of this, since the most popular 'work' presented under our newly controlled J.S. Bach records is actually there because of a MARC21 240 (uniform title) field containing Selections. Our current work clustering always uses the 240 in preference to the title proper reflected in the 245 (title statement) field. Music has its own highly developed approach to uniform titles, but similar groupings occur in other areas.

In Bach's case we found 1,429 different titles collocated under Selections. For some of these, Selections might be the best place to put them, but others, such as Switched-On Bach (by Carlos), have multiple manifestation records and a life of their own beyond simply Selections. Another case we've long known about is Treaties, etc., which groups treaties (e.g. Great Britain, Treaties, etc.). Although different treaties are obviously different works, that clustering somehow seems less surprising than hiding Switched-On Bach under Selections.

Some would probably argue that manifestations collected under Selections are really just themselves collections of works by Bach and some other mechanism is needed to get access to those works.  I don't think there are any easy answers to this problem, but we are going to try out (here in OCLC Research first) a fairly simple approach.  There are uniform titles that occur so many times that we consider them 'noise' titles for doing things like matching names.  For FRBR processing we are going to try ignoring the top 25 uniform titles.  Here they are, along with a count of how many times we see them in WorldCat:

3,125    SPEECHES
3,404    CANTATAS
3,873    QUARTETS\STRINGS
4,377    CHORAL MUSIC
4,662    CONSTITUTION
4,761    CHAMBER MUSIC
5,263    ESSAYS
5,428    OPERAS
5,535    SONATAS\PIANO
5,585    SYMPHONIES
7,016    ANNUAL REPORT
7,361    ORGAN MUSIC
8,333    VOCAL MUSIC
8,929    PLAYS
11,483   ORCHESTRA MUSIC
12,899   CORRESPONDENCE
13,191   INSTRUMENTAL MUSIC
14,811   SHORT STORIES
23,098   PIANO MUSIC
24,406   TREATIES ETC
26,234   SONGS
46,877   POEMS
58,303   LAWS ETC
59,210   WORKS
91,940   SELECTIONS

There are a number of other generic uniform titles beyond the top 25, but at that point we start to see uniform titles for works (e.g. The Book of Common Prayer is #26).
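The rule described above (prefer the 240 unless it is one of the top 25 generic uniform titles, otherwise fall back to the 245) can be sketched roughly as follows. This is only an illustration, not our production code: the names NOISE_UNIFORM_TITLES and comparison_title are made up for the example, and both titles are assumed to be already normalized to upper case.

```python
# Hypothetical noise list: the 25 generic uniform titles from the table above,
# in their normalized upper-case form.
NOISE_UNIFORM_TITLES = {
    "SPEECHES", "CANTATAS", "QUARTETS\\STRINGS", "CHORAL MUSIC",
    "CONSTITUTION", "CHAMBER MUSIC", "ESSAYS", "OPERAS", "SONATAS\\PIANO",
    "SYMPHONIES", "ANNUAL REPORT", "ORGAN MUSIC", "VOCAL MUSIC", "PLAYS",
    "ORCHESTRA MUSIC", "CORRESPONDENCE", "INSTRUMENTAL MUSIC",
    "SHORT STORIES", "PIANO MUSIC", "TREATIES ETC", "SONGS", "POEMS",
    "LAWS ETC", "WORKS", "SELECTIONS",
}


def comparison_title(title_240, title_245):
    """Pick the title used for work clustering.

    Assumption: a missing 240 is passed as None or an empty string.
    """
    if title_240 and title_240 not in NOISE_UNIFORM_TITLES:
        return title_240   # a specific uniform title still wins
    return title_245       # generic or missing 240: use the title proper


print(comparison_title("SELECTIONS", "SWITCHED ON BACH"))
print(comparison_title("PRESTUPLENIE I NAKAZANIE", "CRIME AND PUNISHMENT"))
```

With this rule Switched-On Bach is compared on its 245 rather than on Selections, while Crime and Punishment is still compared on its uniform title.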

This isn't our first abandonment of the 240 field.  WorldCat Identities originally preferred the 240 to the 245 for the work display. Unfortunately relatively few people benefited from seeing Prestuplenie i nakazanie instead of Crime and Punishment, so we switched to using the most common form of the 245 for display.

Note: The list of common uniform titles is in upper case because of normalization. In the past we normalized to lower case for ease of reading, but the latest version of PCC/NACO normalization uses Unicode mappings to normalize case, and since some of these mappings are only defined for conversion to upper case, we are following their guidelines and switching to upper case.
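A small illustration of why the direction of case folding matters, using Python's built-in string methods rather than the full PCC/NACO normalization (which does much more than change case). The function name fold_case is just for the example.

```python
def fold_case(text):
    """Fold case by converting to upper case (Unicode full case mapping)."""
    return text.upper()


# The German sharp s has no single-character upper-case form; upper-casing
# expands it to "SS", so the two spellings only meet in upper case.
print(fold_case("Straße") == fold_case("Strasse"))   # upper case: equal
print("Straße".lower() == "Strasse".lower())         # lower case: different
```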

--Th

Controlling names in WorldCat

Last night we controlled 63,479 Johann Sebastian Bach headings in WorldCat, as part of a run that linked nearly 500,000 personal name headings to their associated LC-NACO authority records.

WorldCat has had the capability for years to link headings to authority records, but for various reasons we have never done a systematic linking of all the existing headings.  I started a project a couple of months ago to attack at least the easier parts of the problem.

As part of the processing we do for WorldCat Identities we try to come up with LCCNs for as many of our names as possible, and when that fails we use identifiers from the Virtual International Authority File.  Those identifiers let us link to J. S. Bach with URLs like http://worldcat.org/identities/lccn-n79-21425, instead of a text string based on the name.  Most people (at least the ones that care at all about this) seem to prefer using the LCCN rather than the text string in the URLs.  They look more permanent, and in general they are, but systems like Identities need to accommodate deletions, merges, and splits of authority records, no matter what the form of the URL, so the difference seems a matter of degree, not fundamental.

Getting more of WorldCat's headings linked to authority records has a number of benefits. It gives us a chance to merge some variant forms of headings and makes it easier to update the database when names change. This has become a substantial problem for us since LC changed their policy on adding death dates to headings. Right now we are working our way through a set of 26 million fairly easy headings: personal names that match an authority record on multiple subfields. If this works, we will look at controlling names that are harder to match.

Unfortunately we do not have any authority files ready to link to other than the LC-NACO file, so names in records with non-English cataloging are ignored, as are subject headings that are not LCSH.  We also are not controlling name-title headings.  Connexion likes to control these as a unit, linking to a name-title authority record.  For Identities we link the name part of the heading to the LCCN for that name.  I'd be interested to know how other systems handle name-title linking.

Some might be interested in how the names are being linked. From the controlled copy of WorldCat that we use to generate Identities, Jenny Toves extracted the 26 million easily matched name headings and generated 128 files of approximately 200,000 headings each. To update WorldCat we wrote a small map-reduce job that starts up several Python programs that pretend to be Connexion clients. Each of these reads in a heading, retrieves the bibliographic record from Connexion, modifies it to control the heading, and then does a replace on the record. The replace locks the master record and updates it in the database. If we run into a problem with a record, we just skip it and continue. The 16 clients that were running last night were linking about 8 headings/second.
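For the curious, the shape of each client's loop is roughly the following. This is a simplified sketch, not the real job: fetch_record and replace_record are hypothetical stand-ins for the Connexion retrieve and replace operations, the record layout is invented, and an in-memory dictionary plays the part of the database.

```python
def split_into_batches(headings, batch_size):
    """Split a list of headings into consecutive batches (one per file)."""
    return [headings[i:i + batch_size]
            for i in range(0, len(headings), batch_size)]


def control_batch(batch, fetch_record, replace_record):
    """Control every heading in a batch, skipping records that cause problems.

    fetch_record and replace_record are hypothetical callables, not a real
    Connexion API. Returns (linked, skipped) counts.
    """
    linked = skipped = 0
    for record_id, heading, lccn in batch:
        try:
            record = fetch_record(record_id)
            record["controlled"][heading] = lccn   # link heading to authority
            replace_record(record_id, record)
            linked += 1
        except Exception:
            skipped += 1   # on any problem, skip the record and continue
    return linked, skipped


# Demo against an in-memory stand-in for the database.
database = {1: {"controlled": {}}, 2: {"controlled": {}}}
work = [
    (1, "Bach, Johann Sebastian, 1685-1750", "n79-21425"),
    (99, "Missing record", "n00-00000"),   # placeholder; fetch fails, skipped
    (2, "Bach, Johann Sebastian, 1685-1750", "n79-21425"),
]

totals = [control_batch(batch, database.__getitem__, database.__setitem__)
          for batch in split_into_batches(work, 2)]
print(totals)
```

As a rough scale check: if the rate stayed at about 8 headings/second, 26 million headings would take 26,000,000 / 8 = 3,250,000 seconds, or about 38 days.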

--Th

What people skip

Jenny and I have been looking at differences in the WorldCat.org FRBR clustering and the clustering we do here in Research. Ideally we'd like them to be the same, but we keep fussing with ours in Research, so we knew there would be differences.

Comparing two large sets of clusters isn't all that straightforward. With two implementations, sometimes the clusters are the same, sometimes one is just larger than the other, and very commonly each cluster has records that the corresponding cluster doesn't, so several clusters may be involved.

One of the first things we looked at was clusters from WorldCat.org that were larger than ours. We noticed several differences that imply some different title processing, one of which was a difference in how leading articles are handled. In general, when comparing a title that has a skip indicator (see note below) we drop the number of characters given by the indicator before comparing titles. In addition we drop 'the' and 'an', two common English articles. And yes, we know 'the' isn't an article in French; we still had to do it. We don't automatically skip 'a' because that is too commonly not an article (e.g. 'A A A'). The skip indicators in WorldCat are pretty reliable, but recognizing common articles at the beginning of titles that have not been manually skipped is still worth the trouble. Comparing the clusters, we noticed at least one case where it looked like WorldCat.org made a title match by dropping 'die' from a title, so we wondered if we could improve our handling of leading articles.
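In code, our current handling looks something like the sketch below. The function name and the exact precedence are assumptions made for the example: here the articles are only checked when the skip indicator is zero, and matching is case-insensitive.

```python
# Articles dropped even when the cataloger supplied no skip indicator.
# 'a' is deliberately absent (think of the title 'A A A').
DEFAULT_ARTICLES = ("the", "an")


def strip_leading_article(title, nonfiling=0):
    """Return the title as used for comparison.

    nonfiling is the MARC21 skip indicator: the number of leading
    characters to drop.
    """
    if nonfiling > 0:
        return title[nonfiling:]        # trust the cataloger's indicator
    lowered = title.lower()
    for article in DEFAULT_ARTICLES:
        prefix = article + " "          # require a following space
        if lowered.startswith(prefix):
            return title[len(prefix):]
    return title


print(strip_leading_article("The Notebooks for Crime and Punishment", 4))
print(strip_leading_article("An Essay on Man"))
print(strip_leading_article("A A A"))
```

Requiring the trailing space keeps titles such as 'Then and Now' or 'Anna Karenina' intact.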

So Jenny ran a scan of some WorldCat records counting all the text people have said to skip in the title fields we are interested in (245, 242, 130, and 740).  Here's a table showing the patterns that occurred more than 30,000 times in a 1/10 sample of WorldCat:

783,372    the
257,879    a
153,742    die
101,760    la
74,830    an
62,483    der
55,712    le
50,348    das
46,585    les
43,721    de            
42,217    l
32,548    el

That looked promising.  It would be nice to use a table derived from WorldCat to control this rather than an ad hoc table.  Next we looked at combining the language of the manifestation with the text skipped:

740,712    the (eng)
233,475    a (eng)
131,703    die (ger)
71,963    an (eng)
54,740    der (ger)
49,229    la (fre)
47,127    le (fre)
44,690    das (ger)
43,046    les (fre)
37,745    de (dut)
32,580    l (fre)
31,066    la (spa)
30,494    el (spa)

Now this starts to look like a useful list. We're going to try using it and see if it helps. Again, we can't really use 'a' in English, and probably not 'l' in French, so we'll probably just throw out the single-letter values. To give you some idea of the number of titles this would affect, with the list above (less 'a') we found about 12,000 titles changed in the 10% WorldCat sample, including 4,522 the (eng), 826 el (spa), and 782 le (fre). So, around 120,000 titles would be affected in all of WorldCat. I'd class that as a minor improvement, but probably worth doing.
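Here is a rough sketch of what a table-driven version might look like, built from the language-specific list above with the single-letter values thrown out. The names SKIP_TABLE and strip_article_for_language are invented for the example, and it assumes the record's language is available as a MARC21 three-letter code.

```python
# Leading words to drop, keyed by MARC21 language code (from the list above,
# minus the single-letter entries 'a' and 'l').
SKIP_TABLE = {
    "eng": {"the", "an"},
    "ger": {"die", "der", "das"},
    "fre": {"la", "le", "les"},
    "dut": {"de"},
    "spa": {"la", "el"},
}


def strip_article_for_language(title, language):
    """Drop a leading article only if it is listed for the record's language."""
    articles = SKIP_TABLE.get(language, set())
    first, space, rest = title.partition(" ")
    if space and rest and first.lower() in articles:
        return rest
    return title


print(strip_article_for_language("Die Zauberflöte", "ger"))
print(strip_article_for_language("Die Hard", "eng"))
```

Keying on language is what lets 'die' be dropped from a German title while leaving an English title that starts with the same word alone.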

Of course, it's interesting to look at some of the less commonly skipped strings:

euro 3 times
of the 4 times
1:00 AM 3 times

plus lots and lots of numbers: cardinal, ordinal, and Roman.

--Th

Notes:  For those not familiar with MARC21, some title fields carry along a single digit indicator which tells how many characters should be dropped from the beginning of the title for sorting (called nonfiling characters in librarian).

The three letter language codes are defined in the MARC21 documentation. It would be nice if they were the same as in ISO 639-2 Codes for the representation of names of languages-- Part 2: alpha-3 code, but nothing is quite that simple, even though LC maintains both.

Update (Feb 6, 2008).  Rebecca Guenther at LC pointed out that my note above about ISO 639-2 isn't correct.  Here is her explanation: The language codes in ISO 639-2/B are identical to those in MARC. There are 22 languages that have these alternative codes, called ISO 639-2/B and ISO 639-2/T ... . All the other languages in 639-2 are the same. So MARC is the same as 639-2 and in the cases of languages with alternative codes the MARC ones are the 639-2/B codes. The language names are not the same, but that is not what is being standardized-- it is the codes. In other words you don't need the MARC documentation to apply the 639-2 codes, you just need to use the set that is 639-2/B in these 22 cases.
