Last night we controlled 63,479 Johann Sebastian Bachs in WorldCat, as part of a run that linked nearly 500,000 personal name headings to their associated LC-NACO authority records.
WorldCat has had the capability for years to link headings to authority records, but for various reasons we have never done a systematic linking of all the existing headings. I started a project a couple of months ago to attack at least the easier parts of the problem.
As part of the processing we do for WorldCat Identities we try to come up with LCCNs for as many of our names as possible, and when that fails we use identifiers from the Virtual International Authority File. Those identifiers let us link to J. S. Bach with URLs like http://worldcat.org/identities/lccn-n79-21425, instead of a text string based on the name. Most people (at least the ones that care at all about this) seem to prefer using the LCCN rather than the text string in the URLs. They look more permanent, and in general they are, but systems like Identities need to accommodate deletions, merges, and splits of authority records, no matter what the form of the URL, so the difference seems a matter of degree, not fundamental.
Getting more of WorldCat's headings linked to authority records has a number of benefits. It gives us a chance to merge some variant forms of headings and makes it easier to update the database when names change. This has become a substantial problem for us since LC changed their policy on adding death dates to headings. Right now we are working our way through a set of 26 million fairly easy headings: personal names that match an authority record on multiple subfields. If this works, we will look at controlling names that are harder to match.
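To make "match an authority record on multiple subfields" concrete, here is a minimal sketch of one way such a match could work. It is not our production matcher: the function names, the (code, value) representation of subfields, and the simplified normalization are all assumptions made for illustration.

```python
def normalize(value):
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    A simplified stand-in for authority-file normalization rules (assumption).
    """
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in value.lower())
    return " ".join(cleaned.split())


def heading_key(subfields):
    """Build a comparison key from (code, value) pairs.

    Returns None when fewer than two non-empty subfields are present,
    so single-subfield names are never treated as 'easy' matches.
    """
    parts = [(code, normalize(value)) for code, value in subfields if normalize(value)]
    if len(parts) < 2:
        return None
    return tuple(parts)


def build_index(authority_records):
    """Map each multi-subfield key to the identifiers that carry it."""
    index = {}
    for lccn, subfields in authority_records:
        key = heading_key(subfields)
        if key is not None:
            index.setdefault(key, []).append(lccn)
    return index


def match(subfields, index):
    """Return an identifier only when exactly one authority record matches."""
    key = heading_key(subfields)
    if key is None:
        return None
    candidates = index.get(key, [])
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical sample data; the identifier is the one from the Identities URL above.
authorities = [("n79-21425", [("a", "Bach, Johann Sebastian,"), ("d", "1685-1750.")])]
index = build_index(authorities)
bach = match([("a", "Bach, Johann Sebastian"), ("d", "1685-1750")], index)
```

Requiring at least two subfields and a unique candidate is the design choice that keeps this conservative: a bare name with no dates falls through and is left for the harder matching later.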
Unfortunately we do not have any authority files ready to link to other than the LC-NACO file, so names in records with non-English cataloging are ignored, as are subject headings that are not LCSH. We also are not controlling name-title headings. Connexion likes to control these as a unit, linking to a name-title authority record. For Identities we link the name part of the heading to the LCCN for that name. I'd be interested to know how other systems handle name-title linking.
Some might be interested in how the names are being linked. From the controlled copy of WorldCat that we use to generate Identities, Jenny Toves extracted the 26 million easily matched name headings and generated 128 files of approximately 200,000 headings each. To update WorldCat we wrote a small map-reduce job that starts up several Python programs that pretend to be Connexion clients. Each of these reads in a heading, retrieves the bibliographic record from Connexion, modifies it to control the heading, and then does a replace on the record. The replace locks the master record and updates it in the database. If we run into a problem with a record, we just skip it and continue. The 16 clients that were running last night were linking about 8 headings/second.
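The per-client loop described above can be sketched roughly as follows. This is an illustration only, not the actual code: `FakeClient`, its `fetch` and `replace` methods, and the dictionary record format are hypothetical stand-ins, and nothing here reflects the real Connexion interface.

```python
class FakeClient:
    """In-memory stand-in for a cataloging client (hypothetical, not Connexion)."""

    def __init__(self, records):
        self.records = records

    def fetch(self, record_id):
        # Raises KeyError when the record does not exist.
        return dict(self.records[record_id])

    def replace(self, record_id, record):
        # The real replace would lock and update the master record.
        self.records[record_id] = record


def control_heading(record, heading, lccn):
    """Return a copy of the record with the heading linked to the identifier."""
    if heading not in record["headings"]:
        raise ValueError("heading not found in record")
    links = dict(record.get("links", {}))
    links[heading] = lccn
    updated = dict(record)
    updated["links"] = links
    return updated


def process(client, work_items):
    """Read, fetch, modify, replace; skip any record that causes a problem."""
    linked, skipped = 0, 0
    for record_id, heading, lccn in work_items:
        try:
            record = client.fetch(record_id)
            client.replace(record_id, control_heading(record, heading, lccn))
            linked += 1
        except Exception:
            skipped += 1  # skip the record and move on
    return linked, skipped


# Hypothetical sample data: one good item, one missing record, one missing heading.
client = FakeClient({"1": {"headings": ["Bach, Johann Sebastian, 1685-1750"]}})
result = process(client, [
    ("1", "Bach, Johann Sebastian, 1685-1750", "n79-21425"),
    ("2", "Unknown, Name", "n0"),
    ("1", "Not in record", "n1"),
])
```

For a sense of scale: if the rate of about 8 headings/second held, 26 million headings would take roughly 3.25 million seconds, or a bit under 38 days of continuous running.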
--Th
Awesome. I happened to hear about this project yesterday. Brenda Block and Kathy Kie gave a web talk about WorldCat database quality for some Independent libraries. Will you attempt to control the names that appear on the thousands of article records entering WorldCat.org? Do you have a sense for the percentage of names in article records that might match up with existing name authority records? Do you ever use other details like classification numbers or subject headings to suggest possible matches for the more difficult names that appear in bib records? I am just curious about all this. Thanks.
Bryan
Response:
We are interested in controlling names in articles, but there are no plans in place and we don't even have a good estimate of the overlaps. Right now most of our matching is based on author and title. We do use some of those other clues for matching in VIAF, though.
--Th
Posted by: Bryan | April 23, 2008 at 16:12
How will you be dealing with undifferentiated names? Or names that match an authority but are the incorrect heading for the particular entity?
Response:
Right now all we are doing are the 'easy' headings (only 2% change their text); there is very little correction being done. In other processing (e.g. WorldCat Identities) we try to 'fix' headings by transforming them into a standard form. Undifferentiated names, however, just get lumped together for now.
--Th
Posted by: Adam Schiff | April 25, 2008 at 16:38
Very impressive. Are you doing anything different with matches on non-roman bib name forms, since these can now appear as 400s in LCNAF authorities?
Response:
I don't think I've ever seen non-Latin forms in a X00 field in WorldCat, but if they did, and if the authority record 400 matched, and it had multiple subfields, we should match them up (and probably eliminate the non-Latin, since the preferred form of the name in LCNAF is still going to be in Latin script).
--Th
Posted by: Stephen Hearn | April 30, 2008 at 11:04
1. Will this process be touching the Institutional records?
2. I'm also interested to hear what your plans are for unique names that have no dates or any other qualifying data. We have had some wild flips made through automated authority control in the past.
Response:
1. No, we are not touching the institutional records.
2. We expect to use a variant of how we match names for VIAF, which depends on having a title or other information match beyond a simple string match on the name.
--Th
Posted by: Deborah J. Leslie | April 30, 2008 at 15:49
"I don't think I've ever seen non-Latin forms in a X00 field in WorldCat, but if they did, and if the authority record 400 matched, and it had multiple subfields, we should match them up (and probably eliminate the non-Latin, since the preferred form of the name in LCNAF is still going to be in Latin script)."
There are lots of non-Roman headings in X00 fields in WorldCat, at least that's what they look like in the Cnx Client, as linked parallel fields to other X00s. Are you referring here to non-linked X00s in bibs, that is, ones that don't have hidden 880s? Given the multiple sources that OCLC is now pulling records from, I think it is only a matter of time before completely non-Roman records start appearing. Do you really want to flip those headings to Latin forms? Say the records are coming from the Russian National Library!
With article citation records, and records from other national libraries that don't follow AACR2, the environment "our" records live in is changing rapidly. "We" follow Model A for bibs, but will follow Model B for authority records. I can't put it all together in my mind: how will authority control work in this new heterogeneous environment?
Response:
I'm only talking about personal name headings in English language metadata (as indicated in the 040 $b). I think we already need authorities for other cataloging languages in WorldCat, but we do try to keep records as close to MARC-21 as we can.
--Th
Posted by: Diana Brooking | May 15, 2008 at 13:20