Last night we controlled 63,479 Johann Sebastian Bachs in WorldCat, as part of a run that linked nearly 500,000 personal name headings to their associated LC-NACO authority records.
WorldCat has had the capability for years to link headings to authority records, but for various reasons we have never done a systematic linking of all the existing headings. I started a project a couple of months ago to attack at least the easier parts of the problem.
As part of the processing we do for WorldCat Identities we try to come up with LCCNs for as many of our names as possible, and when that fails we use identifiers from the Virtual International Authority File. Those identifiers let us link to J. S. Bach with URLs like http://worldcat.org/identities/lccn-n79-21425, instead of a text string based on the name. Most people (at least the ones that care at all about this) seem to prefer using the LCCN rather than the text string in the URLs. They look more permanent, and in general they are, but systems like Identities need to accommodate deletions, merges, and splits of authority records, no matter what the form of the URL, so the difference seems a matter of degree, not fundamental.
Getting more of WorldCat's headings linked to authority records has a number of benefits. It gives us a chance to merge some variant forms of headings and makes it easier to update the database when names change. This has become a substantial problem for us since LC changed their policy on adding death dates to headings. Right now we are working our way through a set of 26 million fairly easy headings: personal names that match an authority record on multiple subfields. If this works, we will look at controlling names that are harder to match.
Unfortunately we do not have any authority files ready to link to other than the LC-NACO file, so names in records with non-English cataloging are ignored, as are subject headings that are not LCSH. We also are not controlling name-title headings. Connexion likes to control these as a unit, linking to a name-title authority record. For Identities we link the name part of the heading to the LCCN for that name. I'd be interested to know how other systems handle name-title linking.
Some might be interested in how the names are being linked. From the controlled copy of WorldCat that we use to generate Identities, Jenny Toves extracted the 26 million easily matched name headings and generated 128 files of approximately 200,000 headings each. To update WorldCat we wrote a small map-reduce job that starts up several Python programs that pretend to be Connexion clients. Each of these reads in a heading, retrieves the bibliographic record from Connexion, modifies it to control the heading, and then does a replace on the record. The replace locks the master record and updates it in the database. If we run into a problem with a record, we just skip it and continue on. The 16 clients that were running last night were linking about 8 headings/second.
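For readers who want to picture the per-client loop, here is a minimal sketch of the read, retrieve, modify, replace cycle with skip-on-error. It is not our actual code: the three callables (`fetch_record`, `control_heading`, `replace_record`), the `(record_id, heading, lccn)` input layout, and the `RunStats` type are all hypothetical stand-ins for whatever a Connexion client really does.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Counts of headings linked and skipped by one client (hypothetical type)."""
    linked: int = 0
    skipped: int = 0


def control_headings(headings, fetch_record, control_heading, replace_record):
    """Process one input file's worth of headings.

    `headings` yields (record_id, heading, lccn) tuples; this layout is an
    assumption. The three callables are hypothetical stand-ins for the
    Connexion operations: retrieve a bibliographic record, link one heading
    to its authority record, and replace the master record.
    """
    stats = RunStats()
    for record_id, heading, lccn in headings:
        try:
            record = fetch_record(record_id)
            updated = control_heading(record, heading, lccn)
            replace_record(record_id, updated)
        except Exception:
            # Any problem with a record: skip it and keep going.
            stats.skipped += 1
            continue
        stats.linked += 1
    return stats


def days_to_finish(total_headings, headings_per_second):
    """Days of continuous running needed at a steady rate."""
    return total_headings / headings_per_second / 86_400
```

If 8 headings/second is the combined rate of all 16 clients and it held steady, `days_to_finish(26_000_000, 8)` works out to a bit under 38 days of continuous running for the full set.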