As WorldCat becomes more open (e.g. WorldCat.org), it is interesting to look at building links around it. One of the things I've been thinking about lately is something we're tentatively calling WorldCat Identities. WC Identities would create a summary page for each person referenced in WorldCat. It would be another way to view WorldCat.
One of the things we thought would be useful is a link into Wikipedia, especially if Wikipedia has an article about the person. So, I extracted some personal names out of WorldCat and tried to see if I could generate the appropriate links.
The first thing I found was that Wikipedia didn't want to respond to my Python urllib requests to view pages. Maybe they've had too many people attacking them using Python. I'm sure there is a way around that, but it turns out there is a much better way to test whether Wikipedia has an article or not. You can download a list of all the article titles (e.g. http://download.wikimedia.org/enwiki/20060810/enwiki-20060810-all-titles-in-ns0.gz from the English edition) and work with that. This list has some 2.3 million titles of one sort or another in it, and it didn't turn out to be too hard to convert many of the names found in the bibliographic records into article titles in Wikipedia.
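To give a feel for the title-list approach, here is a minimal sketch in Python. It is not the code I actually ran. It assumes the downloaded file has one title per line with underscores in place of spaces, and that a name heading looks like "Surname, Forenames, dates". The function names (load_titles, name_to_title, has_article) are made up for illustration, and the conversion rule is deliberately naive.

```python
import gzip
import re


def load_titles(path):
    """Read a gzipped title list (assumed: one title per line) into a set."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return {line.rstrip("\n") for line in fh}


def name_to_title(heading):
    """Turn 'Surname, Forenames, dates' into 'Forenames_Surname'.

    Hypothetical rule for illustration: any comma-separated piece that
    contains a digit is assumed to be a date and is dropped.
    """
    parts = [p.strip() for p in heading.split(",")]
    parts = [p for p in parts if p and not re.search(r"\d", p)]
    if not parts:
        return None
    if len(parts) == 1:
        name = parts[0]  # single-word names such as 'Homer'
    else:
        name = parts[1] + " " + parts[0]  # forenames first, then surname
    return name.replace(" ", "_")


def has_article(heading, titles):
    """True if the converted heading appears in the set of titles."""
    title = name_to_title(heading)
    return title is not None and title in titles
```

With the title set loaded once, each lookup is a set membership test, so checking thousands of names takes no network traffic at all. A rule this simple will miss names whose Wikipedia title differs in form from the heading, which is part of why going past the most popular names needs something more sophisticated.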
So far I've pulled three different sets of personal names from WorldCat: all the 100 fields (personal name main entry), all the 600 fields (personal name as subject) and a combination of all 100 and 700 (personal name added entry) fields. I then calculated a score for each unique name based on the number of WorldCat records it occurred in and the library holdings for those records. From that it is easy to find the most commonly used names and see if there is a corresponding Wikipedia article.
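The scoring step might look something like the following sketch. The formula here is purely an assumption for illustration (record count plus total holdings), and the input shape, a list of (names, holdings) pairs with one pair per bibliographic record, is hypothetical too. The function names score_names and top_names are made up.

```python
from collections import defaultdict


def score_names(records):
    """Score each name from (names, holdings) pairs, one pair per record.

    Assumed formula, for illustration only:
    score = number of records containing the name + total holdings
    of those records.
    """
    record_counts = defaultdict(int)
    holdings_totals = defaultdict(int)
    for names, holdings in records:
        for name in set(names):  # count a name at most once per record
            record_counts[name] += 1
            holdings_totals[name] += holdings
    return {n: record_counts[n] + holdings_totals[n] for n in record_counts}


def top_names(scores, n):
    """Return the n highest-scoring names, ties broken alphabetically."""
    return sorted(scores, key=lambda name: (-scores[name], name))[:n]
```

Any other way of weighting records against holdings would slot into the last line of score_names without changing the rest.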
Here is how many matches I found for the most popular names in the three sets:
|100's||100's & 700's||600's|
Nothing too surprising here, except maybe that with little effort I was able to match 97% of the 1,000 most popular authors in WorldCat to an article in Wikipedia. Going beyond the most popular 1,000 names would take a more sophisticated matching process. Even with the 1,000 most popular, the link is occasionally to a Wikipedia disambiguation page that could be avoided.
The names that didn't match Wikipedia seemed to be heavy on pre-20th century biblical scholars. The one name from the top 100 authors that wasn't in Wikipedia was Ernie Deane, a photographer who has lots of records in WorldCat, mostly from the Arkansas History Commission.