
WorldCat names in Wikipedia

As WorldCat becomes more open (e.g. WorldCat.org), it is interesting to look at building links around it.  One of the things I've been thinking about lately is something we're tentatively calling WorldCat Identities.  WC Identities would create a summary page for each person referenced in WorldCat.  It would be another way to view WorldCat.

One of the things we thought would be useful is a link into Wikipedia, especially if Wikipedia has an article about the person.  So, I extracted some personal names out of WorldCat and tried to see if I could generate the appropriate links.

The first thing I found was that Wikipedia didn't want to respond to my Python urllib requests to view pages.  Maybe they've had too many people attacking using Python.  I'm sure there is a way around that, but it turns out there is a much better way to test whether Wikipedia has an article or not.  You can download a list of all the article titles (e.g. http://download.wikimedia.org/enwiki/20060810/enwiki-20060810-all-titles-in-ns0.gz from the English edition) and work with that.  This list has some 2.3 million articles of one sort or another in it, and it didn't turn out to be too hard to convert many of the names found in the bibliographic records into article titles in Wikipedia.
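A minimal sketch of the title-list approach, assuming the dump is a gzipped file with one title per line and underscores in place of spaces.  The heading_to_title helper and its rules (put the forename first, drop dates) are illustrative guesses for this example, not the matching code actually used here.

```python
import gzip
import re


def load_titles(path):
    """Read a gzipped all-titles dump (one title per line) into a set."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f}


def heading_to_title(heading):
    """Guess a Wikipedia title from a heading like 'Twain, Mark, 1835-1910.'"""
    # Drop a final period unless it follows a single-letter initial.
    heading = re.sub(r"(?<=\w\w)\.$", "", heading.strip())
    parts = [p.strip() for p in heading.split(",") if p.strip()]
    # Drop trailing pieces that contain digits (birth and death dates).
    while len(parts) > 1 and re.search(r"\d", parts[-1]):
        parts.pop()
    if len(parts) >= 2:
        name = parts[1] + " " + parts[0]  # forename first
    else:
        name = parts[0]
    return name.replace(" ", "_")


def has_article(heading, titles):
    """True if the guessed title appears in the set of article titles."""
    return heading_to_title(heading) in titles
```

Holding the titles in a set keeps each lookup cheap, so testing millions of names is mostly a matter of the conversion rules.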

So far I've pulled three different sets of personal names from WorldCat:  All the 100 fields (personal name main entry), all the 600 fields (personal name as subject) and a combination of all 100 and 700 (personal name added entry) fields.  I then calculated a score for each unique name based on the number of WorldCat records it occurred in, and the library holdings for those records.  From that it is easy to find the most commonly used names and see if there is a corresponding Wikipedia article.
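As a rough illustration of that ranking step, here is a sketch that tallies records and holdings per name.  The input shape (a list of names plus a holdings count per record) and the ordering rule (total holdings, then record count) are assumptions made for the example; the actual scoring formula is not spelled out above.

```python
from collections import defaultdict


def rank_names(records):
    """records: iterable of (names, holdings) pairs, one per bibliographic record."""
    record_counts = defaultdict(int)
    holdings_totals = defaultdict(int)
    for names, holdings in records:
        for name in set(names):  # count a name once per record
            record_counts[name] += 1
            holdings_totals[name] += holdings
    # Assumed score: total holdings, with record count as the tie-breaker.
    return sorted(
        record_counts,
        key=lambda n: (holdings_totals[n], record_counts[n]),
        reverse=True,
    )
```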

Here is how many matches I found for the most popular names in the three sets:

               100's         100's & 700's   600's
Top 10         10            10              10
Top 100        99            99              100
Top 1,000      966           875             948
Total Names    11,231,731    17,955,544      1,769,781

Nothing too surprising here, except maybe that with little effort I was able to match 97% of the 1,000 most popular authors in WorldCat to an article in Wikipedia.  Going beyond the most popular 1,000 names would take a more sophisticated matching process.  Even with the 1,000 most popular, the link is occasionally to a Wikipedia disambiguation page that could be avoided.

The names that didn't match Wikipedia seemed to be heavy on pre-20th century biblical scholars.  The one name from the top 100 authors that wasn't in Wikipedia was Ernie Deane, a photographer who has lots of records in WorldCat, mostly from the Arkansas History Commission.

--Th

cElementTree

Someone reading the last post about UNIMARC might wonder why we aren't just doing everything in XML.  Ignoring the problem of transforming the UNIMARC into XML (I'm sure it's been done), my group has tended to avoid using XML for most of our bibliographic processing.  One reason is space.  Standard MARC21/slim XML is 3x the size of MARC Communications Format (MCF) records.  A copy of WorldCat would go from under 60 gigabytes to nearly 180 gigabytes.  An even more compelling reason was speed.  We do a lot in Python using the standard Python tools, and processing XML took about 6x as long as our routines that parse the same records in MCF.

So I was interested to notice that Python 2.5 (currently in beta-3) includes the core of ElementTree in it.  Here's what effbot.org says about ElementTree:

The Element type is a simple but flexible container object, designed to store hierarchical data structures, such as simplified XML infosets, in memory. The element type can be described as a cross between a Python list and a Python dictionary.

Even more interesting is that the Python distribution includes a C implementation of it (cElementTree).  Would that be fast enough to make XML worth the space problems?  What if we just compressed our XML files?  XML files are nearly 3x the size of MCF, but compressed they are 3x smaller than MCF.  I timed three possibilities, each reading in 1.4 million bibliographic records and extracting the title out of each record.  Here are the results, using cElementTree to parse the XML and our standard Python routines for MCF:

Format        Size in bytes    Run time (sec)   Recs/sec
MCF           1,372,366,461    455              3,081
XML           3,617,800,493    302              4,641
Zipped XML    401,668,287      343              4,087

cElementTree has a 10x speed advantage over ElementTree and supports the iterparse method, which lets you step through the file without reading it all in, sort of SAX-like.  In addition, ElementTree seems to be a more Pythonic way of manipulating XML than the standard SAX and DOM approaches.
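Here is a sketch of what the iterparse loop might look like for pulling titles out of MARC21/slim XML.  It is written for current Python, where the separate cElementTree module is gone and xml.etree.ElementTree picks up the C implementation on its own.  The function names and the sample records are made up for illustration, and gzip is assumed as the compression format; this is not the timing script itself.

```python
import gzip
import io
import xml.etree.ElementTree as ET  # uses the C accelerator when available

MARC_NS = "{http://www.loc.gov/MARC21/slim}"


def iter_titles(source):
    """Yield the 245 $a title (or None) for each record in a MARC21/slim stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != MARC_NS + "record":
            continue
        title = None
        for field in elem.iter(MARC_NS + "datafield"):
            if field.get("tag") == "245":
                for sub in field.iter(MARC_NS + "subfield"):
                    if sub.get("code") == "a":
                        title = sub.text
                        break
                break  # only look at the first 245 field
        yield title
        elem.clear()  # drop the record's children once it has been read


def titles_from_gzip(path):
    """Same as iter_titles, reading from a gzip-compressed file."""
    with gzip.open(path, "rb") as f:
        yield from iter_titles(f)


# Two made-up records, the second without a title field.
SAMPLE = b"""<collection xmlns="http://www.loc.gov/MARC21/slim">
<record><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Hamlet</subfield></datafield></record>
<record><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Twain, Mark</subfield></datafield></record>
</collection>"""

sample_titles = list(iter_titles(io.BytesIO(SAMPLE)))
```

Clearing each record after it is yielded is what keeps memory flat on a multi-gigabyte file.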

It looks to me as though cElementTree is a winner, especially if we go to the trouble of compressing our larger files.  Now if only it supported full XPath...

--Th

UNIMARC

I spent a couple of hours yesterday trying to read some UNIMARC records.  I found the UNIMARC Manual and started coding up a Python class that could read it in.  The record starts with a Record Label that needs to be decoded, followed by a directory, very similar to the Leader and Directory in MARC-21.  So, I'm merrily coding along and noticing that this is looking very similar to our class that imports OCLC MARC.  Very similar.  Actually, in terms of just getting the records read in, identical, other than the Unicode indicator (byte 9 of the leader) in MARC-21.  I felt a little dumb when this is what I ended up with:

# inherit from OCLC MARC
class UniMarc(omarc.OMarc):
    def isUnicode(self):
        return True  # just assume it is!

Of course the field tags are all different and I'm sure there are lots of subtle differences, but basically code that deals with MARC Communications Format records can read the UNIMARC records.  I suppose this is common knowledge for many, but I was surprised.  Has anyone tried to document what the differences are, especially in fixed-field elements?
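For anyone who hasn't looked inside one of these records, here is a rough sketch of the shared structure: a 24-byte label/leader, then a directory of 12-byte entries (3-byte tag, 4-byte field length, 5-byte starting offset), then the field data.  This is not the omarc class; parse_record and build_record are hypothetical names, and build_record exists only to make a tiny test record.

```python
FIELD_END = b"\x1e"   # field terminator
RECORD_END = b"\x1d"  # record terminator


def parse_record(raw):
    """Split one record (bytes) into its leader, stated length, and (tag, data) pairs."""
    leader = raw[:24]
    record_length = int(leader[0:5])
    base_address = int(leader[12:17])     # where the field data begins
    directory = raw[24:base_address - 1]  # leave off the directory's terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        data = raw[base_address + start:base_address + start + length]
        fields.append((tag, data.rstrip(FIELD_END)))
    return leader, record_length, fields


def build_record(fields, template=b"00000nam a2200000 a 4500"):
    """Assemble a tiny record for demonstration (hypothetical helper)."""
    directory = b""
    body = b""
    for tag, data in fields:
        data += FIELD_END
        directory += b"%s%04d%05d" % (tag.encode("ascii"), len(data), len(body))
        body += data
    directory += FIELD_END
    base = 24 + len(directory)
    total = base + len(body) + 1  # plus the record terminator
    leader = b"%05d" % total + template[5:12] + b"%05d" % base + template[17:]
    return leader + directory + body + RECORD_END


# 200 is the UNIMARC title field; \x1f introduces a subfield code.
raw = build_record([("001", b"12345"), ("200", b"1 \x1faHamlet")])
leader, record_length, fields = parse_record(raw)
```

Nothing in parse_record looks at the tags themselves, which is why the same reader handles both formats.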

--Th

WorldCat.org

Fred Kilgour

'Mr. Kilgour' was very much a presence at OCLC when I joined it in 1977.  I didn't have a lot of contact with him before he retired, but I do remember him coming over to Development and sitting down with programmers trying to get ILL to work the way he wanted it to.  After he 'retired' we were both working on electronic publishing projects, so I got to know him better.  One of the amazing things about him was that he was 53 years old when he founded OCLC (57 when OCLC went online) and stayed active in the field for another 40 years.

One more anecdote about Fred, again about ILL.  The story goes that he asked the staff to figure out how much it would cost to process an ILL transaction.  So they tried to estimate the storage needed, how many I/O's would be done, memory, and cpu time, and came up with a figure of something like $3/ILL (computers were expensive in the 70's).  'Too high' said Fred and set the price at $0.95.

One of the main things I remember about Fred was his stubbornness.  He had to be about the most stubborn intelligent person I've ever met.  He was the perfect combination of vision and stubbornness that was needed to get OCLC going when, given the level of technology and state of libraries at the time, it was just about impossible.  But he did it.

--Th

PURL news

The PURL server is scheduled to go down at 3:30 p.m. EDT this afternoon (August 1st, 2006).  With a little luck it will be back up a half hour later on a new server.  We are migrating from an old Sun machine to a new Linux server which should have more room for log files as well as be easier to maintain.

We apologize for the outage, but need the time to get the latest log files transferred.  The ability to create new PURLs has been off for a few days while we were getting ready for this move, and should return this afternoon.

--Th
