
Identities upgrades

We've been spending quite a bit of time on WorldCat Identities lately, and have just put up a new version of the database.  Typically we pull a copy of WorldCat every six months for research purposes, but WorldCat has been growing so quickly that a quarterly update seemed worthwhile.  Identities now reflects information from WorldCat up to the end of March, and the authority file information has been updated as well.

Here are some other changes:

  • Corporate names show up in personal identities as related names (and personal names in corporate identities)
  • Corporate name links actually work
  • More Wikipedia links (doubled to 45,000)
  • 600,000 links to the German national authority file
  • Library counts for citations are now how many libraries hold at least one edition of a work, rather than counting each edition at each library separately
  • Timeline colors are more distinct
  • Formatting is more consistent across different browsers
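
The change in library counts is easy to show with a small sketch. The holdings data and function names here are invented for illustration and are not the actual Identities data model; the point is only the difference between counting every edition at every library and counting each library once.

```python
# Hypothetical holdings for one work: (library symbol, edition id) pairs.
holdings = [
    ("LIB_A", "ed1"),
    ("LIB_A", "ed2"),
    ("LIB_B", "ed1"),
    ("LIB_C", "ed3"),
    ("LIB_C", "ed1"),
]


def count_edition_holdings(pairs):
    """Old-style count: each edition held at each library counts separately."""
    return len(set(pairs))


def count_holding_libraries(pairs):
    """New-style count: a library counts once if it holds at least one edition."""
    return len({library for library, _edition in pairs})
```

With this sample data the old-style count is 5 and the new-style count is 3, since LIB_A and LIB_C each hold two editions of the same work.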

Having corporate names in the related names section for a person looks useful to me.  For example, Lorcan Dempsey's page lists three: UK Office for Library and Information Networking, the Electronic Libraries Programme, and the OCLC Office of Research.  I didn't miss these when they weren't there, but now that they are, they look indispensable.

The links into the German authority file are really just the start of what we would like to do with national authority files (much of it through the VIAF project).  In fact, the latest Wikipedia links used some of the preliminary VIAF files to do the matches.  We are allowing multiple links, so a page like Marilyn Monroe's has a link both to the main Wikipedia page about her and to a page about her death.

Probably the next big change we are planning is to do a more intense FRBRization of the records, bringing more names and titles together, which should help the whole system.  We also hope to experiment with FAST subject headings, possibly showing a cloud of subjects associated with each Identity.

--Th

Relator codes

João Alberto de Oliveira Lima sent me a message wondering who in WorldCat Identities has the most roles registered.  João gave the example of Aldo Manuzio, a Venetian printer, who has six roles listed (or will after we clean up the metadata a little).  It turns out that Alan Lomax can more than double that, with 16 roles (compiler, editor, collector, performer, interviewer, director, vocalist, narrator, recording engineer, writer of accompanying material, arranger, speaker, interviewee, photographer, commentator and librettist).

It is often interesting to know why someone has a particular role associated with them, but since this information is not indexed in the WorldCat.org system, it has been nearly impossible to search for.  Now each of the roles is a link to WorldCat.org with a list of up to 20 OCLC numbers that have the identity in that role.  The number that pops up when you hover over a role is the number of library holdings associated with the items with that role.
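
As a rough sketch of how such role links could be assembled (this is not the Identities code; the record layout and the function name are assumptions), the records for one identity can be grouped by role, keeping at most 20 OCLC numbers per role for the link. The sketch also assumes the hover number is the holdings total over all items with that role, not just the linked ones.

```python
from collections import defaultdict

MAX_LINKED = 20  # each role links to a list of up to 20 OCLC numbers


def build_role_index(records):
    """Group one identity's records by role.

    records: iterable of (oclc_number, role, holdings_count) tuples
    (a hypothetical layout). Returns {role: (oclc_numbers, total_holdings)}.
    """
    numbers = defaultdict(list)
    holdings = defaultdict(int)
    for oclc_number, role, count in records:
        holdings[role] += count  # assumption: total over every item with the role
        if len(numbers[role]) < MAX_LINKED:
            numbers[role].append(oclc_number)
    return {role: (numbers[role], holdings[role]) for role in numbers}
```

For example, `build_role_index([(1, "editor", 10), (2, "editor", 5)])` gives one entry for "editor" with both OCLC numbers and a holdings total of 15.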

Here are some other identities with large numbers of roles:

I was a little surprised that the list has so many well-known people.

--Th

New cluster

We use clusters of machines to speed up much of our bibliographic processing research here in OCLC Research.  Our current clusters have been invaluable, but we were running into some limits on their capabilities, and last week we started moving to a faster and larger cluster.

The new cluster has 33 nodes (32 computational and one head node), each with two dual-core Xeon processors, 16 gigabytes of RAM, 1.5 terabytes of disk and a one-gigabit network connection.  This gives us substantially more processing power and memory, and much more disk storage, than we had before (one of the projects we are planning involves loading a record for every holding in WorldCat, over 1.1 billion records).  One of the best things about having a total of 512 gigabytes of RAM on the computational nodes is that WorldCat's 150 gigabytes of XML pretty much stays cached, so we can grep through it in about 6 seconds per node, or about 3 minutes for a full sequential scan.
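
The figures above hang together, as a little arithmetic shows. The sketch assumes the XML is split evenly across the 32 computational nodes and that a "full sequential scan" means visiting the nodes one after another; neither assumption is stated in so many words above.

```python
compute_nodes = 32
ram_per_node_gb = 16
worldcat_xml_gb = 150
grep_seconds_per_node = 6

# Total memory on the computational nodes: 32 * 16 = 512 GB.
total_ram_gb = compute_nodes * ram_per_node_gb

# Assumption: the XML is spread evenly, so each node caches about 4.7 GB.
xml_per_node_gb = worldcat_xml_gb / compute_nodes

# Implied scan rate from cache on one node, roughly 0.78 GB per second.
scan_rate_gb_per_s = xml_per_node_gb / grep_seconds_per_node

# Assumption: nodes are scanned one after another: 32 * 6 = 192 s, about 3 minutes.
sequential_scan_seconds = compute_nodes * grep_seconds_per_node
sequential_scan_minutes = sequential_scan_seconds / 60
```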

Actually parsing the records using Python and building our bibliographic objects takes quite a bit longer, but we can read, parse and do some minimal processing on all 83+ million WorldCat records in about 12 minutes using our Python implementation of map-reduce.
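
For readers who have not seen the pattern, here is a minimal single-process sketch of map-reduce in Python. It is not our implementation: the record format, element names and function names are all made up for illustration, and the part that matters on a cluster (distributing the map step across nodes and merging the results) is left out entirely.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, heavily simplified records; real WorldCat records are far richer.
RECORDS = [
    "<record><lang>eng</lang><title>A</title></record>",
    "<record><lang>ger</lang><title>B</title></record>",
    "<record><lang>eng</lang><title>C</title></record>",
]


def map_record(xml_text):
    """Map step: parse one record and emit (key, value) pairs."""
    record = ET.fromstring(xml_text)
    yield record.findtext("lang", default="und"), 1


def reduce_counts(key, values):
    """Reduce step: combine all the values emitted for one key."""
    return key, sum(values)


def map_reduce(records, mapper, reducer):
    """Run the mapper over every record, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for text in records:
        for key, value in mapper(text):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


language_counts = map_reduce(RECORDS, map_record, reduce_counts)
```

Here the minimal processing is just counting records per language code, which gives two for "eng" and one for "ger" on the sample data.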

One of the biggest improvements is that the new cluster runs 64-bit Linux (2.6.9-42.0.2.ELsmp (x86_64)), using Rocks to control the whole system.  We have more and more programs for which two gigabytes of memory is becoming a real limitation, and having more memory available simplifies programming, especially with WorldCat growing so quickly.

The graphic at the top is a display from the Ganglia monitoring system that comes with Rocks, showing all the cluster's compute nodes running at full speed.

--Th

Identity timelines

I really like the timelines in WorldCat Identities, and it seems that others do too.  So we're starting to get suggestions.  One of Jakob's was to color the part of the timeline when someone is alive differently than the rest.  Another idea about timelines is to differentiate publications by the identity from those about the identity.  We've implemented both of those ideas and I'd be interested in reactions.
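
The two ideas amount to tagging each year on the timeline with a little extra data. The sketch below is only an illustration of that, not the code behind the timelines (which are generated with XSLT); the function name and the input layout are assumptions.

```python
from collections import Counter


def build_timeline(publications, birth_year, death_year):
    """Summarize publications per year for a timeline display.

    publications: iterable of (year, kind) pairs, where kind is "by" or
    "about" (a hypothetical layout). Returns
    {year: {"by": n, "about": n, "alive": bool}} for each year that has
    at least one publication.
    """
    counts = Counter(publications)
    years = sorted({year for year, _kind in counts})
    return {
        year: {
            "by": counts[(year, "by")],
            "about": counts[(year, "about")],
            # Years within the lifespan can be drawn in a different color.
            "alive": birth_year <= year <= death_year,
        }
        for year in years
    }
```

For Henry James (1843-1916), a work by him from 1881 would fall in an "alive" year, while a study about him from 1950 would not.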

Much of the credit for the timelines needs to go to J. D. Shipengrover, our Web designer here in OCLC Research, Ralph LeVan who did the first version of the XSLT, and Tom Dehn who has learned more about the differences between Firefox and Internet Explorer than he ever wanted to know.

--Th
