
The July/August edition of D-Lib Magazine is out.  Jeff Young and I have an article explaining the WikiD software Jeff has been working on.  This started out as a tutorial on the fundamentals of OpenURL with WikiD as the example, but morphed into explaining how WikiD fits within the very general framework that OpenURL 1.0 provides.

One of Jeff's latest insights is that OpenURL allows the possibility of using very friendly URLs rather than the more arcane ones that more closely follow how the standard is structured.  This makes it fairly easy to support a wide variety of web services with a clear separation of the fairly generic web service support and the application-specific code.  Without such a framework the resulting web services often share little code because of the tendency to mix application-specific needs into the infrastructure.
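To make the contrast concrete, here is a minimal sketch of how a friendly path might be expanded into the key/encoded-value (KEV) form of an OpenURL 1.0 request.  The path layout, the `info:example/...` identifiers, and the function name are all hypothetical and are not WikiD's actual scheme; only `url_ver=Z39.88-2004` and the `rft_id` and `svc_id` key names come from the standard.

```python
from urllib.parse import urlencode, parse_qs

def friendly_to_kev(path):
    """Expand a hypothetical friendly path '/collection/item' into a KEV query string."""
    collection, item = path.strip("/").split("/", 1)
    params = [
        ("url_ver", "Z39.88-2004"),                        # OpenURL 1.0 version key
        ("rft_id", f"info:example/{collection}/{item}"),   # hypothetical referent identifier
        ("svc_id", "info:example/svc/get"),                # hypothetical service identifier
    ]
    return urlencode(params)

query = friendly_to_kev("/books/12345")
print(query)
```

The generic layer only has to do this kind of expansion once; the application-specific code then sees an ordinary OpenURL request however the URL was originally written.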

--Th

Translations in WorldCat

We occasionally pull information from WorldCat for people with special interests who can't get it any other way.  Recently we had a request for 'all the records translated from Japanese'.  It turns out that there are something like 34,000 records in WorldCat with a MARC 041 field indicating Japanese as one of the original languages that had been translated into English.  After some discussion it turned out that the requestor is really interested in Japanese literature, primarily translated into English, French, German, or Russian.  Combining the information from the 041 $a & $h with the Literary Form (008/33; I ignored the 006) probably missed quite a few records that aren't coded quite right, but it did result in a manageable file.  Here are the language counts we found:

  3138 eng jpn
  3129 chi jpn
  1146 kor jpn
   561 fre jpn
   391 spa jpn
   177 rus jpn
   117 ger jpn
   109 fin jpn
    80 jpn jpn
    50 ita jpn
    34 vie jpn
    27 pol jpn
    20 por jpn
    20 ara jpn
    16 heb jpn
    12 dut jpn
    10 tha jpn
    10 cze jpn
     9 swe jpn
     7 rum jpn
     7 per jpn
     7 epo jpn
     7 dan jpn
     5 hun jpn
     5 gre jpn
     5 alb jpn
     4 urd jpn
     4 ind jpn
     4 hin jpn
     3 tur jpn
     3 slo jpn
     3 scr jpn
     3 may jpn
     3 enf jpn
     3 egn jpn
     2 ukr jpn
     2 tgl jpn
     2 tel jpn
     2 slv jpn
     2 est jpn
     2 che jpn
     2 bur jpn
     2 ben jpn
     1 unr jpn
     1 und jpn
     1 uig jpn
     1 tag jpn
     1 run jpn
     1 pan jpn
     1 mul jpn
     1 mal jpn
     1 lao jpn
     1 khm jpn
     1 kaz jpn
     1 ilo jpn
     1 ice jpn
     1 hum jpn
     1 gle jpn
     1 bul jpn
     1 baq jpn
     1 arm jpn

I suppose this sort of retrieval is needed so rarely that it's not surprising we don't index the 041 $h, but I've always felt hampered not being able to use the data in the records.  A couple of times we've done databases that included indexing down to all the field/subfield combinations, but nothing like that has come close to production.  Anyone out there that can just do an SRU query that would pull this sort of information up instantly?  I imagine we could write an SQL query against our Oracle database that would do this, but it was easier for me to write a little Python script to do it.
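A minimal sketch of that kind of counting script follows.  It is not the script that produced the numbers above: it assumes the records have already been parsed into plain dictionaries, and the `lang_041_a`, `lang_041_h`, and `literary_form` keys, as well as the set of 008/33 codes treated as literature, are assumptions made for illustration.

```python
from collections import Counter

# Assumed subset of 008/33 (Literary form) codes to treat as literature.
LITERARY_FORMS = set("1dfjmp")

def count_translations(records, original="jpn"):
    """Count (target language, original language) pairs for literary records."""
    counts = Counter()
    for rec in records:
        if rec["literary_form"] not in LITERARY_FORMS:
            continue  # skip non-literary or uncoded records
        if original not in rec["lang_041_h"]:
            continue  # skip records not translated from the original language
        for target in rec["lang_041_a"]:
            counts[(target, original)] += 1
    return counts

# Made-up sample records standing in for parsed MARC data.
sample = [
    {"lang_041_a": ["eng"], "lang_041_h": ["jpn"], "literary_form": "f"},
    {"lang_041_a": ["fre"], "lang_041_h": ["jpn"], "literary_form": "p"},
    {"lang_041_a": ["eng"], "lang_041_h": ["jpn"], "literary_form": "1"},
    {"lang_041_a": ["eng"], "lang_041_h": ["jpn"], "literary_form": "0"},  # not literature
    {"lang_041_a": ["eng"], "lang_041_h": ["ger"], "literary_form": "f"},  # not from Japanese
]

counts = count_translations(sample)
for (target, orig), n in counts.most_common():
    print(f"{n:6d} {target} {orig}")
```

The output has the same shape as the list above: a count followed by the target and original language codes.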

If you're really interested in Japanese translations of literary works, here's the Excel spreadsheet.  The titles are in the 'vernacular' if the record had them: Download jpnTranslations.zip (479.5K).  I didn't run FRBR against this, since it didn't seem to result in a dramatic reduction in the number of records.

--Th

xISBN news

xISBN was down for nearly 24 hours, which gave lots of people problems, including several of us here at OCLC.  It was an upgrade which ran out of space and then went from bad to worse.

The good news is that the updated data is from a just-pulled copy of WorldCat and the software has been extended to handle 13-digit ISBNs, or at least 13-digit ISBNs which start with 978, which I believe all the valid ones currently do.  We went back and forth on how to accommodate the longer numbers and ended up with a fairly simple solution that shouldn't break anyone's code.  If we receive a 13-digit ISBN we silently convert it to the equivalent 10-digit one and proceed from there.

A side effect of doing it this way is that xISBN functions as a 13-digit to 10-digit ISBN converter, since we always echo the sent ISBN even if we don't find it in our table.
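The conversion itself is easy to sketch.  The following illustrates the standard arithmetic (drop the 978 prefix and the old check digit, then recompute the ISBN-10 check digit modulo 11); it is not the xISBN code, and the function name is made up.

```python
def isbn13_to_isbn10(isbn13):
    """Convert a 978-prefixed ISBN-13 to ISBN-10; return None if it can't be converted."""
    digits = isbn13.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit() or not digits.startswith("978"):
        return None
    core = digits[3:12]  # drop the 978 prefix and the ISBN-13 check digit
    # ISBN-10 check digit: weights 10 down to 2 over the nine core digits, modulo 11.
    total = sum(int(d) * weight for d, weight in zip(core, range(10, 1, -1)))
    check = (11 - total % 11) % 11
    return core + ("X" if check == 10 else str(check))

print(isbn13_to_isbn10("978-0-306-40615-7"))
```

Note that this sketch does not verify the incoming ISBN-13 check digit; it only recomputes the 10-digit one.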

We're sorry we were down and hope that what we've done with 13-digit numbers seems reasonable.  Someday real-soon-now OCLC is going to make xISBN a supported service, which should be more reliable (I'd guess our reliability is better than 99%, but not the 99.5% or better such a service really should be).

--Th

VIAF presentation

I gave a presentation about the Virtual International Authority File (VIAF) at ALA as part of the National Libraries program.  The session wasn't very well attended, but I did put some time into the slides (including, of course, cribbing some from Barbara Tillett and Ed O'Neill).

--Th

Birth dates

Ralph LeVan has been enhancing our LC/NACO Name Authority search service again.  Now you can put birth and/or death dates in your search string.  While he was doing that he couldn't resist taking the index he built and charting the birth and death dates.  It turns out that 1947 was a golden year; that's the most common birth date in the authority file.  Here's the data Ralph found: [Chart: birth dates in the NACO authority file]

I thought we would get a different answer if we looked at the records in WorldCat, but no, 1947 is still the most common year of birth for names in the personal author main entry (MARC tag 100), even when the number of holdings is taken into account (second chart).  [Chart: birth dates of WorldCat authors]

It's tempting to attribute the dips in the chart to World Wars I and II, but I suppose there may be other explanations.

For those interested, the last chart shows the death dates found in WorldCat authors.  Since many names in WorldCat don't currently have a death date associated with them, the counts are quite a bit lower.  They are also quite a bit more dispersed.  These charts are drawn from the 100 most frequently occurring dates, and the WorldCat numbers are weighted by the number of libraries we show holding the item.  It's not hard to guess who contributed the 1616 death date (Shakespeare), but it's a bit surprising that his birth year (1564) didn't make the top 100.  [Chart: death dates of WorldCat authors]
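The holdings-weighted tally behind charts like these can be sketched in a few lines.  This is an illustration only: it assumes headings that end in a `birth-death` or open-ended `birth-` date, and the holdings counts in the sample are invented.

```python
import re
from collections import Counter

# Matches a trailing "birth-death" or open-ended "birth-" date in a heading.
DATES = re.compile(r"(\d{4})-(\d{4})?\s*$")

def weighted_year_counts(headings):
    """Tally birth and death years, weighting each heading by its holdings count."""
    births, deaths = Counter(), Counter()
    for heading, holdings in headings:
        match = DATES.search(heading)
        if not match:
            continue  # heading carries no dates
        births[int(match.group(1))] += holdings
        if match.group(2):
            deaths[int(match.group(2))] += holdings
    return births, deaths

# Sample headings with invented holdings counts.
sample = [
    ("Shakespeare, William, 1564-1616", 500),
    ("Cervantes Saavedra, Miguel de, 1547-1616", 300),
    ("Author, Example, 1947-", 40),
    ("Undated, Name", 10),
]
births, deaths = weighted_year_counts(sample)
print(deaths.most_common(1))
```

Taking `most_common(100)` of each counter would give the "100 most frequently occurring dates" used for charting.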

We still have some work to do on the author matching/ranking, but I think we're getting closer to a very nice service.  We've received some excellent suggestions from people trying to match the NACO file to other services with information about names.  More about that as we get farther along.

--Th

Software contest #2

We announced last February at the Code4Lib conference in Corvallis that we were going to sponsor a Second OCLC Research Software Contest.  We're now ready for submittals, so I'd like to invite everyone eligible to submit.  This year we are mostly looking for Web applications.  All the submittals last year fell into this category anyway, and this way we don't have to install your code, except maybe in our browsers.

The winner gets $2,500 and a trip to Dublin, Ohio to visit with us.  One of the first people to sign up this year appears to be from Mumbai, so that 'trip to Dublin, Ohio' might be more incentive than we realized.

The deadline is September 15, 2006.

--Th

NACO normalization paper

For those of you who subscribe or have easy access to LRTS (Library Resources & Technical Services), Volume 50, No. 3 (July 2006) has an article by Jenny Toves, Ed O'Neill, and me about NACO Normalization on pages 166-172.  (Here is a preprint.)

Our main conclusions are that the exception for the first comma in subfield a should be dropped and that MARC subfield delimiters should be treated as blanks.  This would result in a few conflicts, but would greatly increase the scope of where this really excellent approach to normalization could be employed.
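A heavily simplified sketch shows what the two changes amount to.  This is not the full NACO rule set, which treats individual characters differently; the function name is made up, subfields are assumed to be marked with the 0x1F delimiter followed by a one-character code, and the comma exception is approximated as "the first comma in the string".

```python
import re
import unicodedata

def simple_normalize(heading, keep_first_comma=False):
    """Very simplified NACO-style normalization of a Latin-script heading."""
    # Treat each subfield delimiter (0x1F) plus its one-character code as a blank.
    text = re.sub(r"\x1f.", " ", heading)
    # Strip diacritics by decomposing and dropping combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.upper()
    if keep_first_comma:
        # Protect the first comma with a placeholder before removing punctuation.
        text = text.replace(",", "\x00", 1)
    # Replace every run of other characters with a single blank.
    text = re.sub(r"[^A-Z0-9\x00]+", " ", text)
    text = text.replace("\x00", ",")
    return " ".join(text.split())

heading = "\x1faSmith, John,\x1fd1947-"
print(simple_normalize(heading))
print(simple_normalize(heading, keep_first_comma=True))
```

With `keep_first_comma=False` the result is a plain string of blank-separated words, which is what makes the approach usable on text that never had subfields in the first place.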

An earlier post has more about this.

Now, on to normalizing non-Roman text.

--Th

Fascinated by remote apps

I'm more and more impressed by how much you can now do on other people's hardware.  I've mentioned before that we do some work on a virtual server rented from Openhosting.  Google's Gmail is a popular example of a remotely hosted application.  YouOS takes this to its logical conclusion and gives you a full desktop embedded in your browser.  YouOS is worth playing with for a few minutes, just to see where this sort of thing is going.  I found the applications a little slow to load and noticed a few glitches in the free demo, but there is enough there to get an idea how well a browser-hosted desktop could work and it doesn't look that hard to create new applications.  The technology and the number of people able to develop this sort of environment are both growing dramatically.

OCLC's Connexion web interface was an early example of a remote application hosted within a browser.  When we were developing the first version of it during the CORC project in 1998, I remember describing how it worked to one of the ILS vendors at ALA and being met with disbelief until I demonstrated it in his browser.  Given the types of applications you now see running in browsers, the Connexion web interface is looking a little dated, but if you look at the JavaScript involved you'll see it is a fairly complicated application.

Many libraries host their 'local' systems off-site, often in conjunction with a consortium of some kind, and OCLC hosts remote applications like ContentDM.  I think remote hosting and browser access to applications is a clear trend that will continue to grow.  There is a huge amount of redundancy in managing those systems that central sites can reduce.  I think they also can be more reliable.  I've certainly been very impressed with how reliable my Gmail account has been.

Another side of reliability is security, and that is one of the barriers to making YouOS work.  Using the web browser within the YouOS desktop to log into Typepad where this blog is maintained probably meant I was sending my password unencrypted.  It seems to me that everything sent to a 'Web OS' like YouOS should be encrypted, but even if it is encrypted at the browser level, I have no idea how secure the information is on the server.  Since the system is new and in 'early alpha' one can only assume 'not very'.

I'm always most impressed with applications that take the least setup.  At ALA I picked up a CD with Google applications on it and took another look at Google Earth.  It installed without a hitch (even though I had an earlier version on my machine), but for what I use it for, I don't see why the browser-based Google Maps couldn't do nearly as well.  Granted, it's a little easier to make it work smoothly locally, but Google Maps generally works well enough.  The web applications are rapidly catching up to local applications and for something like Google Earth, even a painless installation just isn't worth it.  Typing in a URL or just following a link in your browser is getting to be all you need to do to use a new application.

Thanks to Bob Bolander for pointing me at YouOS.  There are other similar projects, such as goowy, which evidently uses Flash instead of straight JavaScript, and eyeOS, which loads faster than YouOS, although I found its interface a bit confusing.

--Th

64-bit Python

We've been using the beta version of Python 2.5 on some new 64-bit Linux boxes, compiled for 64-bit addresses.  This means no more two-gigabyte limits (in practice we were seldom able to use much more than one gigabyte before).  The other day we created a 4,000,000,000-byte string.  The machines have six gigabytes of memory on them, so our attempt at creating an 8-billion-byte string (by making the string Unicode) looked like it was going to work, but took longer than we were willing to wait.  Older versions of Python could be compiled as 64-bit, but 2.5 looks like it has fixed a lot of 64-bit-related bugs.
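If you're not sure whether a given interpreter was built for 64-bit addresses, a quick check looks like the sketch below.  It is written for current Python 3 (`sys.maxsize` did not exist under that name in 2.5), and the helper name is made up.

```python
import struct
import sys

def pointer_bits():
    """Return the size of a C pointer, in bits, for the running interpreter."""
    return struct.calcsize("P") * 8

print(pointer_bits(), "bit build")
# sys.maxsize is the upper bound on the length of any single string or list.
print("largest container size:", sys.maxsize)
```

On a 32-bit build the container limit works out to a little over two billion, which is where the old two-gigabyte ceiling comes from.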

Being able to read larger files directly into memory is already changing some of our approaches to name-matching problems.  Four million bibliographic records fit into memory with room to spare.

One of the reasons we started using Python 2.5 before its expected final release in August is that it includes SQLite.  SQLite does have some limitations, but on one of our workstations we have been able to load all 60 million records of a January copy of WorldCat into a table indexed by OCLC number very quickly (about 6,600 records/second).  SQLite runs as an embedded database; that is, each application using a SQLite database opens the files directly (most databases run as servers and incur interprocess communication overhead).  This is also where we ran into one of its limitations: at least on our Linux nodes this doesn't work if the file is NFS mounted.

In general, though, we like it.  SQLite looks like a reasonable alternative if you want an embedded database, and probably your best alternative if you are also working in Python.  Don't forget to do a 'commit' before you exit!
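For anyone who hasn't tried it, here is a minimal sketch using the standard `sqlite3` module.  The table and column names are invented for illustration and are not our actual schema, and an in-memory database stands in for a file on local disk.

```python
import sqlite3

def load_records(conn, records):
    """Insert (oclc_number, raw_record) pairs into a table keyed by OCLC number."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS worldcat ("
        "oclc_number INTEGER PRIMARY KEY, record BLOB)"
    )
    conn.executemany("INSERT INTO worldcat VALUES (?, ?)", records)
    conn.commit()  # without this the inserts are lost when the connection closes

conn = sqlite3.connect(":memory:")  # use a path on a local (not NFS) disk for real data
load_records(conn, [(1, b"first record"), (2, b"second record")])
row = conn.execute(
    "SELECT record FROM worldcat WHERE oclc_number = ?", (2,)
).fetchone()
print(row[0])
```

Declaring the OCLC number as the integer primary key means lookups by number use the table's own index, with no separate index to build.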

--Th

OAI harvesting problems

Carl Lagoze et al.* presented an interesting paper at JCDL 2006 about their experience at Cornell harvesting and integrating metadata of the National Science Digital Library (NSDL) via OAI-PMH.  Some of their problems were certainly predictable.  When metadata is created without standards the result is almost inevitably difficult to deal with.  Just saying 'Dublin Core' helps some, but not enough.

More surprising is their experience in just trying to get the metadata harvested.  They report a 'harvest failure rate' of 25-50% harvesting from 113 collections from about 85 OAI servers.  That has not been our experience.

Here at OCLC Research we harvest metadata about electronic theses for the Networked Digital Library of Theses and Dissertations (NDLTD).  Currently we harvest about 250,000 records from 60 OAI-PMH sites around the world.  We make the aggregate collection available here.  Typically each month we just harvest additions, not the whole site, but occasionally we go back and harvest all the records.  Our failure rate is more like 2-3 failures out of 60, or about 5%, much smaller than NSDL is reporting.
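The difference between an incremental and a full harvest comes down to the `from` argument of the OAI-PMH `ListRecords` request.  The sketch below builds the request URLs and pulls the resumption token out of a response; it is not our harvester, the base URL is a placeholder, and the function names are made up.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, since=None, token=None, prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    if token:
        # resumptionToken is an exclusive argument: no other parameters are allowed.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": prefix}
        if since:
            params["from"] = since  # harvest only records changed since this date
    return base_url + "?" + urlencode(params)

def next_token(response_xml):
    """Return the resumptionToken from a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    element = root.find(f".//{OAI_NS}resumptionToken")
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()

# Placeholder base URL and a hand-written response fragment for illustration.
url = list_records_url("http://example.org/oai", since="2006-06-01")
page = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<resumptionToken>abc123</resumptionToken></ListRecords></OAI-PMH>"
)
print(url)
print(next_token(page))
```

Leaving out `since` gives the full harvest; a harvester keeps requesting pages with the returned token until `next_token` comes back as `None`.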

I'm not sure why there is such a difference in our experience and NSDL's.  We don't do a lot of processing of the records, so some of the problems NSDL detects we might not, although subsequent users of the collected data do process it in more detail and we help resolve data encoding problems, so eventually we do see many of these problems.  Another possibility is more uniformity in the servers for electronic theses than for NSDL metadata.

One thing worth noting is that our 'initial failure rate', that is, how often a site fails the first time we contact it, is very close to 100%.  We nearly always have one or more problems with new sites and it often takes numerous email exchanges to get them resolved, similar to what the Cornell NSDL group reported.

--Th

*Metadata aggregation and "automated digital libraries": A retrospective on the NSDL experience by Lagoze, Carl; Krafft, Dean; Cornwell, Tim; Dushay, Naomi; Eckstrom, Dean; Saylor, John (http://arxiv.org/abs/cs/0601125)
