
Librarianship as computation

Once in a while I'll write a blog post but never publish it.  I found this in my list of unpublished entries, and since we in OCLC Research are in the midst of moving our offices, I thought it was worth publishing even though it was written last July.

Peter Denning has an interesting column in the July issue of Communications of the ACM (v50i7), Computing is a Natural Science by Peter J. Denning (self-archived version) [thanks to Confessions of a Science Librarian for the links].  The main point of the article is that as we understand the world better, we see that computation is embedded very deeply in many aspects of it.  Probably the best example of this is biology, which now encompasses the study of DNA encoding and focuses more and more on the computation that goes on with that information (I wonder how long before we will be able to predict what an organism looks like solely from its DNA?).  Another example is a recent article, The Memory Code by Joe Z. Tsien, in the July 2007 Scientific American about watching the computation going on in a mouse's brain in reaction to being shaken (be sure to watch the video).

We've seen a similar progression in our field.  As metadata went digital, organizations such as OCLC allowed sharing of it to an unprecedented degree (our cataloging service seems close to the 'infinite games' that Denning describes), and we are gradually making the maintenance of the data more and more a computation rather than a human task.  Of course, Denning's point would be that it has always been a computation; what we are doing is understanding that computation and making it run on our computers.

As the source materials become available in digital form, organizations such as Google are doing similar things, although I'm not sure what the parallel would be to the automation of maintenance for the sources; possibly their continued refinement of information extracted from the page images, such as redoing the character recognition and indexing.  For Wikipedia it probably corresponds to more automation of the content.  Currently Wikipedia seems to resist much of that automation, although I suspect that 'resistance is futile' and that it will incorporate more and more automatically selected and edited material.

So, are there any implications for librarianship?  The ones that come to mind are probably obvious:

  • More of our metadata will be automatically extracted from source materials (especially as the digital form becomes available earlier in the process)
  • This will include suggestions for classification and subject headings
  • Authority control of names, etc. has much to gain from automated analysis of both our existing metadata and associated texts
  • Access to the full text of collections is going to change everything

The simpler library tasks have already been computerized.  More and more of them will be.

I had to laugh a bit when I tried to access Great Principles of Computing, a site Denning contributes to, and the links on the welcome screen didn't work in Firefox (magnification had caused a heading to block access to them).  There is a lot of information there, but 'usability' isn't the main thrust, nor is 'programming'.  I often hear people complain about the lack of programming skills among computer science graduates, but Denning has long felt that thinking of computing as programming is out-of-date at best.  See his CACM column of three years ago.

--Th

PURL2

OCLC has been running the PURL service with only minor outages since 1995.  PURLs (Persistent Uniform Resource Locators) provide a level of indirection allowing the separation of the name of a Web resource from the location of it.  We have made the source code available so that others can run PURL systems, and several organizations do so.
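The indirection is easy to picture in code.  What follows is only a minimal sketch, not the PURL server's actual design: the class name PurlTable, its methods, and the example names and URLs are all hypothetical, and a real service would sit behind an HTTP server and persist its table.  The point is just that the persistent name stays fixed while the location behind it can be changed.

```python
from typing import Dict, Optional, Tuple


class PurlTable:
    """Hypothetical in-memory table mapping persistent names to current locations."""

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def register(self, name: str, location: str) -> None:
        # Registering an existing name again simply points it at the new location.
        self._targets[name] = location

    def resolve(self, name: str) -> Tuple[int, Optional[str]]:
        # Return an HTTP-style status and the location to redirect to.
        location = self._targets.get(name)
        if location is None:
            return 404, None
        return 302, location


table = PurlTable()
table.register("/net/example/report", "http://old-host.example.org/reports/1.html")
print(table.resolve("/net/example/report"))

# The resource moves; the persistent name that people cite does not change.
table.register("/net/example/report", "http://new-host.example.org/archive/report-1.html")
print(table.resolve("/net/example/report"))
```

Anyone who bookmarked the persistent name keeps a working link after the move, because only the table entry was updated.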

OCLC has contracted with Zepheira to reimplement the PURL code, which has become a bit out of date over the years.  The new code will be written in Java and released under the Apache 2.0 license.  We expect it to be embeddable, opening up many new uses.  We frequently run into situations where an easy way to manage HTTP redirects within an application would be useful, so I imagine others do too.

Eric Miller is the president of Zepheira.  Eric used to work in the Office of Research here at OCLC before taking a job at MIT for the W3C.  All three organizations are excited about the possibilities of the new software and we just issued a joint press release about it.

Our schedule is to have the reimplementation completed this fall.

--Th

Typo of the day

There is a blog I look at occasionally called Typo of the Day for Librarians.  Every day they post a new typographical error and talk about how and where it occurs in library catalogs.  Today while debugging WorldCat Identities I ran across what must be a fairly new error, an XML processing error embedded in the data.  In particular the record had &amp;amp; in it (instead of just &amp;).  This happens when a string with an ampersand gets 'escaped' for insertion into XML twice:

  1. The string starts out New York & Pennsylvania
  2. It gets escaped into New York &amp; Pennsylvania
  3. Escaping it again gives you New York &amp;amp; Pennsylvania

So if you look at the raw XML you see &amp;amp;.  On the screen it looks like &amp; (which, confusingly, is what should be stored in the XML).  Interestingly enough, WorldCat.org does some sort of magic, so that the public view of the records displays correctly.  Today WorldCat has quite a few records with this error in them, but I've given the list to bibchange to look at.
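The double escaping is easy to reproduce.  Here is a small Python sketch using the standard library's xml.sax.saxutils.escape; the helper names looks_double_escaped and undo_one_level and the regular expression are my own illustrations, not anything WorldCat or bibchange actually runs.

```python
import re
from xml.sax.saxutils import escape

original = "New York & Pennsylvania"
once = escape(original)   # correct: escaped one time for XML
twice = escape(once)      # the error: escaped a second time

print(once)
print(twice)

# An escaped ampersand followed directly by the rest of an entity reference
# is a strong hint that the text was escaped twice.
DOUBLE_ESCAPED = re.compile(r"&amp;(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);")


def looks_double_escaped(raw_xml_text: str) -> bool:
    """Report whether the raw (still escaped) text shows the pattern."""
    return DOUBLE_ESCAPED.search(raw_xml_text) is not None


def undo_one_level(raw_xml_text: str) -> str:
    """Remove one level of escaping from each double-escaped entity."""
    # Drop the leading "&amp;" (5 characters) and put a bare "&" back.
    return DOUBLE_ESCAPED.sub(lambda m: "&" + m.group(0)[5:], raw_xml_text)


print(looks_double_escaped(twice))
print(undo_one_level(twice) == once)
```

A check like this has to run on the raw XML, before any parser has unescaped it.  It will also flag text that really is supposed to show a literal entity on screen (a post like this one, for example), so a person should look at the hits before anything is changed.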

If you are interested in standard typographical errors in catalogs another site (associated with Typo of the Day for Librarians) is Typographical Errors in Library Databases.

--Th

Early dates in WorldCat

The older dates didn't seem to be working properly in WorldCat Identities, so I've been looking at records in WorldCat with early dates.  In our March 2007 copy of WorldCat there are 828 records with BCE dates.  Since there are only 7,500 records with a date of 1450 or earlier, it can be difficult to find Identities records with early dates, especially since many of the very early materials described are things like mummy cloths or coins or tablets without a clear name associated with them for Identities.

For sharing, I made a WorldCat list of the ten oldest items I found.  Here is an edited version:

Type: Visual Material : 3-D object Internet Resource
[between 2655 and 2330 BCE]
Type: Book : Thesis/dissertation/manuscript Archival Material
[ca. 3400 B.C.]
by Eugene Augustus Hoffman; General Theological Seminary (New York, N.Y.); St. Mark's Library (General Theological Seminary). E.A. Hoffman Collection.
Type: Visual Material : 3-D object
[ca. 3000 B.C.]
Type: Archival Material
[n.d.]
Type: Archival Material
by Dungi, King of Ur; Edgar James Banks
Type: Visual Material : 3-D object
[ca. 2350 B.C.]
Type: Archival Material
Type: Visual Material : Projected image Archival Material Internet Resource
[ca. 2400-2350 B.C.]
Type: Visual Material : 3-D object Internet Resource
[between 4800 and 3300 BCE]
Type: Book
[ca. 3000 B.C.]

We'll have those BCE dates in Identities working properly soon.
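To give a feel for what handling these dates involves, here is a rough Python sketch that turns date strings like the ones in the list above into a signed year for sorting.  It is an assumption-laden illustration, not how Identities does it: the function name earliest_year and both patterns are hypothetical, BCE years are simply negated (no year-zero adjustment), and ranges that cross from BCE into CE are not handled.

```python
import re
from typing import Optional

YEAR = re.compile(r"\d{1,4}")
# Matches "B.C.", "BC", "BCE" and "B.C.E." as written in the date statements.
BCE_MARK = re.compile(r"\bB\.?\s?C\.?(?:E\.?)?")


def earliest_year(date_text: str) -> Optional[int]:
    """Return the earliest year mentioned, negative for BCE, or None if undated."""
    years = [int(y) for y in YEAR.findall(date_text)]
    if not years:
        return None  # e.g. "[n.d.]"
    if BCE_MARK.search(date_text):
        # For BCE the larger number is the earlier year.
        return -max(years)
    return min(years)


samples = [
    "[ca. 3000 B.C.]",
    "[between 4800 and 3300 BCE]",
    "[n.d.]",
    "[ca. 3400 B.C.]",
    "[ca. 2400-2350 B.C.]",
]

# Undated records are set aside; the rest sort oldest first.
dated = sorted((s for s in samples if earliest_year(s) is not None), key=earliest_year)
for s in dated:
    print(earliest_year(s), s)
```

Sorting on a signed year puts the oldest material first, which is the ordering a plain string or unsigned comparison gets wrong for BCE dates.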

--Th
