
Librarianship as computation

Once in a while I'll write a blog post but never publish it.  I found this in my list of unpublished entries, and since we in OCLC Research are in the midst of moving our offices, thought it was worth publishing even though it was written last July.

Peter Denning has an interesting column in the July issue of Communications of the ACM (v50i7), Computing is a Natural Science by Peter J. Denning (self-archived version) [thanks to Confessions of a Science Librarian for the links].  The main point of the article is that as we understand the world better, we see that computation is embedded very deeply in many aspects of it.  Probably the best example of this is biology, which now encompasses the study of DNA encoding and focuses more and more on the computation that goes on with that information (I wonder how long before we will be able to predict what an organism looks like solely from its DNA?).  Another example is a recent article, The Memory Code by Joe Z. Tsien in the July 2007 Scientific American, about watching the computation going on in a mouse's brain in reaction to being shaken (be sure to watch the video).

We've seen a similar progression in our field.  As metadata went digital, organizations such as OCLC allowed sharing of it to an unprecedented degree (our cataloging service seems close to the 'infinite games' that Denning describes), and we are gradually making the maintenance of the data more and more a computation rather than a human task.  Of course, Denning's point would be that it has always been a computation; what we are doing is understanding that computation and making it run on our computers.

As the source materials become available in digital form, organizations such as Google are doing similar things, although I'm not sure what the parallel would be to the automation of maintenance for the sources; possibly their continued refinement of information extracted from the page images, such as redoing the character recognition and indexing.  For Wikipedia it probably corresponds to more automation of the content.  Currently Wikipedia seems to resist much of that automation, although I suspect that 'resistance is futile' and that it will incorporate more and more automatically selected and edited material.

So, are there any implications for librarianship?  The ones that come to mind are probably obvious:

  • More of our metadata will be automatically extracted from source materials (especially as the digital form becomes available earlier in the process)
  • This will include suggestions for classification and subject headings
  • Authority control of names, etc. has much to gain from automated analysis of both our existing metadata and associated texts
  • Access to the full text of collections is going to change everything

The simpler library tasks have already been computerized.  More and more of them will be.

I had to laugh a bit when I tried to access Great Principles of Computing, a site Denning contributes to, and found that the links on the welcome screen didn't work in Firefox (magnification had caused a heading to block access to them).  There is a lot of information there, but 'usability' isn't the main thrust, nor is 'programming'.  I often hear people complain of the lack of programming skills of computer science graduates, but Denning has long felt that thinking of computing as programming is out-of-date at best.  See his CACM column of three years ago.

--Th

BnF and the VIAF

There has been quite a bit of activity in the Virtual International Authority File project.  Probably the most important event is that the Bibliothèque nationale de France has joined the Deutsche Nationalbibliothek, the Library of Congress, and OCLC as a principal partner.  We had a meeting at the BnF in Paris in October to sign the agreement, review what has been done, and make plans for the future.  The French seem very enthusiastic about the possibilities.  We have been working with a small sample of records for some time (we still have some unresolved problems incorporating UNIMARC authorities) and are eagerly awaiting the rest of the file.

Probably the most visible thing we have done is bring up a prototype that shows the links we are forming between records.  Right now this only shows the LC/NACO and DNB records, but we expect to receive the French file this year (2007).  The next big step, beyond adding the French authorities, is to present a view of the VIAF with merged records, rather than just showing links between existing records.  We have done quite a bit of work towards that and I plan to do some blog entries about it soon.  After doing the merge we plan to expand the system to additional authority files, and now have an agreement that lays out how that will be conducted going forward.

--Th

PURLs

Since I've been at least peripherally involved with PURLs since their inception, and now heavily involved in their reimplementation, I enjoyed reading Stu Weibel's latest post PURLy Gates and Gift Horses (from Paris -- is he trying to make us jealous?).

As Stu mentioned, we are in the final stages of redoing the code.  After acceptance testing we'll have to work out a transition plan to minimize disruption.  The end result should be (open source) code that both does a better job at what PURLs do now, and is embeddable in systems that need to handle URI indirection (the need is becoming almost universal).
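To make "URI indirection" concrete, here is a minimal sketch of the idea, not the new PURL code: a stable path is looked up in a table and answered with an HTTP redirect to wherever the resource currently lives. The table contents, the function name, and the choice of status 302 are all assumptions made for illustration.

```python
# Hypothetical mapping from a persistent path to the resource's current URL.
PURL_TABLE = {
    "/net/example": "http://example.org/current/location",
}

def resolve(path, table=PURL_TABLE):
    """Return (status, headers) as a redirect-style resolver might answer."""
    target = table.get(path)
    if target is None:
        # Unknown persistent path: nothing to redirect to.
        return 404, {}
    # Known path: point the client at the current location.
    return 302, {"Location": target}
```

Moving a resource then only means updating its table entry; every link to the persistent path keeps working.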

Zepheira.com is doing the programming on PURLs.  This is Eric Miller's company, and they are deeply interested in making PURLs much more widely used and appreciated.

--Th

Links to WorldCat Identities

I promised in a recent comment to talk more about how the links work to WorldCat Identities.  As I said before, the mechanism that WorldCat.org uses is an OpenURL, but complicated by the desire to put the new page within a frame.  Here is the equivalent URL without the frame:
http://worldcat.org/identities/find ?url_ver=Z39.88-2004 &rft_val_fmt=info:ofi/fmt:kev:mtx:identity &rft.namelast=Austen &rft.namefirst=Jane &rft.id=info:oclcnum/908814 , slightly edited by adding some spaces so that the lines can break.
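For anyone generating these links from a program, a rough sketch of assembling such an OpenURL with Python's standard library follows. The function name is made up, and the parameter keys are simply copied from the URL above; note that `urlencode` percent-escapes the colons and slashes that were left readable in the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://worldcat.org/identities/find"

def identity_openurl(last, first, oclc_number=None):
    """Build an Identities OpenURL; keys are copied from the example above."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:identity"),
        ("rft.namelast", last),
        ("rft.namefirst", first),
    ]
    if oclc_number is not None:
        # The OCLC number is optional; it ties the name to one specific record.
        params.append(("rft.id", f"info:oclcnum/{oclc_number}"))
    return BASE + "?" + urlencode(params)

if __name__ == "__main__":
    url = identity_openurl("Austen", "Jane", 908814)
    print(url)
    # Decoding the query string gives back the readable values.
    print(parse_qs(urlsplit(url).query))
```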

If you leave off the OCLC number, the system is no longer sure exactly which Jane Austen you are interested in and will give you a list to choose from.  Here is the same example without the OCLC number: http://worldcat.org/identities/find ?url_ver=Z39.88-2004 &rft_val_fmt=info:ofi/fmt:kev:mtx:identity &rft.namelast=Austen &rft.namefirst=Jane, which should return something like this:

[Screenshot: list of Jane Austen names to choose from]

What is returned is really XML with a reference to an XSL stylesheet to transform the XML into the HTML displayed by the browser.  The page has a brief explanation of the symbols and results formatting:

[Screenshot: key to the symbols and formatting used in the results]

More than one person has suggested that the need for a key to the results indicates a problem.  Although I think it useful (especially for the library-literate) to contrast between controlled and uncontrolled names, I'm less comfortable with the traditional personal/corporate split.  In retrospect it would have been better to ignore the MARC coding on the names and merge the occasional Jane Austen that is coded as a corporate name in with the rest of those coded as a personal name.

Actually there is a much simpler URL which works too: http://worldcat.org/identities/find?fullName=jane+austen.  This is probably the format we will encourage people to use in external linking.
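A small sketch of producing that simpler link from a name string, assuming only what the URL above shows (a single `fullName` parameter with spaces encoded as `+`); the function name is hypothetical.

```python
from urllib.parse import urlencode

def identities_name_link(full_name):
    """Build the simple fullName form of an Identities link."""
    # urlencode turns spaces into '+', matching the example URL.
    return "http://worldcat.org/identities/find?" + urlencode({"fullName": full_name})

if __name__ == "__main__":
    print(identities_name_link("jane austen"))
```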

Although WorldCat Identities uses LCCNs when available, there is no reason we can't add indexes to OCLC's Authority Record Numbers (ARNs).  There is essentially a 1 to 1 correspondence between ARNs and LCCNs for names, and it is ARNs that are carried along with controlled and linked headings in WorldCat.

Thanks to Ralph LeVan who designed, implemented and explained most of this.

--Th

WorldCat.org and Identities

Over the weekend WorldCat Identities went into production.  Personal authors now have a link to their WorldCat Identity page under the Details tab.  The links are not hard links, but actually OpenURLs.  Here is the Jane Austen link from an Emma manifestation (OCLC #908814), with URL escaping edited out for readability:

http://www.worldcat.org/wcpa/oclc/908814 ?page=frame &url=/identities/find ?url_ver=Z39.88-2004 &rft_val_fmt=info:ofi/fmt:kev:mtx:identity &rft.namelast=Austen &rft.namefirst=Jane, &rft.nameinit=J &rft.nameinit1=J &rft.nameinitm= &rft.namesuffix= &rft.nametitle= &rft.date=1775-1817. &rft.name= &rft.birthdate=1775 &rft.deathdate=1817. &rft.arn= &rft.title=Emma. &rft_id=info:oclcnum/908814 &title= &linktype=identitiesLink .

The whole thing gets a bit messy because the Identity page gets embedded in a frame, the same way we frame pages from OPACs.
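To see why escaping matters here, consider this simplified sketch of nesting one URL inside another. It assumes the inner OpenURL travels percent-encoded in the `url` parameter, which is consistent with the note above that escaping was edited out. The helper names are invented, and the real link carries more parameters than shown.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def frame_link(oclc_number, inner_url):
    """Wrap an inner URL in a frame-style outer URL (simplified)."""
    outer = f"http://www.worldcat.org/wcpa/oclc/{oclc_number}"
    # urlencode escapes '?', '&' and '=' in the inner URL so that its
    # parameters are not mistaken for parameters of the outer URL.
    return outer + "?" + urlencode({"page": "frame", "url": inner_url})

def unwrap_frame_link(link):
    """Recover the inner URL from a link built by frame_link."""
    return parse_qs(urlsplit(link).query)["url"][0]

if __name__ == "__main__":
    inner = "/identities/find?" + urlencode(
        {"url_ver": "Z39.88-2004", "rft.namelast": "Austen", "rft.namefirst": "Jane"}
    )
    link = frame_link(908814, inner)
    print(link)
    print(unwrap_frame_link(link) == inner)
```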

The pages themselves use LCCNs for their identifier if we have one, so they have nice simple cool URLs, which the frame obscures.  Here is Jane Austen's: http://www.worldcat.org/identities/lccn-n79-32879.  We're still working on a few of these names, so the URLs aren't totally frozen, but they are getting close.

We have plans to document the various links into WorldCat Identities, but it's not too hard to see how the links work since everything is OpenURLs, SRU, XML, and stylesheets.
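As one illustration of the XML-and-stylesheets part, here is a sketch that finds the stylesheet an XML response points to by reading its `xml-stylesheet` processing instruction. The sample document, its element names, and the `identity.xsl` filename are invented for the example and are not the actual Identities output.

```python
import re
from xml.dom import minidom

# Invented sample; real responses will have different content.
SAMPLE = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="identity.xsl"?>'
    "<Identity><name>Austen, Jane</name></Identity>"
)

def stylesheet_href(xml_text):
    """Return the href of the first xml-stylesheet instruction, or None."""
    doc = minidom.parseString(xml_text)
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # The instruction's data looks like: type="text/xsl" href="..."
            match = re.search(r'href="([^"]*)"', node.data)
            if match:
                return match.group(1)
    return None

if __name__ == "__main__":
    print(stylesheet_href(SAMPLE))
```

A browser does the equivalent lookup itself and applies the stylesheet to produce the HTML it displays.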

The research version of Identities is still there at http://orlabs.oclc.org/Identities/.   A close look will reveal some differences in how links work on the pages (more of those in the production version link directly back into WorldCat.org), and the research version makes it easier to search Identities directly rather than through WorldCat.org.  Unless this causes too much confusion we are planning on keeping it up so we can experiment publicly without impact on the production version.

--Th
