
WorldCat holdings

The last few days we've been looking for patterns in how holdings change in WorldCat records.  As libraries catalog materials on OCLC they can indicate that they 'hold' the item.  The total number of these holdings just passed 1,000,000,000.

Part of the justification for this was to see if we could do a better job of selecting a particular manifestation of a work-set to represent the whole set.  This is sort of a 'poor man's' FRBR, and it is what we currently do for the records that Yahoo harvests from us: each one is the highest-held record of one of the 3,000,000 most popular works in WorldCat.  It's not perfect, but it was quick, and it works surprisingly well.  Once you find a record in Yahoo (or Google or anywhere else), the Open WorldCat pages have links to other records we have identified as being part of the same work-set.
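The selection described above can be sketched in a few lines.  This is only an illustration: the record IDs, work-set IDs, and holdings counts are made up, and ranking works by the total holdings of their work-set is an assumption, since the post does not say how 'most popular' is measured.

```python
from collections import defaultdict

def pick_representatives(records, top_n):
    """Pick the highest-held record for each of the top_n work-sets.

    records: iterable of (record_id, workset_id, holdings) tuples.
    Popularity is taken to be total holdings per work-set (an assumption).
    """
    best = {}                   # workset_id -> (record_id, holdings)
    totals = defaultdict(int)   # workset_id -> summed holdings
    for record_id, workset_id, holdings in records:
        totals[workset_id] += holdings
        if workset_id not in best or holdings > best[workset_id][1]:
            best[workset_id] = (record_id, holdings)
    popular = sorted(totals, key=lambda w: totals[w], reverse=True)[:top_n]
    return {w: best[w][0] for w in popular}

# Hypothetical sample data, not real WorldCat records.
sample = [
    ("r1", "macbeth", 120),
    ("r2", "macbeth", 950),
    ("r3", "hamlet", 700),
    ("r4", "obscure", 3),
]
print(pick_representatives(sample, top_n=2))
```

Here `r2` stands in for the whole Macbeth work-set because it has the most holdings, and the least-held work-set is left out entirely.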

We have holdings counts for the records we had seven years ago, in 1998.  61% of the records now in the database have gained holdings since then, 4% have lost holdings, and 35% have holdings counts that remained the same.  Looking only at changes in the records from 1998, excluding more recent records, the figures are: 42% gained holdings, 6% lost, and 52% remained constant.

It is interesting to note that the highest OCLC number in 1998 was just under 40 million (as opposed to more than 60 million now).  Over the last seven years, WorldCat has grown by 50%.  Nearly 20% of WorldCat's 1,004,000,000+ holdings are on records entered in the last seven years, and about one-third of all holdings have been added in that time (WorldCat went online 34 years ago).
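The two sets of percentages above fit together if every record added since 1998 is counted as having gained holdings.  The sketch below checks that arithmetic.  It assumes that roughly 40 of every 60 current records already existed in 1998 (OCLC numbers are only an approximation of record counts) and that all newer records fall into the 'gained' group; neither assumption is stated in the post.

```python
def blend_with_new_records(old_share, old_percentages):
    """Scale percentages for pre-existing records up to the whole database.

    old_share: fraction of current records that already existed in 1998.
    old_percentages: (gained, lost, unchanged) for those older records.
    Every newer record is counted as 'gained' (an assumption).
    """
    new_share = 1.0 - old_share
    gained, lost, unchanged = old_percentages
    return (
        round(gained * old_share + 100 * new_share),
        round(lost * old_share),
        round(unchanged * old_share),
    )

# 40 of 60 million records date from 1998 or earlier (approximate).
print(blend_with_new_records(40 / 60, (42, 6, 52)))  # (61, 4, 35)
```

The result matches the database-wide figures of 61%, 4%, and 35%.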

One of the things we looked at was the set of records that lost the most holdings.  Of the 100 records that lost the most holdings, only 2 were not the result of the record being merged with another (we do this regularly as part of our duplicate detection process).

We had the idea that the record in a work-set that has gained the most holdings recently might be a good record to point to.  It turns out that most of these are electronic versions, typically from NetLibrary.  This is great if your library has a NetLibrary account, but otherwise not particularly helpful.

Thanks to Ed O'Neill and Rick Bennett who saved the holdings data in an accessible form, and Jenny Toves who processed the data.

--Th

Adirondack Loj

Looking at Cutter's Objects got me looking through Wikipedia at Charles Cutter, and then, of course, his contemporary Melville Dewey, or Melvil Dui as he later called himself.  Dewey had lots of interests, including spelling reform and constructed languages.  Wikipedia claims that Dewey owned Adirondak Loj, which is undoubtedly true.

I once wrote a report about library hand, a nicely rounded style of handwriting appropriate for library catalogs.  It turns out that Dewey was one of the main proponents of library hand.  It was still on the curricula of some library schools into the 1940s, and I can remember seeing library cards that faithfully used the style Dewey suggested.  Library Hand would make a good name for a blog.  Here's a nice essay about card catalogs that mentions it.

So, that prompts the question: how many 'Dui-isms' are there still out there?  Wikipedia claims Americans write 'catalog' instead of 'catalogue' because of Dewey, and hundreds of thousands of libraries use the Dewey Decimal Classification, but there must be many other more obscure traces around.

Maybe 025.431: The Dewey blog knows.

--Th

Parallel Text Searching

Ralph LeVan, Jenny Toves and I recently published an article in D-Lib Magazine on parallel text searching.  This is mostly Ralph's work, looking at how we could get fast (>100 searches/second) searching on our 60 million record database using commodity hardware.  As a bonus, Ralph was able to get it working using standard protocols.

The trick, of course, is dividing up the work to run across our 24-node/48-cpu Beowulf cluster.  Dividing up the searching wasn't too hard, although making sure that ranking can still be done quickly across the pieces takes some thought.  It took a little more work to figure out how to divide the workload in aggregating the results of those searches (we ended up with two levels of aggregation).  Another lesson was that, with care, XML can be used as an internal protocol.  Some of the care was to use SRU to avoid the extra XML overhead of SRW.
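As a rough, single-process illustration of the idea (not Ralph's implementation, which is in Java and runs across cluster nodes over SRU), the sketch below searches each partition separately and merges the ranked results in two levels.  The scoring by term count, the function names, and the `fan_in` parameter are all assumptions made for the example.  The toy score depends only on the record itself, so it sidesteps the harder problem of keeping rankings comparable across partitions.

```python
import heapq

def search_partition(partition, term, k):
    """Return the top-k (score, record_id) hits from one partition.

    partition: dict of record_id -> text.  Score is a simple term count.
    """
    hits = []
    for record_id, text in partition.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, record_id))
    return heapq.nlargest(k, hits)

def aggregate(result_lists, k):
    """Merge several ranked result lists into a single top-k list."""
    return heapq.nlargest(k, (hit for results in result_lists for hit in results))

def search(partitions, term, k=3, fan_in=2):
    """Search every partition, then aggregate the results in two levels."""
    per_node = [search_partition(p, term, k) for p in partitions]
    # First level: each aggregator merges the results of fan_in nodes.
    groups = [per_node[i:i + fan_in] for i in range(0, len(per_node), fan_in)]
    first_level = [aggregate(group, k) for group in groups]
    # Second level: one final merge over the aggregator outputs.
    return aggregate(first_level, k)

# Hypothetical toy partitions standing in for slices of the database.
partitions = [
    {"a": "mozart adagio", "b": "mozart mozart letters"},
    {"c": "dewey decimal"},
    {"d": "mozart"},
    {"e": "mozart mozart mozart"},
]
print(search(partitions, "mozart", k=2))
```

Because each level passes along only its own top-k hits, the final merge never sees more than a handful of results per aggregator, however many nodes sit underneath.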

This is all written in Java and available (in its current incomplete state) as open source software.

--Th

Related entry: Mapreduce and web services

Why our catalogs don't work

I blame the 1961 Paris Principles.  Here's the relevant section:

2. Functions of the Catalogue

The catalogue should be an efficient instrument for ascertaining

2.1 whether the library contains a particular book specified by

    (a) its author and title, or

    (b) if the author is not named in the book, its title alone, or

    (c) if author and title are inappropriate or insufficient for identification, a suitable substitute for the title; and

2.2 (a) which works by a particular author and

    (b) which editions of a particular work are in the library.

Notice there is no mention of subjects, or formats, or navigation, or obtaining resources.

Of course, I'm not the only one to notice this.  The Statement of International Cataloguing Principles by the IFLA Meeting of Experts on an International Cataloguing Code (I'm looking at a January 2005 draft) expands on these quite dramatically, adding in FRBR and putting subjects back in (Cutter had subjects and more in his Rules for a Dictionary Catalog in 1904, see footnote).  Unfortunately, by the time you add in FRBR, subjects, navigation, and formats, the text gets pretty long.

It's fair to say that we have catalogs that perform the functions laid out in 1961 admirably.  There's no way of knowing this, but I would guess that the OPACs and union catalogs of the world are used to do more searches than the equivalent card catalogs ever were.  In many ways our catalogs are a success.  So why do we feel so bad about them?

The simple answer is that we are asking our catalogs to do things they weren't designed to do.  They were designed to provide access to a library's collection, most of which was within a short walk of the catalog.  That's not the case today.  Now our catalogs are being coordinated, collated and collapsed into union catalogs of all types, and are trying to offer access to remote materials.  Even the library's own catalog is, more often than not, being used remotely.  Some of the problems stem from the union catalog aspects, but the main problems are the remoteness of our users and changes in their expectations and experience.

For remote access to work we need to be more like Amazon.  We need reviews, cover art, and access to at least some of the book, such as tables of contents.  Lots of people have noticed this and there have been a number of proposals.  Pauline Atherton Cochrane was talking about this in the 1970s, but we don't seem to be making much progress.

Our audience is more of a problem.  People used to have to use the catalog to find something, and many of them would eventually get past the initial trauma of bibliographic control and learn to appreciate what we were doing to/for them.  That's not true now: people have other options for finding information, so more and more people never make that jump.

I'm convinced we can do much better, and that we'll need to if we want to reverse the 'market share' loss we are starting to see.

--Th


Quick searching -- the movie

When the new version of Google Desktop came out a couple of weeks ago I thought it would be interesting to see whether there was some way a bibliographic file could be integrated with it for quick searching, somewhat akin to how Art Rhyno is trying to make his library's catalog part of the PC environment.

I took the 3,000,000 records we exposed to Yahoo (each of which is the most commonly held record for one of the top three million works in WorldCat) and indexed all of the common phrases (up to five words) in the author, title, and subject headings.  This results in an index that can fit into main memory, along with a short citation from each record.  This runs under a Web server, and the results are displayed as you type.  The citations link into Open WorldCat, so you can locate the item and see subject headings and other editions.
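A minimal sketch of such a phrase index follows.  It is an illustration only: it indexes every phrase of up to five words rather than just the common ones, the tokenizer is a simple word-character match, and the record IDs and headings are made up.

```python
import bisect
import re
from collections import defaultdict

MAX_WORDS = 5

def phrases(text, max_words=MAX_WORDS):
    """Yield every run of 1..max_words consecutive words in text."""
    words = re.findall(r"\w+", text.lower())
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_words, len(words)) + 1):
            yield " ".join(words[start:end])

def build_index(records):
    """records: dict of record_id -> list of author/title/subject headings."""
    postings = defaultdict(set)
    for record_id, headings in records.items():
        for heading in headings:
            for phrase in phrases(heading):
                postings[phrase].add(record_id)
    # A sorted key list lets a binary search find everything with a given prefix.
    return sorted(postings), postings

def lookup(index, typed, limit=10):
    """Return record IDs whose indexed phrases start with what was typed."""
    keys, postings = index
    typed = " ".join(re.findall(r"\w+", typed.lower()))
    found = []
    pos = bisect.bisect_left(keys, typed)
    while pos < len(keys) and keys[pos].startswith(typed) and len(found) < limit:
        for record_id in sorted(postings[keys[pos]]):
            if record_id not in found:
                found.append(record_id)
        pos += 1
    return found[:limit]

# Hypothetical records; the real index covers three million of them.
records = {
    "r1": ["Trump, Donald", "Trump: the art of the deal"],
    "r2": ["Shakespeare, William", "Macbeth"],
}
index = build_index(records)
print(lookup(index, "mac"), lookup(index, "the art"))
```

Each keystroke costs one binary search plus a short scan, which is what makes as-you-type display plausible when the whole index sits in memory.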

I didn't really get to any integration with Google Desktop, as I started exploring the possibilities of quick response from a web server.  The index is quite compact, so integrating the indexes directly is still a possibility, although a fast web service would probably be easier to integrate.

We don't have a public version of this yet, so I made a movie using Windows Media Encoder 9 (free from Microsoft, and I only had to reboot my machine 6 times before I got it all installed and working).  It probably should be slower and really needs a voice-over explaining what is happening, but I thought I'd put it up here and get some reactions.  The format is Windows Media Video (WMV).  I hope that doesn't stop too many people from viewing it.  Suggestions for more appropriate formats are welcome.

We're not sure where this approach to bibliographic search is going, but I find the almost instantaneous feedback to be so compelling that I can't stop working on it.

Thanks to Lance Osborne who helped with screen design (well, he really did the whole thing).  Also thanks to Cliff Snyder who wrote the JavaScript which watches for pauses in typing.

--Th

FRBR and thematic indexes


Many bibliographic records about music contain a thematic index number or code that helps identify the piece described.  The best known of these are the 'Köchel numbers' (usually abbreviated K.) for Mozart's compositions.  We recently spent a few minutes looking into whether recognizing these numbers would help in our FRBR work-set algorithm.

In WorldCat we found 41 uniform title fields (MARC tag 240) that contained a mention of K. 594.  Here's the list, preceded by the count of the number of times each of the fields was found:

  • 18 =240  10$aAdagio und Allegro,$mmusical clock,$nK. 594,$rF minor;$oarr.
  • 7 =240  10$aAdagio und Allegro,$mmechanical organ,$nK. 594,$rF minor;$oarr.
  • 5 =240  10$aAdagio und Allegro,$mmusical clock,$nK. 594,$rF minor
  • 2 =240  10$aAdagio und allegro,$mmechanical organ,$nK. 594,$rF minor;$oarr.
  • 2 =240  10$aAdagio and Allegro,$mmechanical organ,$nK.594,$rF minor;$oarr.
  • 1 =240  10$aAdagio und Allegro, $mmusical clock,$nK. 594;$oarr.
  • 1 =240  10$aAdagio and allegro,$mmechanical organ,$nK.594,$rF minor;$oarr.
  • 1 =240  10$aMinuets,$mpiano,$nK. 594a,$rD major;$oarr.
  • 1 =240  10$aFantasia,$mmusical clock,$nK. 594,$rF minor;$oarr.
  • 1 =240  10$aFantasien,$nK594,$rF minor;$oarr.
  • 1 =240  10$aAdagio & allegro,$mmechanical organ,$nK. 594,$rF minor;$oarr.
  • 1 =240  10$aAdagio und Allegro,$mmechanical clock,$nK. 594,$rF minor;$oarr.

Currently, the algorithm uses the full NACO-normalized text of subfields amnpr of the 240 field.  Minor variations, such as & or und for and, will separate records into different work-sets.  If I'm counting right, I found 7 different titles in the list above (including the K. 594a, which is probably a different work).  If we ignored everything but the thematic index when we found it, the list would collapse to two (K. 594 and K. 594a), which seems a lot closer to correct.
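The effect can be tried on the twelve field variants listed above.  The sketch below is not the work-set algorithm itself: `normalize` is a crude stand-in for NACO normalization, the Köchel pattern is a guess at what would need to be recognized, and the key ignores the author entirely.

```python
import re

# The 240 fields listed above, as (count, field) pairs.
FIELDS = [
    (18, "10$aAdagio und Allegro,$mmusical clock,$nK. 594,$rF minor;$oarr."),
    (7, "10$aAdagio und Allegro,$mmechanical organ,$nK. 594,$rF minor;$oarr."),
    (5, "10$aAdagio und Allegro,$mmusical clock,$nK. 594,$rF minor"),
    (2, "10$aAdagio und allegro,$mmechanical organ,$nK. 594,$rF minor;$oarr."),
    (2, "10$aAdagio and Allegro,$mmechanical organ,$nK.594,$rF minor;$oarr."),
    (1, "10$aAdagio und Allegro, $mmusical clock,$nK. 594;$oarr."),
    (1, "10$aAdagio and allegro,$mmechanical organ,$nK.594,$rF minor;$oarr."),
    (1, "10$aMinuets,$mpiano,$nK. 594a,$rD major;$oarr."),
    (1, "10$aFantasia,$mmusical clock,$nK. 594,$rF minor;$oarr."),
    (1, "10$aFantasien,$nK594,$rF minor;$oarr."),
    (1, "10$aAdagio & allegro,$mmechanical organ,$nK. 594,$rF minor;$oarr."),
    (1, "10$aAdagio und Allegro,$mmechanical clock,$nK. 594,$rF minor;$oarr."),
]

# Hypothetical pattern for Koechel numbers such as "K. 594", "K.594", "K594a".
KOCHEL = re.compile(r"\bK\.?\s*(\d+[a-z]?)\b")

def subfields(field):
    """Split a '$'-delimited field into (code, value) pairs, skipping indicators."""
    return [(part[0], part[1:]) for part in field.split("$")[1:]]

def normalize(text):
    """Lowercase and collapse punctuation; a rough substitute for NACO rules."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def text_key(field):
    """Key built from the normalized text of subfields a, m, n, p, and r."""
    return normalize(" ".join(v for c, v in subfields(field) if c in "amnpr"))

def thematic_key(field):
    """Use the thematic index number alone when $n contains one."""
    for code, value in subfields(field):
        if code == "n":
            match = KOCHEL.search(value)
            if match:
                return "k" + match.group(1)
    return text_key(field)

text_keys = {text_key(field) for _, field in FIELDS}
thematic_keys = {thematic_key(field) for _, field in FIELDS}
print(len(text_keys), sorted(thematic_keys))
```

With this simplified normalizer the text keys stay split into several groups, while the thematic keys reduce to just `k594` and `k594a`.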

We’ll need to look at this some more before we make any changes in our algorithm.

Thanks to Eric Childress who's been suggesting this for some time, Jenny Toves who pulled the data, and Jay Weitz, cataloging expert, who looked at some of this.

--Th

Update 2005-09-08: The 'thematic index number' link now points to a 2002 version (rather than the '96 version).  Both were prepared by Lois Kuyper-Rushing.

Trump and more trump

Should cataloging change to make it easier to group bibliographic records properly?

Sathnam Sanghera, writing in the Financial Times, has an article today about the ubiquity of Donald Trump:

After Trump Tower, Trump Parc, Trump Place, Trump Palace, Trump Plaza, Trump International, Trump National, Trump Marina, Trump Taj Mahal, Trump World, Trump Ice and "Donald Trump, The Fragrance", it was inevitable, I suppose, that eventually we’d get Trump University.

Another thing that Mr. Trump has done is to put his name in front of the titles of a series of business books.  Here's a screen shot from an experimental interface to Open WorldCat after typing 'trump':


Unfortunately, each of the books in Trump's series is typically cataloged with just Trump as the short title (245 $a).  There is also an author/title authority record for Trump, Donald, ‡d 1946- ‡t Trump. ‡l Russian.  This is virtually the same pattern we see for works like Shakespeare's Macbeth, which allows our FRBR work-set algorithm to ignore the wide variety of subtitles and pull many of the Macbeths together as a single work.  Trump's books, however, are clearly different works and should not be brought together.

The problem, from the algorithm's viewpoint, is that the main author and short title are the same for these different works.  This causes other problems as well.  For example, when searching for Donald Trump in Connexion, OCLC's cataloging interface, you get a long list of records, many of which are identified only by the title Trump:.

Here's a list of things we might do:

  1. Always use the full title for matching.  This would mistakenly break up many thousands of works.
  2. Use the full title if the short title is a single word.  This would break 'Macbeth'.
  3. Have a table of over-rides so we can ignore that particular author/title authority record.  This might be the easiest to implement, but seems sort of messy.
  4. Add a rule that when the short title ends in ':', use the full title.  This might work, but we'd have to try it and see what matches it affects.  Also, punctuation use is notoriously inconsistent.
  5. Get catalogers to ensure that the main entry/title combination is relatively unique to the work.  I think this has some merit, but sounds like an uphill battle.
  6. Add authority records for each book in the series.  This would work and is consistent with the idea of extending the authority file to control the grouping of works.  It would require a fair amount of work, though.
  7. Allow people to note problems like this in the (forthcoming) WorldCat Wiki, letting them override the machine groupings.
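To get a feel for option 4, here is a small sketch of a matching key that falls back to the full title when the short title ends in a colon.  The function name, the normalization, and the sample records are all illustrative assumptions, not the work-set algorithm itself.

```python
import re

def normalize(text):
    """Lowercase and collapse punctuation (a rough stand-in for real normalization)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def match_key(author, short_title, remainder=""):
    """Build an author/title grouping key.

    short_title plays the role of 245 $a and remainder the rest of the title.
    When the short title ends in ':', the full title is used instead (option 4).
    """
    if short_title.rstrip().endswith(":"):
        title = f"{short_title} {remainder}"
    else:
        title = short_title
    return (normalize(author), normalize(title))

# Illustrative records only, not taken from WorldCat.
records = [
    ("Trump, Donald", "Trump :", "the art of the deal"),
    ("Trump, Donald", "Trump :", "how to get rich"),
    ("Shakespeare, William", "Macbeth", ""),
    ("Shakespeare, William", "Macbeth :", "a tragedy"),
]
keys = {match_key(*record) for record in records}
for key in sorted(keys):
    print(key)
```

The two Trump titles end up with different keys, as intended, but so do the two Macbeth records, which shows why the rule would need testing against real data before adoption.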

Anyone think of other approaches?

--Th

