Several of us here at OCLC have spent a good deal of time over the last decade trying to pull bibliographic records into work clusters. Lately we've been making considerable progress along these lines and thought it would be worth sharing some of the results.
Probably our biggest accomplishment is that the work we have done to refine the worksets is now visible in WorldCat.org (as well as in an experimental view of the works as RDF). This is a big step for us, involving a number of people in research, development and production. In addition to making the new work clusters visible in WorldCat, this gives those of us in Research the opportunity to use the same work IDs in other services such as Classify. We also expect to move the production work IDs into services such as WorldCat Identities.
One of the numbers we keep track of is the ratio of records to works. When we first started, the record-to-work ratio was something like 1.2:1; that is, every work cluster averaged 1.2 records. The ratio is now close to 1.6:1, and for the first time the majority of records in WorldCat are in work clusters with other records, primarily because of better matching.
Of records that have at least one match, we find the average workset size is 3.9 records. In terms of holdings, we have 10.6 holdings per workset and over 43 holdings per non-singleton workset (a workset with more than one record). Another way to look at this is that 84% of WorldCat's holdings are in non-singleton worksets, and over 1.5 billion of WorldCat's 2.1 billion holdings are in worksets of 3 or more records, so collecting them together has a big impact on many displays.
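For readers who want to see how figures like these relate to one another, here is a minimal sketch of the arithmetic. It is not our production code: the function name, the data layout (a dictionary mapping a work ID to the holdings count of each record in that workset) and the toy numbers are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorksetStats:
    records_per_work: float                   # e.g. the 1.6:1 ratio
    avg_nonsingleton_size: float              # records per workset, singletons excluded
    holdings_per_workset: float
    holdings_per_nonsingleton_workset: float
    share_of_records_clustered: float         # records that share a workset with another record
    share_of_holdings_in_nonsingletons: float


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when there is nothing to divide by."""
    return numerator / denominator if denominator else 0.0


def workset_stats(worksets):
    """Summarize worksets given as {work_id: [holdings count per record]}.

    This layout is hypothetical; it is only a convenient shape for the example.
    """
    if not worksets:
        raise ValueError("need at least one workset")

    # A non-singleton workset is one with more than one record.
    nonsingletons = [h for h in worksets.values() if len(h) > 1]

    total_records = sum(len(h) for h in worksets.values())
    total_holdings = sum(sum(h) for h in worksets.values())
    clustered_records = sum(len(h) for h in nonsingletons)
    clustered_holdings = sum(sum(h) for h in nonsingletons)

    return WorksetStats(
        records_per_work=_ratio(total_records, len(worksets)),
        avg_nonsingleton_size=_ratio(clustered_records, len(nonsingletons)),
        holdings_per_workset=_ratio(total_holdings, len(worksets)),
        holdings_per_nonsingleton_workset=_ratio(clustered_holdings, len(nonsingletons)),
        share_of_records_clustered=_ratio(clustered_records, total_records),
        share_of_holdings_in_nonsingletons=_ratio(clustered_holdings, total_holdings),
    )


# Toy data only: four invented worksets, two of them singletons.
toy_worksets = {
    "w1": [5, 3, 2],
    "w2": [1],
    "w3": [4, 6],
    "w4": [2],
}
stats = workset_stats(toy_worksets)
print(stats)
```

With the toy data, seven records fall into four worksets, so the record-to-work ratio is 1.75:1, and five of the seven records sit in worksets with other records. The same definitions, applied to all of WorldCat, are what lie behind the figures quoted above.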
As the worksets become larger and more reliable, we are finding many uses for them, not least in improving the work-level clustering itself. We find the clustering helps find variations in names, which in turn helps find title variations. We are also learning how to connect our manifestation- and expression-level clustering with our work-level algorithms, improving both. The Multilingual WorldCat work reported here is also an exciting development growing out of this.
There is still more to do, of course. One of our latest approaches is to build on the Multilingual WorldCat work by creating new authority records in the background that can be used to guide the automated creation of authority records from WorldCat, which in turn help generate better clusters. We are applying this technique first to problem works such as Twain's Adventures of Huckleberry Finn and his Adventures of Tom Sawyer, which are published together so often and cataloged in so many ways that it is difficult to separate the two. These generated title authority records are starting to show up in VIAF as 'xR' records.
So, we've been working on this off and on for a decade, but WorldCat and our computational capabilities have changed dramatically in that time. It still seems like a fresh problem to us as we pull in VIAF to help and use matching techniques that just would not have been feasible a decade ago.
While many of us, both in and out of OCLC Research, have worked on this over the years, no one has done more than Jenny Toves, who both designs and implements the matching code.