Jenny and I have been looking at differences between the WorldCat.org FRBR clustering and the clustering we do here in Research. Ideally we'd like them to be the same, but we keep fussing with ours in Research, so we knew there would be differences.
Comparing two large sets of clusters isn't all that straightforward. With two implementations, sometimes the clusters are the same, sometimes one is just larger than the other, and very commonly each cluster has records that the corresponding cluster doesn't, so several clusters may be involved.
One of the first things we looked at was clusters from WorldCat.org that were larger than ours. We noticed several differences that imply different title processing, one of which was how leading articles are handled. In general, when comparing a title that has a skip indicator (see note below), we drop the number of characters given by the indicator before comparing titles. In addition we drop 'the' and 'an', two common English articles. And yes, we know 'the' isn't an article in French; we still had to do it. We don't automatically skip 'a' because it is too commonly not an article (e.g. 'A A A'). The skip indicators in WorldCat are pretty reliable, but recognizing common articles at the beginning of titles that have not been manually skipped is still worth the trouble. Comparing the clusters, we noticed at least one case where it looked like WorldCat.org had made a title match by dropping 'die' from a title, so we wondered if we could improve our handling of leading articles.
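To make the current rule concrete, here is a minimal Python sketch of it. This is not our actual code: the function name `comparison_title` and the constant `COMMON_ARTICLES` are made up for illustration, and it assumes the article check is only applied when the skip indicator is zero.

```python
# Articles dropped when no skip indicator is set.
# 'a' is deliberately left out: it is too often not an article ('A A A').
COMMON_ARTICLES = ("the", "an")


def comparison_title(title: str, nonfiling: int = 0) -> str:
    """Return a lowercased title with its leading article removed.

    nonfiling is the MARC skip indicator: the number of leading
    characters to drop. Assumption: when it is zero we fall back
    to checking for a common English article.
    """
    if nonfiling > 0:
        stripped = title[nonfiling:]
    else:
        stripped = title
        lowered = title.lower()
        for article in COMMON_ARTICLES:
            prefix = article + " "
            # Require the trailing space so 'Theory' is not treated as 'the'.
            if lowered.startswith(prefix):
                stripped = title[len(prefix):]
                break
    return stripped.strip().lower()
```

With this sketch, 'The Hobbit' compares as 'hobbit' whether or not the indicator was set, while 'A A A' and 'Theory of games' are left alone.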
So Jenny ran a scan of some WorldCat records counting all the text people have said to skip in the title fields we are interested in (245, 242, 130, and 740). Here's a table showing the patterns that occurred more than 30,000 times in a 1/10 sample of WorldCat:
783,372 the
257,879 a
153,742 die
101,760 la
74,830 an
62,483 der
55,712 le
50,348 das
46,585 les
43,721 de
42,217 l
32,548 el
That looked promising. It would be nice to use a table derived from WorldCat to control this rather than an ad hoc table. Next we looked at combining the language of the manifestation with the text skipped:
740,712 the (eng)
233,475 a (eng)
131,703 die (ger)
71,963 an (eng)
54,740 der (ger)
49,229 la (fre)
47,127 le (fre)
44,690 das (ger)
43,046 les (fre)
37,745 de (dut)
32,580 l (fre)
31,066 la (spa)
30,494 el (spa)
Now this starts to look like a useful list. We're going to try using it and see if it helps. Again, we can't really use 'a' in English, and probably not 'l' in French, so we'll probably just throw out the single-letter values. To give you some idea of the number of titles this would affect, with the list above (less 'a') we found about 12,000 titles changed in the 10% WorldCat sample, including 4,522 the (eng), 826 el (spa), and 782 le (fre). So, around 120,000 titles would be affected in all of WorldCat. I'd class that as a minor improvement, but probably worth doing.
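Here is roughly how the list above could drive the matching, as a minimal Python sketch. The names `SKIP_TABLE` and `drop_leading_article` are hypothetical, the table holds only the multi-letter values from the list, and the whole thing is an illustration rather than what we will actually run.

```python
# Language-specific leading articles, taken from the language-qualified
# list above with the single-letter values ('a', 'l') thrown out.
SKIP_TABLE = {
    "eng": {"the", "an"},
    "ger": {"die", "der", "das"},
    "fre": {"la", "le", "les"},
    "dut": {"de"},
    "spa": {"la", "el"},
}


def drop_leading_article(title: str, language: str) -> str:
    """Drop the first word of title if it is a known article for language."""
    articles = SKIP_TABLE.get(language, set())
    first, sep, rest = title.partition(" ")
    # Only drop when something follows the article.
    if sep and first.lower() in articles:
        return rest.lstrip()
    return title
```

Keying the table on language is what keeps 'Die hard' (eng) intact while still dropping 'Die' from a German title.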
Of course, it's interesting to look at some of the less commonly skipped strings:
euro 3 times
of the 4 times
1:00 AM 3 times
plus lots and lots of numbers: cardinal, ordinal, and Roman.
--Th
Notes: For those not familiar with MARC21, some title fields carry a single-digit indicator that tells how many characters should be dropped from the beginning of the title for sorting (called nonfiling characters in librarian).
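For the curious, the kind of scan behind the counts above can be sketched in a few lines of Python. This is only an illustration: the `(title, nonfiling, language)` tuples stand in for real MARC records, the function names are made up, and trimming spaces and apostrophes from the skipped text is an assumption about how the strings were normalized.

```python
from collections import Counter


def skipped_text(title: str, nonfiling: int) -> str:
    """Return the text the nonfiling indicator says to skip, normalized."""
    # Assumption: trailing spaces and apostrophes are trimmed, so the
    # two skipped characters of "L'homme" are counted as 'l'.
    return title[:nonfiling].lower().strip(" '")


def count_skipped(records):
    """Tally skipped text by language over (title, nonfiling, language) tuples."""
    counts = Counter()
    for title, nonfiling, language in records:
        if nonfiling > 0:
            counts[(skipped_text(title, nonfiling), language)] += 1
    return counts
```

Calling `count_skipped(records).most_common()` on such tuples gives a list ordered like the tables above.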
The three-letter language codes are defined in the MARC21 documentation. It would be nice if they were the same as in ISO 639-2, Codes for the representation of names of languages, Part 2: alpha-3 code, but nothing is quite that simple, even though LC maintains both.
Update (Feb 6, 2008). Rebecca Guenther at LC pointed out that my note above about ISO 639-2 isn't correct. Here is her explanation: The language codes in ISO 639-2/B are identical to those in MARC. There are 22 languages that have these alternative codes, called ISO 639-2/B and ISO 639-2/T ... . All the other languages in 639-2 are the same. So MARC is the same as 639-2 and in the cases of languages with alternative codes the MARC ones are the 639-2/B codes. The language names are not the same, but that is not what is being standardized-- it is the codes. In other words you don't need the MARC documentation to apply the 639-2 codes, you just need to use the set that is 639-2/B in these 22 cases.
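In practice, that means data coded with the 639-2/T variants only needs a small lookup to line up with MARC. Here is a minimal Python sketch; the names `T_TO_B` and `to_marc_code` are made up, and the table deliberately lists only the three languages from this post that have alternative codes, not the full set.

```python
# ISO 639-2/T codes mapped to their ISO 639-2/B (MARC) equivalents.
# Partial table: only languages mentioned in this post are included.
T_TO_B = {
    "fra": "fre",  # French
    "deu": "ger",  # German
    "nld": "dut",  # Dutch
}


def to_marc_code(code: str) -> str:
    """Return the MARC (639-2/B) form of a three-letter language code."""
    code = code.lower()
    # Codes without a /T alternative are already the same in both sets.
    return T_TO_B.get(code, code)
```

Codes such as 'eng' and 'spa' pass through unchanged, since they have no alternative form.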