
What people skip

Jenny and I have been looking at differences between the WorldCat.org FRBR clustering and the clustering we do here in Research.  Ideally we'd like them to be the same, but we keep fussing with ours in Research, so we knew there would be differences.

Comparing two large sets of clusters isn't all that straightforward.  With two implementations, sometimes the clusters are the same, sometimes one is simply larger than the other, and very commonly each cluster has records that the corresponding cluster lacks, so several clusters may be involved.

One of the first things we looked at was clusters from WorldCat.org that were larger than ours.  We have noticed several differences that imply some different title processing, one of which was a difference in how leading articles are handled.  In general, when comparing a title that has a skip indicator (see note below) we drop the number of characters given by the indicator before comparing titles.  In addition we drop 'the' and 'an', two common English articles.  And yes we know 'the' isn't an article in French; we still had to do it.  We don't automatically skip 'a' because that is too commonly not an article (e.g. 'A A A').  The skip indicators in WorldCat are pretty reliable, but recognizing common articles at the beginning of titles that have not been manually skipped is still worth the trouble.  Comparing the clusters, we noticed at least one case where it looked like WorldCat.org made a title match by dropping 'die' from a title, so we wondered if we could improve our handling of leading articles.
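The title handling described above could be sketched roughly as follows.  This is only an illustration, not our production code: the function name is made up, and it assumes the fallback articles are tried only when the skip indicator is zero.

```python
# Articles dropped even without a skip indicator; 'a' is deliberately left out
# because it is too often not an article (e.g. 'A A A').
FALLBACK_ARTICLES = ("the", "an")


def strip_leading_article(title: str, nonfiling: int = 0) -> str:
    """Return the title with its leading article removed, for comparison only.

    `nonfiling` is the skip indicator: the number of leading characters
    the cataloger said to ignore.  (Hypothetical helper, for illustration.)
    """
    if nonfiling > 0:
        # Trust the cataloger and drop exactly the indicated characters.
        return title[nonfiling:].lstrip()
    lowered = title.lower()
    for article in FALLBACK_ARTICLES:
        # Require a following space so 'Theory' and 'Another' are untouched.
        if lowered.startswith(article + " "):
            return title[len(article) + 1:].lstrip()
    return title
```

With this sketch, 'The hobbit' compares as 'hobbit' whether or not the indicator was set, while 'A A A' is left alone.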

So Jenny ran a scan of some WorldCat records counting all the text people have said to skip in the title fields we are interested in (245, 242, 130, and 740).  Here's a table showing the patterns that occurred more than 30,000 times in a 1/10 sample of WorldCat:

783,372    the
257,879    a
153,742    die
101,760    la
74,830    an
62,483    der
55,712    le
50,348    das
46,585    les
43,721    de
42,217    l
32,548    el
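A scan like the one that produced these counts might look something like this.  It is a sketch under assumptions: the input is taken to be (title, skip indicator) pairs already pulled from the title fields, the names are invented, and trimming spaces and apostrophes is a guess at how a skipped "L'" ends up counted as "l".

```python
from collections import Counter


def skipped_text(title: str, nonfiling: int) -> str:
    """Return the normalized text covered by the skip indicator."""
    # Strip surrounding spaces and apostrophes, then lowercase (assumed rule).
    return title[:nonfiling].strip(" '").lower()


def count_skipped(fields):
    """Count skipped strings over an iterable of (title, nonfiling) pairs."""
    counts = Counter()
    for title, nonfiling in fields:
        if nonfiling > 0:  # an indicator of 0 means nothing was skipped
            counts[skipped_text(title, nonfiling)] += 1
    return counts
```

Calling `count_skipped(pairs).most_common()` would then give a frequency-ordered list in the same shape as the table.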

That looked promising.  It would be nice to use a table derived from WorldCat to control this rather than an ad hoc table.  Next we looked at combining the language of the manifestation with the text skipped:

740,712    the (eng)
233,475    a (eng)
131,703    die (ger)
71,963    an (eng)
54,740    der (ger)
49,229    la (fre)
47,127    le (fre)
44,690    das (ger)
43,046    les (fre)
37,745    de (dut)
32,580    l (fre)
31,066    la (spa)
30,494    el (spa)

Now this starts to look like a useful list.  We're going to try using it and see if it helps.  Again, we can't really use 'a' in English, and probably not 'l' in French, so we'll probably just throw out the single letter values.  To give you some idea of the number of titles this would affect, with the list above (less 'a') we found about 12,000 titles changed in the 10% WorldCat sample, including 4,522 the (eng), 826 el (spa), and 782 le (fre).  So, around 120,000 titles would be affected in all of WorldCat.  I'd class that as a minor improvement, but probably worth doing.
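Here is a rough sketch of how a language-keyed table could drive the article stripping.  The entries come from the list above with the single-letter values thrown out; the names and the matching rule (article followed by a space) are assumptions for illustration.

```python
# Articles by MARC language code, taken from the table above (less 'a' and 'l').
ARTICLES_BY_LANGUAGE = {
    "eng": ("the", "an"),
    "ger": ("die", "der", "das"),
    "fre": ("la", "le", "les"),
    "dut": ("de",),
    "spa": ("la", "el"),
}


def strip_article_for_language(title: str, language: str) -> str:
    """Drop a leading article only if it is one in the record's language."""
    lowered = title.lower()
    for article in ARTICLES_BY_LANGUAGE.get(language, ()):
        if lowered.startswith(article + " "):
            return title[len(article) + 1:].lstrip()
    return title
```

Keying on language is what keeps 'Die hard' (eng) intact while 'Die Zauberflöte' (ger) loses its article.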

Of course, it's interesting to look at some of the less commonly skipped strings:

euro 3 times
of the 4 times
1:00 AM 3 times

plus lots and lots of numbers: cardinal, ordinal, and Roman.

--Th

Notes:  For those not familiar with MARC21, some title fields carry along a single-digit indicator which tells how many characters should be dropped from the beginning of the title for sorting (called nonfiling characters in librarian).

The three-letter language codes are defined in the MARC21 documentation. It would be nice if they were the same as in ISO 639-2, Codes for the representation of names of languages, Part 2: Alpha-3 code, but nothing is quite that simple, even though LC maintains both.

Update (Feb 6, 2008).  Rebecca Guenther at LC pointed out that my note above about ISO 639-2 isn't correct.  Here is her explanation: The language codes in ISO 639-2/B are identical to those in MARC. There are 22 languages that have these alternative codes, called ISO 639-2/B and ISO 639-2/T ... . All the other languages in 639-2 are the same. So MARC is the same as 639-2 and in the cases of languages with alternative codes the MARC ones are the 639-2/B codes. The language names are not the same, but that is not what is being standardized-- it is the codes. In other words you don't need the MARC documentation to apply the 639-2 codes, you just need to use the set that is 639-2/B in these 22 cases.
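In practice that means data carrying the 639-2/T form only needs a small lookup to line up with MARC.  A minimal sketch, with a deliberately partial table (only a handful of the pairs are listed, and the function name is made up):

```python
# Partial map from ISO 639-2/T codes to the 639-2/B codes that MARC uses.
# Only a few of the languages with alternative codes are shown here.
T_TO_B = {
    "deu": "ger",  # German
    "fra": "fre",  # French
    "nld": "dut",  # Dutch
    "zho": "chi",  # Chinese
    "ces": "cze",  # Czech
    "ell": "gre",  # Greek, Modern
}


def to_marc_language_code(code: str) -> str:
    """Return the MARC (639-2/B) form of a three-letter language code."""
    code = code.lower()
    # Codes without a T/B alternative are already the same in both sets.
    return T_TO_B.get(code, code)
```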

Looking at ISBD

It's quiet here in OCLC Research with several of my colleagues at ALA, and the topic of ISBD Punctuation has come up several times recently, so I thought it would be interesting to learn more about it.  I've been aware of ISBD since it was first issued in the 1970's.  I remember the series of pamphlets describing it, but only recently realized how entrenched it is in AACR2 with the discussions about its place in RDA.  From my point of view it mostly gets in the way of processing bibliographic information, so I've never been particularly enamored of it.

Update (January 28, 2008): Please forgive my confusion about what is and what isn't ISBD punctuation (see comments).  I suspect I'm not alone in this, but should have known better.  In fact, I think some of the principles of ISBD punctuation (such as leaving double punctuation in) leak into many headings that aren't part of ISBD. In particular, the 110 field is not covered by ISBD.

I suspect that part of my problem with ISBD has been a lack of understanding of the whole relationship between ISBD, AACR2 and MARC21, a lack I suspect many of us have.  Of course I don't actually have to catalog anything, I just try to cope with what catalogers (or would-be catalogers) generate and try to make sense of it.

Although I've never processed records in UKMARC, I've always heard that it avoided ISBD punctuation.  Looking into that I found that UKMARC does have ISBD punctuation, although it eschews it at subfield boundaries (an approach now used by MODS).  Now ISBD punctuation doesn't get in the way too badly if what you want to do with the data is a card-like metadata display.  In fact it is a benefit since the punctuation gives you clues about the meaning of the text even if you can't read the text itself, which can be helpful.  But the minute you want to do something slightly different with the data, the punctuation gets in the way.  Is that period (i.e. full-stop or point) at the end of the text there because of ISBD or because the word before it is an abbreviation?  What were our experts thinking about when they decided to prescribe ambiguous punctuation?

Well, of course they didn't really decide that, and the punctuation looks fairly unambiguous if done correctly.  The trick is to double up, as described in the Preliminary Consolidated Edition of the rules:

0.3.2.7 When an element or area ends with a point and the prescribed punctuation for the element or area that follows begins with a point, in order to take into account punctuation for both abbreviations and prescribed punctuation, both points are given.

Although I also found in ISBD(ER):

0.4.7 When an element ends with a point and the prescribed punctuation for the element which follows begins with a point, only one of the two points is given.

    e.g. 3rd ed. -
    not 3rd ed.. -

    And then ... - 4th ed.
    not And then .... - 4th ed.

After a couple of readings, that started to make sense.  This doubling seems to always have been there, as I found similar language in the 1974 ISBD(M):

When other punctuation is included, the prescribed punctuation is given, even though this may result in double punctuation.

And, in fact, double punctuation is often found in bibliographic records.  Unfortunately, not regularly enough that you can depend on it.

As an exercise I extracted the 9.3 million 110 (corporate author) fields from WorldCat and tried to eliminate the ISBD punctuation.

I aimed for something like 99.5% accuracy, so that only 5 headings out of 1,000 would have either ISBD punctuation left on or non-ISBD punctuation taken off.  With a table of around 60 abbreviations to look for (found by looking at abbreviations that inappropriately got their trailing period deleted) and a few simple rules, it's not that hard to get pretty close to that for the 110's.
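To give a flavor of the approach, here is a minimal sketch of the kind of rule involved.  It is not the program I ran: the abbreviation list is a short made-up stand-in for the real table, and the rules about doubled points, ellipses and initialisms are assumptions.

```python
# Hypothetical stand-in for the abbreviation table (the real one had about 60).
ABBREVIATIONS = {"dept", "inc", "co", "ltd", "corp", "assn"}


def strip_trailing_point(subfield: str) -> str:
    """Remove a final point that looks like prescribed punctuation."""
    text = subfield.rstrip()
    if not text.endswith(".") or text.endswith("..."):
        # No final point, or an ellipsis that belongs to the text.
        return text
    if text.endswith(".."):
        # Doubled point: abbreviation plus prescribed point, so drop one.
        return text[:-1]
    words = text[:-1].split()
    last_word = words[-1].lower() if words else ""
    # Keep the point after a known abbreviation or an initialism like 'U.S.'.
    if last_word in ABBREVIATIONS or "." in last_word:
        return text
    return text[:-1]
```

So 'General Motors Corporation.' loses its point, while 'Treasury Dept.' and 'U.S.' keep theirs.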

One of the main problems with ISBD punctuation is that it is applied inconsistently.  Combine that with a syntax that is hard to distinguish from normal text and you are guaranteed cataloging that is difficult to process.  In those 9.3 million corporate authors I found some 700,000 subfields that might have had an ISBD 'point' at their end, but my program's best guess was that it actually belonged to an abbreviation at the end of the subfield.

ISBD is certainly an accomplishment in its influence on cataloging practices around the world, and we all benefit by the increase in consistency it undoubtedly brings.  The idea of mixing standard punctuation in with coded and uncoded data, though, is dubious at best, and at worst ends up with some very strange displays of bibliographic data.

For FRBR processing, what I'd really like to do is parse the parallel title information embedded in title fields with a complicated set of ISBD colons, slashes and equal signs.  A preliminary look at the 1.5 million 245 fields in WorldCat with equal signs in them is enough to scare almost anyone, but I might give it a try since it may substantially help matching titles of some classes of material.
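A first, naive cut at that parsing might look like the sketch below.  It assumes the whole title statement is available as one string with well-formed ISBD spacing (space, mark, space), which real records often violate, and the function and key names are invented.

```python
def split_title_statement(statement: str) -> dict:
    """Naively split a title statement on ISBD ' / ', ' = ' and ' : '."""
    # Everything after the first ' / ' is treated as the statement of responsibility.
    titles_part, _, responsibility = statement.partition(" / ")
    titles = []
    # Each ' = ' introduces a parallel title.
    for chunk in titles_part.split(" = "):
        # Within a title, the first ' : ' introduces other title information.
        proper, _, other = chunk.partition(" : ")
        titles.append({"title": proper.strip(), "other": other.strip()})
    return {"titles": titles, "responsibility": responsibility.strip()}
```

For 'Road map of France = Carte routière de la France / Michelin' this yields two titles and 'Michelin' as the responsibility; anything less regular than that is where the scary part starts.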

--Th
