Occasionally we extract statistics from WorldCat about the usage of MARC fields. The earliest published version is dated 1981, but there was at least one internal version before that. The program has undergone several translations as our computer systems have evolved. As the database grew, the runtime of the latest Python version was approaching a week, but Jenny Toves has reconfigured it so that it now runs in parallel in less than an hour. The final output is an Excel spreadsheet, which makes it a bit easier to look at the numbers.
Many of the figures in the spreadsheet are weighted by library holdings. The weighted numbers give a better idea of how fields are used in a typical library than the unweighted numbers, and the two can differ significantly. Record lengths were calculated as if each record were stored in MARC Communications Format using UTF-8 for the character set.
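To make the record-length figure concrete, here is a minimal sketch of how such a length could be computed. It is not the program we actually run; it assumes the standard MARC communications (ISO 2709) layout of a 24-byte leader, 12-byte directory entries, and one-byte field and record terminators. The function name and the tuple shapes used for the fields are hypothetical.

```python
LEADER_LEN = 24      # fixed-length leader
DIR_ENTRY_LEN = 12   # tag (3) + field length (4) + starting position (5)
TERMINATOR_LEN = 1   # field terminator and record terminator are one byte each


def marc_record_length(control_fields, data_fields):
    """Estimate the byte length of a record in MARC communications format.

    control_fields: list of (tag, value) pairs (hypothetical shape)
    data_fields: list of (tag, ind1, ind2, [(code, value), ...]) (hypothetical shape)
    """
    n_fields = len(control_fields) + len(data_fields)
    # Leader, then the directory, which ends with a field terminator.
    length = LEADER_LEN + DIR_ENTRY_LEN * n_fields + TERMINATOR_LEN
    for _tag, value in control_fields:
        length += len(value.encode("utf-8")) + TERMINATOR_LEN
    for _tag, _ind1, _ind2, subfields in data_fields:
        length += 2  # two indicator bytes
        for _code, value in subfields:
            # subfield delimiter + subfield code + UTF-8 data
            length += 2 + len(value.encode("utf-8"))
        length += TERMINATOR_LEN
    return length + TERMINATOR_LEN  # record terminator


sample_length = marc_record_length(
    [("001", "123")],
    [("245", "1", "0", [("a", "Caf\u00e9")])],
)
print(sample_length)
```

Measuring the UTF-8 encoding rather than the character count matters for non-ASCII data: the accented character in the sample takes two bytes.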
The Excel file has a Summary table, a table for all the records, and then separate tables for nine different formats (maps, serials, etc.). Here's a brief explanation of the column headings in the file:
- tag -- MARC21 tag
- occ -- number of unweighted occurrences of the field
- prec -- percent of records that have that field
- wtocc -- weighted occurrences
- wprec -- weighted percentage of records with the field
- occRec -- occurrences/record (unweighted)
- lenOcc -- length/occurrence (unweighted)
- sub -- MARC21 subfield code
- subocc -- number of unweighted occurrences of the subfield
- subwtocc -- weighted occurrences of the subfield
- suboccRec -- subfield occurrences/record (unweighted)
- sublenOcc -- subfield length/occurrence (unweighted)
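As a rough illustration of how the field-level columns relate to each other, here is a small sketch. It is not the production code, and several things in it are assumptions: that weighting means each record counts once per library holding, that occRec divides by the number of records containing the field, and the input shape of (holdings, [(tag, length), ...]) per record. The function name is hypothetical.

```python
from collections import defaultdict


def field_stats(records):
    """Compute per-tag statistics from (holdings, [(tag, length), ...]) records."""
    total_recs = 0
    total_holdings = 0
    acc = defaultdict(lambda: {"occ": 0, "recs": 0, "wtocc": 0, "wrecs": 0, "len": 0})
    for holdings, fields in records:
        total_recs += 1
        total_holdings += holdings
        seen = set()
        for tag, length in fields:
            a = acc[tag]
            a["occ"] += 1            # every occurrence counts once
            a["wtocc"] += holdings   # assumption: occurrence counted once per holding
            a["len"] += length
            if tag not in seen:      # count each record at most once per tag
                seen.add(tag)
                a["recs"] += 1
                a["wrecs"] += holdings
    rows = {}
    for tag, a in acc.items():
        rows[tag] = {
            "occ": a["occ"],
            "prec": 100.0 * a["recs"] / total_recs,
            "wtocc": a["wtocc"],
            "wprec": 100.0 * a["wrecs"] / total_holdings if total_holdings else 0.0,
            "occRec": a["occ"] / a["recs"],   # assumption: per record having the field
            "lenOcc": a["len"] / a["occ"],
        }
    return rows


# Made-up sample: one widely held record and one held by a single library.
sample = [
    (10, [("245", 40), ("650", 20), ("650", 30)]),
    (1, [("245", 50)]),
]
stats = field_stats(sample)
print(stats["650"])
```

In the made-up sample, the 650 field appears in half of the records but in roughly 91 percent of the holdings, which is the kind of gap between unweighted and weighted figures mentioned above.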
Of course there are lots of different statistics that can be drawn from 80+ million records. Bill Moen analyzed a copy of WorldCat and has published some statistics and conclusions from it (see Assessing Metadata Utilization... and Examining MARC Records as Artifacts...).
--Th