The Virtual International Authority File (VIAF) currently has about 28 million entities created by a merge of three dozen authority files from around the world. Here at OCLC we are finding it very useful in controlling names in records. In the linked data world we are beginning to experience, 'controlling' means assigning URIs (or at least identifiers that can easily be converted to URIs) to the entities. Because of ambiguities in VIAF and in the bibliographic records we are matching it to, the process is a bit more complicated than you might imagine. In fact, our first naive attempts at matching were barely usable. Since we know others are attempting to match VIAF to their files, we thought a description of how we go about it would be welcome (of course, if your file consists of bibliographic records and they are already in WorldCat, then we've already done the matching). While a number of people have been involved in refining this process, most of the analysis and code was done by Jenny Toves here in OCLC Research over the last few years.
First some numbers: The 28 million entities in VIAF were derived from 53 million source records and 111 million bibliographic records. Although we do matching to other entities in VIAF, this post is about matching against VIAF's 24 million corporate and personal entities. The file we are matching it to (WorldCat) consists of about 400 million bibliographic records (at least nominally in MARC-21), each of which has been assigned a work identifier before the matching described below. Of the 430 million names in author/contributor (1XX/7XX) fields in WorldCat we are able to match 356 million (or 83%). If those headings were weighted by how many holdings are associated with them, the percentage controlled would be even higher, as names in the more popular records are more likely to have been subjected to authority control somewhere in the world.
It is important to understand the issues raised when pulling together the source files that VIAF is based on. While we claim that better than 99% of the 54 million links that VIAF makes between source records are correct, that does not mean that the resulting clusters are 99% perfect. In fact many of the more common entities represented in VIAF will have not only a 'main' VIAF cluster, but also one or more smaller clusters derived from authority records that we were unable to bring into the main cluster because of missing, duplicated or ambiguous information. Another thing to keep in mind is that any relatively common name that has one or more famous people associated with it can be expected to have some misattributed titles (this is true for even the most carefully curated authority files of any size).
WorldCat has many headings with subfield 0's ($0's) that associate an identifier with the heading. This is very common in records loaded into WorldCat by some national libraries, such as the French and German ones, so one of the first things we do in our matching is look for identifiers in $0's which can be mapped to VIAF. When those mappings are unambiguous we use that VIAF identifier and are done.
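As a rough illustration of the $0 shortcut, here is a minimal sketch. The function name, the lookup table and the identifier strings are all hypothetical, and reading "unambiguous" as "exactly one VIAF cluster across all the $0's on the heading" is an assumption rather than a description of the production code.

```python
from typing import Optional


def viaf_from_subfield_zero(
    zeros: list[str], id_to_viaf: dict[str, set[str]]
) -> Optional[str]:
    """Return a VIAF identifier only when the $0 values point to exactly one cluster.

    `id_to_viaf` is a hypothetical lookup from a source identifier (as it
    appears in $0) to the set of VIAF clusters containing that identifier.
    """
    candidates: set[str] = set()
    for value in zeros:
        # Identifiers we cannot map contribute nothing to the candidate set.
        candidates |= id_to_viaf.get(value.strip(), set())
    # Assumption: anything other than exactly one candidate is treated as ambiguous.
    return next(iter(candidates)) if len(candidates) == 1 else None
```

Headings for which this returns None would simply fall through to the name-based matching.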
The rest of this post is a description of what we do with the names that do not already have a usable identifier associated with them. The main difficulties arise when either there are multiple VIAF clusters that look like good matches or we lack enough information to make a good match (e.g. no title or date match). Since a poor link is often worse than no link at all, we do not make a link unless we are reasonably confident of it.
First we extract information about each name of interest in each of the bibliographic records:
- Normalized name key:
  - Extract subfields $a, $q and $j
  - Expand $a with $q when appropriate
  - Perform enhanced NACO normalization on the name
- $b, $c's, $d, $0's, LCCNs, DDC class numbers, titles, language of cataloging, work identifier
The normalized name key does not include the dates ($d) because they are often not included in the headings in bibliographic records. The $b and $c are so variable, especially across languages, that they are also ignored at this point. The goal is to have a key which will bring together variant forms of the name without pulling too many different entities together. After preliminary matching we do matching with more precision, and $b, $c and $d are used for that.
Similar normalized name keys are generated from the names in VIAF clusters.
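To make the idea of a name key concrete, here is a small sketch. It is not the enhanced NACO normalization we actually use: the normalization below (strip diacritics, lowercase, turn punctuation into spaces) is a simplified stand-in, and the rule for expanding $a with $q (only when the initials agree) is an assumption made for illustration. All function names are hypothetical.

```python
import re
import unicodedata


def normalize_name(text: str) -> str:
    """Simplified stand-in for NACO-style normalization (not the enhanced rules)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, replace punctuation with spaces, collapse runs of whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", stripped.lower())
    return " ".join(cleaned.split())


def expand_with_fuller_form(a: str, q: str) -> str:
    """Replace the forenames in $a with $q when their initials agree (assumed rule)."""
    surname, comma, forenames = a.partition(",")
    fuller = q.strip().strip("()").strip()
    if not comma or not fuller:
        return a
    initials_a = [w[0].lower() for w in re.findall(r"\w+", forenames)]
    initials_q = [w[0].lower() for w in re.findall(r"\w+", fuller)]
    return f"{surname}, {fuller}" if initials_a == initials_q else a


def name_key(subfields: dict[str, str]) -> str:
    """Build a key from $a, $q and $j only; $b, $c and $d are deliberately ignored."""
    name = expand_with_fuller_form(subfields.get("a", ""), subfields.get("q", ""))
    return normalize_name(f"{name} {subfields.get('j', '')}")
```

With this sketch, a heading with $a "Tolkien, J. R. R.", $q "(John Ronald Reuel)" and $d "1892-1973" and a heading with just $a "Tolkien, John Ronald Reuel" produce the same key, which is the kind of grouping the key is meant to achieve.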
When evaluating matches we have a routine that scores the match based on criteria about the names:
- Start out with '0'
  - A negative value implies the names do not match
  - A 0 implies the names are compatible (nothing to indicate they can't represent the same entity), but nothing beyond that
  - Increasing positive values imply increasing confidence in the match
- -1 if dates conflict*
- +1 if a begin or end date matches
- +1 if both begin and end dates match
- +1 if begin and end dates are birth and death dates (as opposed to circa or flourished)
- +1 if there is at least one title match
- +1 if there is at least one LCCN match
- -3 if $b's do not match
- +1 if $c's match
- +1 if DDCs match
- +1 if the match is against a preferred form
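The scoring rules above can be sketched roughly as follows. The `NameEvidence` class and its field names are hypothetical, dates are compared as plain strings (the real date handling is considerably more involved), and several readings are assumptions: a $b mismatch is only penalized when both sides have a $b, and the bonus for birth/death dates is only given when a date actually matched.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameEvidence:
    """Hypothetical container for what we know about one name occurrence."""
    begin: Optional[str] = None
    end: Optional[str] = None
    exact_life_dates: bool = False   # True for birth/death, False for circa/flourished
    b: Optional[str] = None
    c: Optional[str] = None
    titles: frozenset = frozenset()
    lccns: frozenset = frozenset()
    ddcs: frozenset = frozenset()
    preferred_form: bool = False     # meaningful on the VIAF side only


def score_match(bib: NameEvidence, viaf: NameEvidence) -> int:
    """Score a candidate match; negative means the names do not match."""
    score = 0
    begin_match = bib.begin is not None and bib.begin == viaf.begin
    end_match = bib.end is not None and bib.end == viaf.end
    begin_conflict = None not in (bib.begin, viaf.begin) and bib.begin != viaf.begin
    end_conflict = None not in (bib.end, viaf.end) and bib.end != viaf.end
    if begin_conflict or end_conflict:
        score -= 1
    elif begin_match or end_match:
        score += 1
        if begin_match and end_match:
            score += 1
        if bib.exact_life_dates and viaf.exact_life_dates:
            score += 1
    if bib.titles & viaf.titles:
        score += 1
    if bib.lccns & viaf.lccns:
        score += 1
    if bib.b and viaf.b and bib.b != viaf.b:
        score -= 3
    if bib.c and viaf.c and bib.c == viaf.c:
        score += 1
    if bib.ddcs & viaf.ddcs:
        score += 1
    if viaf.preferred_form:
        score += 1
    return score
```

Two records with no dates, titles or other evidence score 0 under this sketch, which corresponds to "compatible, but nothing beyond that".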
Here are the stages we go through. At each stage proceed to the next if the criteria are not met:
- If only one VIAF cluster has the normalized name from the bibliographic record, use that VIAF identifier
- Collapse bibliographic information based on the associated work identifiers so that they can share name dates, $b and $c, LCCN, DDC
- Try to detect fathers/sons in same bibliographic record so that we don’t link them to the same VIAF cluster
- If a single best VIAF cluster (better than all others) exists – use it
  - Uses dates, $b, $c, titles, preferred form of name to determine best match as described above
- Try the previous rule again adding LCC and DDC class numbers in addition to the other match points (as matches were made in the previous step, data was collected to make this easier)
  - If there is a single best candidate, use it
  - If more than one best candidate – sort candidate clusters based on the number of source records in the clusters. If there is one cluster that has 5 or more sources and the next largest cluster has 2 or fewer sources, use the larger cluster
- Consider clusters where the names are compatible, but not exact name matches
  - Candidate clusters include those where dates and/or enumeration do not exist either in the bibliographic record or the cluster
  - Select the cluster based on the number of sources as described above
- If only one cluster has an LC authority record in it, use that one
- No link is made
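Two of the decisions in the stages above, picking a single best cluster and falling back to the number of source records, can be sketched like this. The function names and the dictionaries keyed by VIAF identifier are hypothetical, and treating a top score below 0 as "no candidate" is an assumption drawn from the scoring description; the other stages are omitted.

```python
from typing import Optional


def single_best(scores: dict[str, int]) -> Optional[str]:
    """Return the cluster whose score is strictly better than all others."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < 0:
        return None
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    return None


def pick_by_sources(source_counts: dict[str, int]) -> Optional[str]:
    """Prefer a cluster with 5+ sources when the runner-up has 2 or fewer."""
    ranked = sorted(source_counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return None
    (top_id, top_n), (_, next_n) = ranked[0], ranked[1]
    return top_id if top_n >= 5 and next_n <= 2 else None


def choose_cluster(scores: dict[str, int], source_counts: dict[str, int]) -> Optional[str]:
    """Try the single-best rule, then the source-count tie-break; None means no link."""
    best = single_best(scores)
    if best is not None:
        return best
    top = max(scores.values(), default=-1)
    if top < 0:
        return None
    tied = {cid: source_counts.get(cid, 0) for cid, s in scores.items() if s == top}
    return pick_by_sources(tied)
```

Returning None rather than guessing reflects the point made earlier that a poor link is often worse than no link at all.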
Fuzzy Title Matching
Since this process is mainly about matching names, and titles are used only to resolve ambiguity, the process described here depends on a separate title matching process. As part of OCLC’s FRBR matching (which happens after the name matching described here) we pull bibliographic records into work clusters, and each bibliographic record in WorldCat has a work identifier associated with it based on these clusters. Once we can associate a work identifier with a VIAF identifier, that work identifier can be used to pull in otherwise ambiguous missed matches on a name. Here is a simple example:
Record 1:
Author: Smith, John
Title: Title with work ID #1
Record 2:
Author: Smith, John
Title: Another title with work ID #1
Record 3:
Author: Smith, John
Title: Title with work ID #2
In this case, if we were able to associate the John Smith in record #1 to a VIAF identifier, we could also assign the same VIAF identifier to the John Smith in record #2 (even though we do not have a direct match on title), but not to the author of record #3. This lets us use all the variant titles we have associated with a work to help sort out the author/contributor names.
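The John Smith example can be sketched in a few lines. The record layout (dictionaries with `name_key`, `work_id` and `viaf_id`) and the identifier "viaf-123" are hypothetical, and skipping propagation when a name and work already point to more than one VIAF identifier is an assumption added to keep the sketch conservative.

```python
from collections import defaultdict


def propagate_by_work(records: list[dict]) -> list[dict]:
    """Copy a VIAF identifier to unlinked names sharing both name key and work ID."""
    known: dict[tuple[str, str], set[str]] = defaultdict(set)
    for rec in records:
        if rec["viaf_id"] is not None:
            known[(rec["name_key"], rec["work_id"])].add(rec["viaf_id"])

    result = []
    for rec in records:
        ids = known.get((rec["name_key"], rec["work_id"]), set())
        # Only fill in a link when the name/work pair points to exactly one cluster.
        if rec["viaf_id"] is None and len(ids) == 1:
            rec = {**rec, "viaf_id": next(iter(ids))}
        result.append(rec)
    return result


records = [
    {"id": 1, "name_key": "smith john", "work_id": "w1", "viaf_id": "viaf-123"},
    {"id": 2, "name_key": "smith john", "work_id": "w1", "viaf_id": None},
    {"id": 3, "name_key": "smith john", "work_id": "w2", "viaf_id": None},
]
linked = propagate_by_work(records)
```

Here record 2 picks up the identifier from record 1 because they share a work ID, while record 3 stays unlinked.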
Of course this is not perfect. There could be two different John Smiths associated with a work (e.g. father and son), so occasionally titles (even those that appear to be properly grouped in a work) can lead us astray.
That's a sketch of how the name matching process operates. Currently WorldCat is updated with this information once per month and it is visible in the various linked data views of WorldCat.
--Th & JT
*If you want to understand more about how dates are processed, our code{4}lib article about Parsing and Matching Dates in VIAF describes that in detail.