The Virtual International Authority File (VIAF) is a project jointly administered by LC, the BnF, the DNB and OCLC. The National Library of Sweden is also a participant, soon to be joined by several other libraries. We currently have about 7.8 million VIAF records built from 9.2 million source records.
The VIAF site has recently had a major overhaul. What you now search are records created from a merge of matching source authority records. Within this record you can see what source records were used to create it, along with cross references and other information gleaned both from the authority records and from associated bibliographic records.
In addition to the VIAF record there are MARC-21 and UNIMARC views of the data. For example, here is the record for Barbara Tillett (one of the main forces behind VIAF): http://viaf.org/viaf/77390479. Adding a file extension to that URL gives you a MARC-21 view (http://viaf.org/viaf/77390479.m21) or a UNIMARC view (http://viaf.org/viaf/77390479.unimarc). The MARC export is new, so any error reports are welcome.
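As a small illustration of the URL pattern described above, here is a minimal Python sketch that builds the three URLs from a VIAF ID. The function name `viaf_record_url` and the view keys `"marc21"` and `"unimarc"` are made up for this example; only the base URL and the two extensions come from the post.

```python
# Sketch only: builds record URLs following the pattern shown in the post.
# The function name and the view keys are hypothetical, not part of VIAF.
VIAF_BASE = "http://viaf.org/viaf"

# Extensions taken from the example URLs in the post.
VIEW_EXTENSIONS = {
    "marc21": ".m21",
    "unimarc": ".unimarc",
}


def viaf_record_url(viaf_id, view=None):
    """Return the URL for a VIAF record, optionally in a MARC view."""
    url = f"{VIAF_BASE}/{viaf_id}"
    if view is None:
        return url  # the default VIAF record page
    if view not in VIEW_EXTENSIONS:
        raise ValueError(f"unknown view: {view!r}")
    return url + VIEW_EXTENSIONS[view]


if __name__ == "__main__":
    for view in (None, "marc21", "unimarc"):
        print(viaf_record_url(77390479, view))
```

The sketch only constructs the URLs; it does not fetch anything, so it works whether or not the site is reachable.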
In the not too distant future we expect to offer a 'linked data' view of VIAF.
The whole site is just a thin layer of XSLT stylesheets and URL rewrite rules over an SRU database. Other than some JavaScript, virtually everything else is XSL transforms on XML data returned from SRU queries. The graph is drawn with jsGraphics, a JavaScript vector graphics package. Mouseovers and similar interactions currently work better in Firefox than in IE.
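To give a feel for what "an SRU query" looks like underneath, here is a rough Python sketch that assembles a searchRetrieve URL. The parameter names (`operation`, `version`, `query`, `maximumRecords`) are the standard SRU ones; the base URL `http://example.org/sru` and the CQL index `local.names` are placeholders for illustration and are not claimed to be what the VIAF server uses.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU searchRetrieve URL for a CQL query.

    base_url and the CQL index names are supplied by the caller; the
    values used in the demo below are placeholders, not VIAF's own.
    """
    params = {
        "operation": "searchRetrieve",  # standard SRU operation name
        "version": "1.1",
        "query": cql_query,             # CQL query string
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint and index name, for illustration only.
    print(sru_search_url("http://example.org/sru", 'local.names all "tillett"', 5))
```

The response to such a request is XML, which is what makes the "everything is an XSL transform" approach workable: a stylesheet turns the returned records directly into the page.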
We know the matching process still needs refinement, and we are interested in any errors people notice: names that should have been brought together, names that should not have been, and any interface problems you run into.
--Th
viaf.org is down at the moment, but I'm excited to see the results. And nice to meet you again at Elag!
Posted by: Rosemie Callewaert | April 03, 2009 at 16:18
There are two records for Stephen King (the correct one, VIAF ID:17224787, and a German stub, VIAF ID:6129138). And five separate records for Richard Bachman, including two in France (VIAF ID:14790462 and VIAF ID:206620), only the first of which is linked to Stephen King. John Scalzi has two separate records (VIAF ID:55326319 and VIAF ID:24906176). Cory Doctorow has two records (VIAF ID:50153866 and VIAF ID:12632978). Ursula Le Guin has two records (VIAF ID:65223351 and VIAF ID:31999015).
Inevitably, some of the duplicates are due to the co-existence of both [name] and [name] - [year of birth]. All authors have a year of birth, so making it optional is just an opportunity for error. Also, some of the duplicate author records mentioned above have no publications.
What I want on every catalog I use is a big red button for reporting errors. You could have one for "merge this record with record N" as well as "this record needs splitting". I assume that some of the reports generated from the button would be in error (false positives, so to speak), but I'm not aware of anyone who's tried it, so you might as well be the first.
Is the error rate for this dataset known?
If you don't know the error rate, I'd recommend taking a sample of perhaps a thousand records and checking to see if they had duplicates or not. You could weight the sample either by holdings or circulation.
Since authors are more likely to be represented by duplicate records if they have more publications, you could also check the top few hundred authors by number of publications. This wouldn't give you an estimate of the error rate for the dataset as a whole, of course.
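The sampling check suggested above could be sketched roughly as follows in Python. This assumes a simple unweighted random sample (not the holdings- or circulation-weighted variant) and uses the normal-approximation confidence interval for a proportion; the function names are made up for this example, and the number of duplicates found would come from checking the sampled records by hand.

```python
import math
import random


def draw_sample(record_ids, sample_size, seed=None):
    """Draw a simple random sample of record IDs without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(record_ids), sample_size)


def estimate_error_rate(n_errors, n_sampled, z=1.96):
    """Estimate an error rate from a checked sample.

    Returns (rate, low, high), where low/high bound an approximate 95%
    interval (normal approximation) when z is 1.96.
    """
    if n_sampled <= 0:
        raise ValueError("n_sampled must be positive")
    rate = n_errors / n_sampled
    margin = z * math.sqrt(rate * (1 - rate) / n_sampled)
    return rate, max(0.0, rate - margin), min(1.0, rate + margin)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements of VIAF.
    sample = draw_sample(range(1, 10001), 1000, seed=42)
    rate, low, high = estimate_error_rate(n_errors=50, n_sampled=len(sample))
    print(f"estimated rate {rate:.3f}, interval {low:.3f} to {high:.3f}")
```

With a thousand records the interval is fairly tight for rates of a few percent, which is why a sample of that size is a reasonable starting point.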
Response:
Thanks for pointing those out. Some of those look like errors, at least one looks legitimate, and most seem to be there because of our fairly conservative matching process.
Our matching is weighted towards avoiding incorrect matches; our goal is for 99.5% of the matches to be correct. The flip side of this is that many potential matches get missed, primarily because of lack of information, but sometimes because of conflicts in the matching. One of our 'rules' is that we do not merge records from a single file unless the names are the same, including two dates.
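To make the single-file rule concrete, here is a small Python sketch of one possible reading of it. This is not our production matching code: the record fields, the function name, and the interpretation of "including two dates" as "both records carry the same birth and death years" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthorityRecord:
    """Hypothetical, simplified stand-in for a source authority record."""
    source: str                      # which source file the record came from
    name: str                        # normalized heading, without dates
    birth_year: Optional[int] = None
    death_year: Optional[int] = None


def same_file_rule_allows_merge(a, b):
    """Apply the single-file rule (as interpreted here) to two records.

    Records from different files are not restricted by this rule.
    Records from the same file may merge only if the names match and
    both records carry the same two dates.
    """
    if a.source != b.source:
        return True  # rule does not apply across files
    if a.name != b.name:
        return False
    dates_a = (a.birth_year, a.death_year)
    dates_b = (b.birth_year, b.death_year)
    if None in dates_a or None in dates_b:
        return False  # a missing date means not enough evidence
    return dates_a == dates_b


if __name__ == "__main__":
    # Made-up records, not real authority data.
    r1 = AuthorityRecord("FileA", "Example, Author", 1900, 1980)
    r2 = AuthorityRecord("FileA", "Example, Author", 1900, None)
    print(same_file_rule_allows_merge(r1, r2))
```

Under this reading, an undated heading and a dated heading in the same file stay separate, which is consistent with the kind of [name] versus [name] - [year of birth] duplicates noted in the comment.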
We aren't quite ready to process a large number of error reports, but we do have plans on how to approach that. The nature of VIAF is that there are two primary sources of error: problems in the source files and problems in how we process the source files. Right now we are concentrating on the second of those--changes we can make in our software to correctly reflect the underlying data.
--Th
Posted by: Graeme Williams | April 04, 2009 at 19:14