Rosemie Callewaert

viaf.org is down at the moment, but I'm excited to see the results. And nice to meet you again at ELAG!

Graeme Williams

There are two records for Stephen King (the correct one, VIAF ID:17224787, and a German stub, VIAF ID:6129138). There are also five separate records for Richard Bachman, including two in France (VIAF ID:14790462 and VIAF ID:206620), only the first of which is linked to Stephen King. John Scalzi has two separate records (VIAF ID:55326319 and VIAF ID:24906176). Cory Doctorow has two records (VIAF ID:50153866 and VIAF ID:12632978). Ursula Le Guin has two records (VIAF ID:65223351 and VIAF ID:31999015).

Inevitably, some of the duplicates are due to the co-existence of both [name] and [name] - [year of birth]. All authors have a year of birth, so making it optional is just an opportunity for error. Also, some of the duplicate author records mentioned above have no publications.

What I want on every catalog I use is a big red button for reporting errors. You could have one for "merge this record with record N" as well as "this record needs splitting". I assume that some of the reports generated from the button would be in error (false positives, so to speak), but I'm not aware of anyone who's tried it, so you might as well be the first.

Is the error rate for this dataset known?

If you don't know the error rate, I'd recommend taking a sample of perhaps a thousand records and checking to see if they had duplicates or not. You could weight the sample either by holdings or circulation.
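To make the sampling suggestion concrete, here is a minimal Python sketch. It assumes each record is a dict with a hypothetical `holdings` field and that a person has manually marked each sampled record as a duplicate or not. The function names, the field name, and the choice of a Wilson interval are illustrative assumptions, not anything VIAF is known to use.

```python
import math
import random


def weighted_sample(records, weight_key, k, seed=0):
    """Draw k records with replacement, with probability proportional
    to the (hypothetical) weight field, e.g. holdings or circulation."""
    rng = random.Random(seed)
    weights = [r[weight_key] for r in records]
    return rng.choices(records, weights=weights, k=k)


def wilson_interval(errors, n, z=1.96):
    """Approximate 95% confidence interval for an error proportion,
    given `errors` duplicates found among `n` checked records."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical catalog: record ids with made-up holdings counts.
    catalog = [{"id": i, "holdings": 1 + i % 50} for i in range(10_000)]
    sample = weighted_sample(catalog, "holdings", k=1000)
    # Suppose manual checking found 50 duplicates among the 1000 sampled.
    low, high = wilson_interval(50, len(sample))
    print(f"estimated duplicate rate: 5.0% (95% CI {low:.1%} to {high:.1%})")
```

Because the sample is weighted, the resulting figure estimates how often a user landing on a heavily held record would hit a duplicate, rather than the raw share of duplicate records in the file.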

Since authors are more likely to be represented by duplicate records if they have more publications, you could also check the top few hundred authors by number of publications. This wouldn't give you an estimate of the error rate for the dataset as a whole, of course.

Thanks for pointing those out. Some of those look like errors, at least one looks legitimate, and most seem to be there because of our fairly conservative matching process.

Our matching is weighted towards avoiding incorrect matches; our goal is for 99.5% of the matches to be correct. The flip side of this is that many potential matches get missed, primarily because of lack of information, but sometimes because of conflicts in the matching. One of our 'rules' is that we do not merge records from a single file unless the names are the same, including two dates.
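As a rough illustration of that single-file rule, here is a sketch in Python. It reads "including two dates" as meaning both records carry a birth year and a death year and all of them agree; that reading, the `AuthorityRecord` type, and its field names are assumptions made for the example, not the actual VIAF implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthorityRecord:
    source: str                       # hypothetical: contributing file
    name: str                         # heading as supplied by the source
    birth_year: Optional[int] = None
    death_year: Optional[int] = None


def same_file_merge_allowed(a: AuthorityRecord, b: AuthorityRecord) -> bool:
    """Conservative check for two records from the same source file:
    merge only if the names are identical and both dates are present
    on both records and agree."""
    if a.source != b.source:
        raise ValueError("this rule only applies to records from one file")
    if a.name != b.name:
        return False
    dates_a = (a.birth_year, a.death_year)
    dates_b = (b.birth_year, b.death_year)
    if None in dates_a or None in dates_b:
        return False                  # missing information: do not merge
    return dates_a == dates_b
```

Under this reading, a same-file pair where either record lacks a date is left unmerged, which would leave duplicates like the ones listed above in place rather than risk an incorrect match.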

We aren't quite ready to process a large number of error reports, but we do have plans for how to approach that. The nature of VIAF is that there are two primary sources of error: problems in the source files and problems in how we process the source files. Right now we are concentrating on the second of those: changes we can make in our software to correctly reflect the underlying data.


