VIAF has long interchanged data with Wikipedia, and the resulting links between library authorities and Wikipedia are widely used. Unfortunately we only harvested data from the English Wikipedia (en.wikipedia.org), so we missed names, identifiers and other information in non-English Wikipedia pages.
Fortunately the problem VIAF had with Wikipedia was similar to the problems that Wikipedia itself had in sharing data across language versions. Wikidata is Wikimedia's solution to the problem, and over the last year or two it has grown from promising to useful. In fact, from VIAF's point of view Wikidata now looks substantially better than just working with the English pages. In addition to picking up many more titles for names, we are finding a million names that do not occur in the English pages, and the number of names that match those in other VIAF sources has nearly doubled, to 800 thousand from 440 thousand.
Since we (i.e. Jenny Toves) were reexamining the process, we took the opportunity to harvest corporate/organization names as well, something we have wanted for some time; some 300K of the increase comes from those.
We expect to have the new data in VIAF in mid to late April 2015 and it is visible now in our test system at http://test.viaf.org.
The advantages we see:
- Much less bias towards English
- More entities (people and organizations)
- More coded information about the entities
- More non-Latin forms of names
- More links into Wikipedia
This will cause some changes in the data that are visible in the VIAF interface. One of these is that VIAF will link to the Wikidata pages rather than the English Wikipedia pages, and we are changing the WKP icon to reflect that. This means that Jane Austen's WKP identifier (WKP is VIAF's abbreviation for Wikipedia) will change from WKP|Jane_Austen to WKP|Q36322. Links to the WKP source page will change from
http://en.wikipedia.org/wiki/Jane_Austen
to
http://www.wikidata.org/entity/Q36322
Although it is possible to jump from the Wikidata pages to Wikipedia pages in specific languages, we feel these links are important enough that we will be importing all the language-specific Wikipedia page links we find in Wikidata. These will show up as 'external links' in the interface in the 'About' section of the display.
A commonly used bulk file from VIAF is the 'links' file that shows all the links made between VIAF identifiers and source file identifiers (pointers to the bulk files can be found here). The links file includes external links, so the individual Wikipedia pages will show up in the file along with the Wikidata WKP IDs. Here are some of the current links in the file for Lorcan Dempsey:
http://viaf.org/viaf/36978042 BAV|ADV11117013
http://viaf.org/viaf/36978042 BNF|12276780
. . .
http://viaf.org/viaf/36978042 SUDOC|031580661
http://viaf.org/viaf/36978042 WKP|Lorcan_Dempsey
http://viaf.org/viaf/36978042 XA|2219
The new file will change to:
http://viaf.org/viaf/36978042 BAV|ADV11117013
http://viaf.org/viaf/36978042 BNF|12276780
. . .
http://viaf.org/viaf/36978042 WKP|Q6678817
http://viaf.org/viaf/36978042 WKP|http://en.wikipedia.org/wiki/Lorcan_Dempsey
http://viaf.org/viaf/36978042 XA|2219
Lorcan only has one Wikipedia page, the English language one. Jane Austen has more than a hundred, and all those links will be there.
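For anyone consuming the links file, here is a minimal Python sketch of how the three kinds of WKP value shown above might be told apart. It is an illustration only, not VIAF's own code: it assumes each line is a VIAF URI, whitespace, then SOURCE|value, as in the samples above, and the names parse_link, classify_wkp and wikidata_entity_url are hypothetical.

```python
import re
from typing import NamedTuple

# Assumption: a Wikidata item ID is "Q" followed by digits (e.g. Q36322).
QID = re.compile(r"Q[1-9][0-9]*")


class Link(NamedTuple):
    viaf_uri: str   # e.g. http://viaf.org/viaf/36978042
    source: str     # e.g. BNF, SUDOC, WKP
    value: str      # identifier or URL after the "|"
    kind: str       # "source_id", "wikidata", "wikipedia_url" or "legacy_title"


def classify_wkp(value: str) -> str:
    """Guess what a WKP value is. A heuristic, not an official rule."""
    if QID.fullmatch(value):
        return "wikidata"
    if value.startswith(("http://", "https://")):
        return "wikipedia_url"
    # Anything else is treated as an old-style English Wikipedia title.
    return "legacy_title"


def parse_link(line: str) -> Link:
    """Split one links-file line into its parts (format assumed, see lead-in)."""
    viaf_uri, pair = line.strip().split(None, 1)
    source, value = pair.split("|", 1)
    kind = classify_wkp(value) if source == "WKP" else "source_id"
    return Link(viaf_uri, source, value, kind)


def wikidata_entity_url(qid: str) -> str:
    """Build the source-page link in the form shown in the post."""
    return "http://www.wikidata.org/entity/" + qid


sample = [
    "http://viaf.org/viaf/36978042 BNF|12276780",
    "http://viaf.org/viaf/36978042 WKP|Lorcan_Dempsey",
    "http://viaf.org/viaf/36978042 WKP|Q6678817",
    "http://viaf.org/viaf/36978042 WKP|http://en.wikipedia.org/wiki/Lorcan_Dempsey",
]
links = [parse_link(line) for line in sample]
for link in links:
    print(link.source, link.kind, link.value)
```

An application that currently expects WKP|Lorcan_Dempsey could use a check like this to accept both the old and the new file during the changeover.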
Of course, this also means some changes to the RDF view of the data. We're still working on that and will post more information when we get it closer to its final form.
--Th
A study for Europeana of datasets including Person/Organization names: http://vladimiralexiev.github.io/CH-names/README.html. Conclusions:
- The best datasets to use for name enrichment are VIAF and Wikidata
- There are few name forms in common between the "library-tradition" datasets (dominated by VIAF) and the "LOD-tradition datasets" (dominated by Wikidata)
- VIAF has more name variations and permutations, Wikidata has more translations
- VIAF is much bigger (sec 2.4.2): 35M persons/orgs. Wikidata has 2.7M persons and maybe 1M orgs
- Only 0.5M of Wikidata persons/orgs are coreferenced to VIAF, with maybe another 0.5M coreferenced to other datasets, either VIAF-constituent (eg GND) or non-constituent (eg RKDartists)
- A lot can be gained by leveraging coreferencing across VIAF and Wikidata
- Wikidata has great tools for crowd-sourced coreferencing
I'm very glad of your news above. This means the rift between Wikidata and VIAF will narrow quickly.
- presentation "Wikidata, a target for Europeana's semantic strategy?" (https://nl.wikimedia.org/wiki/GLAM-WIKI_2015/Proposals/Wikidata,_a_target_for_Europeana%E2%80%99s_semantic_strategy%3F) upcoming at GLAM-WIKI 2015
- please participate in https://www.wikidata.org/wiki/Wikidata:WikiProject_Authority_control
#coreferencing works!
Posted by: Valexiev1 | March 27, 2015 at 11:28
This is a great move, and I'm happy to see it happening.
However, I'm not sure it's a good idea for you to store Wikipedia links, once you have the Wikidata ID.
Wikidata IDs should never change (even if two duplicates are merged, one will redirect to the other, so still be valid).
However, if another notable person called Lorcan Dempsey emerges, say a footballer, then the existing Wikipedia page may be moved to, say:
https://en.wikipedia.org/wiki/Lorcan_Dempsey_(librarian)
to allow for the hypothetical
https://en.wikipedia.org/wiki/Lorcan_Dempsey_(footballer)
and the original article:
https://en.wikipedia.org/wiki/Lorcan_Dempsey
would become what we call a "disambiguation" page, listing the others, but not in a machine-readable format.
Unless VIAF is going to regularly scan for such changes and update its links, it may be better for people (or software) using VIAF data to fetch the Wikidata links, then to fetch up-to-date Wikipedia links from Wikidata.
Reply: We harvest Wikipedia/Wikidata each month, so the links should stay reasonably in sync. --Th
Posted by: Andy Mabbett | March 28, 2015 at 05:57
I'm very excited to hear about this move, and about the increased name coverage. I'm also glad you'll be keeping the article URLs in the links file as well as the data identifiers; although one can theoretically get one from the other, it's much more convenient to have them together. I look forward to hearing when the first enhanced links file is available.
(I also maintain a set of topical subject-article correspondences-- currently in Github rather than Wikidata-- and it's not too hard to keep the article titles in sync after each monthly English Wikipedia dump. I'd imagine it's not too hard for OCLC to keep on top of the article-title changes for names as well.)
Posted by: John Mark Ockerbloom | March 29, 2015 at 15:21
Great to hear about that move!
However, as a user who has built an application using the justlinks service, I find one sentence scary:
"Jane Austen's WKP identifier will change from WKP|Jane_Austen to WKP|Q36322"
This will cause our application to break. We will have to check when this happens, and fix it in a hurry. (Thanks for the test environment for preparing this step.)
It would be great if you could provide the Wikidata link with a new WKD tag, while leaving the Wikipedia links at WKP. Finally, one could argue that these are different datasets.
Cheers, Joachim
Reply: We went back and forth about changing the abbreviation, but decided there would be less confusion (and, to be candid, fewer changes on our end) if we left it the same. Most applications will have to change to cope with the large numbers of new Wikipedia links in any case.
--Th
Posted by: Joachim Neubert | March 30, 2015 at 11:36
Fantastic to hear. Wikidata is becoming more and more the way data can be organized so linking the two should make connecting data across the world that much easier.
Posted by: Matt | April 01, 2015 at 16:55
I didn't know about Wikidata! That is really interesting! I also don't find Wikipedia that reliable, but I still like to use it for most of the things that I need to research! Thanks! Greets!
Posted by: Jayleen | April 15, 2015 at 10:21