The October 2012 VIAF dataset is now available.
The page at http://viaf.org/viaf/data describes and links to the files involved and explains how we expect the ODC-By license to be applied. Some of the files are larger than 7 Gigabytes; if the site appears slow, please stop downloading and come back later. See an earlier post (http://outgoing.typepad.com/outgoing/2012/05/viaf-dataset.html) for more information about how the files are distributed.
These files (dated 20121101, the date we created the clusters) are the first ones available since July. We skipped August altogether as we concentrated on moving the VIAF database creation to Hadoop and new hardware. Although we updated VIAF online in September, that was still done using our old platform, so the current system at http://viaf.org and this dataset are the first produced on the new platform.
We think the database is good, although we have seen some missing matches between corporate names and some personal names with single dates. That should be corrected later this month (November 2012).
For those interested in the technology behind VIAF, we previously used a home-grown version of map-reduce (written in pure Python) for our processing. We have now moved to a cluster set up to run Hadoop and migrated all our processes to the new machine, which has 2-3 times the cycles, disk and RAM that our old cluster had. With that, plus some refactoring and more use of map-reduce, we are able to build VIAF much more quickly. We think a build will take pretty much overnight once we get our build process fully automated.
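For readers unfamiliar with the map-reduce pattern mentioned above, here is a minimal pure-Python sketch of the idea. It is not the VIAF code: the record layout, the source codes, the normalization rule and every function name are assumptions made up for illustration. It only shows the three phases (map, group by key, reduce) applied to a toy name-matching problem.

```python
from collections import defaultdict
from itertools import chain


def map_phase(records, mapper):
    """Apply the mapper to every record and flatten the (key, value) pairs."""
    return chain.from_iterable(mapper(record) for record in records)


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def normalize_mapper(record):
    """Hypothetical mapper: emit a crude normalized name as key, the source as value."""
    source, name = record
    key = " ".join(name.lower().replace(",", " ").split())
    yield key, source


def sources_reducer(key, sources):
    """Hypothetical reducer: list the distinct sources that share a key."""
    return sorted(set(sources))


def run(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = map_phase(records, normalize_mapper)
    return reduce_phase(shuffle(pairs), sources_reducer)


if __name__ == "__main__":
    # Invented sample records: (source code, name heading).
    sample = [
        ("LC", "Twain, Mark"),
        ("DNB", "twain,  mark"),
        ("NDL", "Natsume, Soseki"),
    ]
    for key, sources in run(sample).items():
        print(key, sources)
```

On a real cluster the map and reduce functions run in parallel on many machines and the framework handles the grouping step; the in-memory dictionary here only stands in for that.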
There have been a number of changes in the processing, some of which might be visible. Let us know if something looks wrong/changed/dumb and we'll see what we can do.
One of the largest changes in the October 2012 build is that Wikipedia names are treated very much like names from VIAF's other source files. Another is the presence of nearly a million names from the National Diet Library of Japan.
--Th
Update 2012.11.15: We inadvertently left out the DBPedia links that were in the RDF view of the clusters (see some discussion in the comments). As of today the viaf.org/viaf/data file points at an updated copy of the RDF dump which has DBPedia links in it.
"Some of the files are larger than 7 Gigabytes; if the site appears slow, please stop downloading and come back later."
Sounds like a job for tersaur
Posted by: Ryan Shaw | November 12, 2012 at 12:51
May I ask why the Wikipedia article names are all-lowercase on the VIAF website? Since the links are all right and this display is used nowhere in Wikipedia either, I don't see the problem with displaying the names correctly.
Posted by: AndreasP | November 13, 2012 at 06:09
The lowercase Wikipedia names are a temporary condition. Right now we just used a pull from Wikipedia that was done for background use. We plan a new extraction soon which will address the lowercase and other issues with Wikipedia.
--Th
Posted by: Thom | November 13, 2012 at 08:37
Have the DBpedia links disappeared? Are they coming back? Thanks.
Posted by: Owen Stephens | November 14, 2012 at 11:23
Hi Thom. Unless we're missing something we've just spotted that the DBPedia links have gone from the VIAF RDF. This seems a real shame. It's how we got all the sameas to DBPedia for the locah project (http://archiveshub.ac.uk/locah/) and http://data.archiveshub.ac.uk linked data so was a significant VIAF use case for us. Am I missing something?
Ade Stevenson
Posted by: Adrian Stevenson | November 14, 2012 at 11:29
Oops! I assume we are talking about the rdf file that used to point to DBPedia.
The way we are handling Wikipedia has changed. That change had the unintended consequence of eliminating the DBPedia links. Should be fixable. We will probably put out a new file with a slightly different name with the DBPedia links included.
Thanks!
--Th
Update (2012.11.15): There is a new RDF file with the DBPedia links in it. http://viaf.org/viaf/data now shows it. The online equivalent will be updated soon.
Posted by: Thom | November 14, 2012 at 16:43