The 2012 October VIAF dataset is now available.
http://viaf.org/viaf/data describes and links to the files involved and explains how we expect the ODC-By license to be applied. Some of the files are larger than 7 gigabytes; if the site appears slow, please stop downloading and come back later. See an earlier post (http://outgoing.typepad.com/outgoing/2012/05/viaf-dataset.html) for more information about how the files are distributed.
These files (dated 20121101, the date we created the clusters) are the first ones available since July. We skipped August altogether as we concentrated on moving the VIAF database creation to Hadoop and new hardware. Although we updated VIAF online in September, that was still done using our old platform, so the current system at http://viaf.org and this dataset are the first produced on the new platform.
We think the database is good, although we have seen some missing matches between corporate names and some personal names with single dates. That should be corrected later this month (November 2012).
For those interested in the technology behind VIAF, we previously used a home-grown version of map-reduce (written in pure Python) for our processing. We have now moved to a cluster set up to run Hadoop and have migrated all our processes to the new hardware, which has 2-3 times the cycles, disk and RAM that our old cluster had. With that, plus some refactoring and more use of map-reduce, we are able to build VIAF much more quickly. We think a build will run pretty much overnight once we get our build process fully automated.
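For readers unfamiliar with the pattern, here is a minimal single-process sketch of map-reduce in pure Python. It is only an illustration of the general idea, not our actual code: the function names, the normalize() matching key, and the three sample records are all made up for this example, and real VIAF matching is far more involved.

```python
from collections import defaultdict
from itertools import chain


def map_reduce(records, mapper, reducer):
    """Run a single-process map-reduce over an iterable of records."""
    # Map phase: each record yields zero or more (key, value) pairs.
    pairs = chain.from_iterable(mapper(r) for r in records)

    # Shuffle phase: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce phase: collapse each group into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def normalize(name):
    """Hypothetical matching key: lowercase, drop punctuation, sort tokens."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(sorted(cleaned.split()))


def name_mapper(record):
    # Emit the normalized name as the key and the source code as the value.
    source, name = record
    yield normalize(name), source


def source_reducer(key, sources):
    # One "cluster" per key: the distinct sources that contributed the name.
    return sorted(set(sources))


# Illustrative records only: (source code, name heading).
records = [
    ("LC", "Twain, Mark"),
    ("DNB", "Mark Twain"),
    ("NDL", "Natsume, Soseki"),
]
clusters = map_reduce(records, name_mapper, source_reducer)
```

On a framework like Hadoop the map and reduce functions keep the same shape, while the grouping step is spread across many machines by the framework rather than done in one in-memory dictionary.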
There have been a number of changes in the processing, some of which might be visible. Let us know if something looks wrong/changed/dumb and we'll see what we can do.
One of the largest changes in the October 2012 build is that Wikipedia names are treated very much like names from VIAF's other source files. Another is the presence of nearly a million names from the National Diet Library of Japan.
Update 2012.11.15: We inadvertently left out the DBpedia links that were in the RDF view of the clusters (see some discussion in the comments). As of today the viaf.org/viaf/data file points at an updated copy of the RDF dump which has DBpedia links in it.