I thought some people might be interested in the infrastructure that supports VIAF.
There are two distinct aspects to it: the first is the transformation of the raw data into a database, and the second is the online system that users interact with.
Essentially all the code that gets the data ready to be indexed is written in Python. Other than the standard Python (2.7) library it is all our own code, with no added libraries. Most of the processing is done with map-reduce, using a system written by Jenny Toves (in Python of course), who also wrote much of the matching code for VIAF. We started using map-reduce shortly after Google wrote about the technique, and have gradually improved and extended our system as VIAF and WorldCat processing got larger and larger.
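To give a flavor of the pattern (this is just an illustrative sketch of the general shape, not Jenny's actual framework or matching logic), a map step might emit a normalized heading as the key for each source record, and a reduce step might collect everything that shares that key:

    from itertools import groupby
    from operator import itemgetter

    def map_record(record):
        # emit (key, value): a crudely normalized heading and the record's source id
        yield record['heading'].lower().strip(), record['id']

    def reduce_group(key, ids):
        # everything sharing a normalized heading lands in one group
        return key, sorted(ids)

    def run(records):
        # stands in for the framework's sort/shuffle between the map and reduce steps
        pairs = sorted(kv for rec in records for kv in map_record(rec))
        for key, group in groupby(pairs, itemgetter(0)):
            yield reduce_group(key, [v for _, v in group])

The real matching is far more elaborate than a lowercased string compare, but the emit-a-key, sort, reduce-over-each-group shape is the part that carries over to Hadoop.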
However, Hadoop has overtaken us! We are in the midst of moving to a new machine set up for running Hadoop, porting the VIAF build software to it along with some FRBR processing. Most of the code will remain in Python, but will use Hadoop's streaming map-reduce support, which is really quite nice. In addition to becoming a standard part of OCLC's production software, Hadoop has a lot of nice job-control and monitoring features that are more comprehensive than anything we were ever able to add to our home-built framework, and it gives us access to HBase.
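Under Hadoop Streaming that same split becomes two small scripts that read stdin and write tab-delimited key/value lines to stdout, with Hadoop doing the sort and grouping in between. A rough sketch (the file names and HDFS paths are made up):

    # mapper.py -- input lines look like: id<TAB>heading
    import sys

    for line in sys.stdin:
        rec_id, heading = line.rstrip('\n').split('\t', 1)
        print '%s\t%s' % (heading.lower().strip(), rec_id)

    # reducer.py -- lines arrive sorted by key, so each key's values are contiguous
    import sys

    current, ids = None, []
    for line in sys.stdin:
        key, rec_id = line.rstrip('\n').split('\t', 1)
        if key != current and current is not None:
            print '%s\t%s' % (current, ','.join(ids))
            ids = []
        current = key
        ids.append(rec_id)
    if current is not None:
        print '%s\t%s' % (current, ','.join(ids))

    # run it with the streaming jar (its location varies by Hadoop version):
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /viaf/source-records -output /viaf/grouped \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py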
The current machine we build VIAF on is a 33-node Beowulf cluster with 132 cores, 4GBytes of RAM per core and about 1.5TBytes of disk per node. When we first got it a few years ago that seemed like a lot of disk, but we have managed to use most of it. The new Hadoop machine has 40 8-core nodes plus a name node with 12 cores, for a total of 332 cores. Each of the 40 8-core machines has three 2TByte disks and 32GBytes of RAM, so the new cluster has several times the capacity of the old one, as well as 10 times the network speed (1 Gbit up to 10 Gbit). The old cluster's disk was fully RAIDed, but the new cluster depends mainly on HDFS for reliability (3x replication).
We could probably process the 120 million bibliographic and 30 million authority source records into VIAF clusters in about a day's processing on the old machine. Normally that gets spread out over several days and has to compete with other jobs.
One of the longest processes in building VIAF is the final indexing stage, which right now runs as a single process. As we move to the new Hadoop machine, we are using Hadoop to parallelize that stage, removing a bottleneck.
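One straightforward way to spread that kind of job out under Hadoop (sketched here with made-up names, not our actual job) is to have the map step route each VIAF cluster to one of N index shards, so each reducer can build its piece of the index independently:

    # shard_mapper.py -- route each cluster record to one of N index shards
    import sys

    NUM_SHARDS = 32  # illustrative; a real shard count would be tuned to the cluster

    for line in sys.stdin:
        cluster_id = line.split('\t', 1)[0]
        print '%d\t%s' % (hash(cluster_id) % NUM_SHARDS, line.rstrip('\n'))

With the shard number as the key, each reducer sees all the records for one shard and writes that shard's index on its own, so a stage that used to be one long serial run becomes N shorter ones in parallel.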
The online system runs on a standard Linux box. Over the next few months we expect to move that to a more robust production environment with better monitoring, support and redundant servers. The indexing and server software is all Java, running in Apache Tomcat. We use Pears, an open-source OCLC indexing engine written by Ralph LeVan. Ralph has added linked-data support, multilingual options and AtomPub support to what is a fairly thin layer of URL rewrite rules and XSLT scripts on top of an SRU system, also built by Ralph.
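For anyone who hasn't bumped into SRU: it is a simple HTTP search protocol, so talking to a system like this is just a GET request carrying a CQL query. A hypothetical request (the host, path and index here are invented, not VIAF's actual configuration):

    # sru_query.py -- minimal SRU searchRetrieve request (Python 2.7)
    import urllib
    import urllib2

    params = urllib.urlencode({
        'operation': 'searchRetrieve',
        'version': '1.1',
        'query': 'cql.any = "Twain, Mark"',
        'maximumRecords': '10',
    })
    # the endpoint below is illustrative only
    print urllib2.urlopen('http://sru.example.org/viaf?' + params).read()

The URL rewrite rules and XSLT scripts mentioned above sit in front of responses like this one, turning the raw SRU XML into the HTML and linked-data views that users actually see.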
--Th