I've mentioned before experimenting with a feature similar to Google Suggest. There are AJAX issues in getting the information displayed properly, but first you have to have the information ready to send. The processing is fairly simple, but it turns out a few tricks are needed to make the calculation tractable and the resulting data structure small enough to load into main memory.
The target data structure is a hash table indexed by a string that returns a list of possibilities in ranked order that can be sent to the browser, pretty much instantly. Since we're working with name authorities that have quite a bit of structure, we need to index them with a normalized form of the name, but return a properly formatted name (upper/lower case, diacritics, and even subfield delimiters for authority work).
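To a first approximation, the structure looks like this in Python (the headings shown are illustrative, not from the actual file):

    # Maps a typed (normalized) prefix to a ranked list of display
    # forms, best first. The headings here are just for illustration.
    suggestions = {
        'a':  ['Austen, Jane, 1775-1817', 'Aristotle'],
        'al': ['Alcott, Louisa May, 1832-1888'],
    }

    def suggest(prefix):
        # One hash lookup per keystroke; all the ranking was done offline.
        return suggestions.get(prefix, [])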
So here are the data flow stages:
1. MARC records (e.g. the 60 million in WorldCat)
2. Normalized form of the name + original form of the name
3. Copy of stage 2, sorted on the normalized forms
4. Count of all matching normalized forms + most popular original
5. A set of groups, one for each possible substring, with an ordered list of the most-popular-original headings
For normalization we use NACO normalization. The original form of the name is carried as UTF-8 Unicode, translated from the MARC-8 text in the bibliographic records. The collapsing of step 4 is straightforward; step 5 is a little more involved.
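The full NACO rules are more involved than this, but a rough sketch of the kind of normalization involved (an approximation, not the actual NACO specification):

    import unicodedata

    def naco_like_normalize(s):
        # Rough approximation of NACO normalization, not the full rule
        # set: strip diacritics, fold case, drop most punctuation,
        # collapse runs of whitespace.
        s = unicodedata.normalize('NFD', s)
        s = ''.join(c for c in s if not unicodedata.combining(c))
        s = s.lower()
        s = ''.join(c if c.isalnum() or c.isspace() else ' ' for c in s)
        return ' '.join(s.split())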
Each input line for step 5 has three fields:
- a count (for ranking)
- the normalized form (for matching)
- the original form (for displaying)
with the lines sorted by the normalized form.
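These lines come out of the step-4 collapse. A minimal sketch of that collapse, assuming the stage-3 file is in memory as sorted (normalized, original) pairs (the function and field names are mine, not from the original program):

    import itertools
    from collections import Counter

    def collapse(sorted_pairs):
        # sorted_pairs: iterable of (normalized, original) tuples,
        # sorted on the normalized form.
        for norm, group in itertools.groupby(sorted_pairs, key=lambda p: p[0]):
            originals = Counter(orig for _, orig in group)
            count = sum(originals.values())
            most_popular, _ = originals.most_common(1)[0]
            # One output line per normalized form: count, match key,
            # and the most popular display form.
            yield count, norm, most_popular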
The process for generating the groups is to go through the sorted list heading by heading, and within each heading create a group for each prefix of length 1, 2, ... of the term. Say the first normalized term is 'alfred'. We will need entries for 'a', 'al', 'alf', etc. For each one of these we need to scan through the entries looking for terms that start with that prefix.
There are several tricks that reduce the computation, all of which rely on the list being sorted. One is that when scanning for 'a' you can stop as soon as you hit a heading starting with 'b'. Another is that for 'b' you don't have to start with the a's; you start at the b's and stop at the c's.
Another shortcut is that as the prefixes get longer, the number of headings matching each one gets smaller. Once that number drops below the number of headings you expect to display, you can stop: any longer prefix would just select a subset of a group you have already stored. The program is just a few lines of Python, but it did take a bit of work to get it right; a sketch of the idea follows.
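Here is my reconstruction of that loop with both shortcuts (the names and the display cutoff are mine, not the original code):

    def build_groups(headings, display=10):
        # headings: list of (count, normalized, original) tuples,
        # sorted on the normalized form.
        groups = {}   # prefix -> ranked list of original forms
        done = set()  # prefixes whose matches already fit in one display list
        for i, (_, norm, _) in enumerate(headings):
            for length in range(1, len(norm) + 1):
                prefix = norm[:length]
                if prefix in done:
                    break      # a shorter prefix already captured every match
                if prefix in groups:
                    continue   # built while processing an earlier heading
                # Sorted input means every match for this prefix sits in
                # one contiguous run starting at i: scan forward and stop
                # at the first non-match (the first 'b' after the 'a's).
                j = i
                while j < len(headings) and headings[j][1].startswith(prefix):
                    j += 1
                run = sorted(headings[i:j], key=lambda h: -h[0])  # popular first
                groups[prefix] = [h[2] for h in run[:display]]
                if j - i <= display:
                    done.add(prefix)
                    break      # longer prefixes would only repeat this group
        return groups

At lookup time, a prefix with no stored group is covered by its longest stored ancestor, whose group already contains every matching heading.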
Here are some figures from the VIAF file we've been working with:
- 434,440 records (2 headings/record)
- 623,717 headings after collapsing
- 273,385 groups generated (~90 megabytes)
Even working with all the controlled fields in WorldCat (150 million headings) there are only 750,000 groups in a 150 megabyte file, so this scales up very nicely. Generating the groups on my workstation takes about 80 minutes, but I'm sure it could be partitioned to run across a cluster of machines.
--Th
Update (August 23, 2005): Avoiding the 'hlist[bot:]' list creation results in a dramatic reduction in run time.
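I don't have the original code in front of me, but the shape of that change is presumably something like this ('hlist' and 'bot' are the names from the update; the function around them is my guess):

    def scan_run(hlist, bot, prefix):
        # Slicing version: hlist[bot:] copies the entire tail of the list
        # on every call, just to iterate over part of it:
        #   matches = [h for h in hlist[bot:] if h.startswith(prefix)]
        # Index version: walk the list in place; nothing is copied.
        j = bot
        while j < len(hlist) and hlist[j].startswith(prefix):
            j += 1
        return j - bot  # number of headings in the matching run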