This isn't actually available yet, but I couldn't help showing a preview: the Open WorldCat Find-in-a-Library page hooked up with Google Maps. For this example I searched Yahoo for 'McCarthy's Bar', went to the Open WorldCat page, and then clicked a bookmarklet that switched from the standard view to the map view.
Wish I could claim responsibility, but this came from Mike Teets (head of our Product Architecture and Development groups). It should be available soon here. This sort of thing can be done quickly because Google Maps is implemented in an open way, as a web service.
Although the gmaps bookmarklet isn't available yet, the Product works site has some useful browser extensions for Open WorldCat that have been released.
--Th
Continue reading "More loosely coupled" »
A theme that keeps cropping up is the idea of keeping systems loosely coupled. Loosely coupled systems allow pieces to be written with less interaction between developers. Reading things like Wikipedia's entry on Loosely Coupled might make you think that this is something new, but I remember developers at OCLC talking about it 20 years ago. An example at OCLC is the use of Z39.50 as an internal protocol, something we still do to a certain extent. The use of an established standard interface (even better, an interface we didn't have to develop ourselves) worked very well for us. The Web has made everyone more aware of this, however. For example, the link above to Wikipedia will continue to work as long as the system is active and an entry called 'Loosely_Coupled' exists. Lots of other things can and will change, from my browser to the Wikipedia wiki software, but that link should still work.
In addition to the xISBN and authority services I've mentioned in previous posts, we've started some experiments using IE's 'research pane' to expose terminologies.
Continue reading "Loosely coupled" »
The (first annual) OCLC Research Software Contest has a winner! Dazhi (David) Jiao of Bloomington, Indiana won for his OPAC (sorry if that link isn't working, David is moving it to a more permanent location), which includes a ranked list of harvested citations when a detailed bibliographic record is displayed. The judges (Elizabeth Lawley, Roy Tennant, Jon Udell, and three of us here at OCLC) thought David's submission showed an innovative way of integrating the OPAC with harvested metadata and that it made good use of open source software from OCLC.
In addition to the honor of winning, David won $2,500 and a trip to Dublin, Ohio to visit us.
Continue reading "OR Software Contest" »
One often hears statements like 'transmitting the Library of Congress in 15 minutes' or 'equivalent of the Library of Congress on a key fob.' I suppose I've contributed to that by putting all of WorldCat on an iPod (not hard to do, but the iPod isn't up to interacting with it yet), but how realistic is LC on a key fob?
The Library of Congress is larger than I thought. The site claims 29 million books, 2.7 million recordings, 12 million photographs, 4.8 million maps and 58 million manuscripts! Sometimes people equate a volume to a megabyte (the typical novel is around that), but more realistically, you'll need a scanned image of those pages, around 100 KB/page. At 500 pages/volume that gives us about 50 MB per volume. At 500 MB/recording, 2 MB/photo, 5 MB/map and 50 MB/manuscript, and rounding the counts, I get: 30m x 50 MB + 3m x 500 MB + 12m x 2 MB + 5m x 5 MB + 60m x 50 MB = about 6 petabytes. This doesn't include video. At 5 GB/video, it only takes 300,000 videos to match the scanned size of all the books (1.5 petabytes), so let's call the collection an even 10 petabytes. This is quite a bit larger than the size people often use, but more realistic.
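The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same back-of-envelope estimate: the counts are the rounded figures from the paragraph, the per-item sizes are the same guesses, and decimal units (1 MB = 10^6 bytes) are an assumption.

```python
MB = 10**6   # decimal megabyte (assumption)
GB = 10**9
PB = 10**15

# (rounded item count, assumed bytes per item), as in the estimate above
COLLECTION = {
    "books": (30_000_000, 50 * MB),        # 500 pages x 100 KB/page
    "recordings": (3_000_000, 500 * MB),
    "photographs": (12_000_000, 2 * MB),
    "maps": (5_000_000, 5 * MB),
    "manuscripts": (60_000_000, 50 * MB),
}


def size_in_bytes(kind):
    """Estimated size of one part of the collection."""
    count, per_item = COLLECTION[kind]
    return count * per_item


total_pb = sum(size_in_bytes(kind) for kind in COLLECTION) / PB

# Number of 5 GB videos needed to equal the scanned books.
videos_to_match_books = size_in_bytes("books") // (5 * GB)

print(f"total without video: {total_pb:.2f} PB")
print(f"videos to match the books: {videos_to_match_books:,}")
```

Books and recordings each come to 1.5 petabytes and manuscripts to 3, which is where nearly all of the total comes from; photographs and maps barely register.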
Continue reading "Entire Library of Congress" »
In a previous post on FlatFiles, I mentioned that we have developed a distributed version using MPI to work with mapreduce. In this model an application first checks to see if the key of the record it wants is in its own address space. If not, it broadcasts a request for the record. Listeners on all the nodes check to see if they have it, and the one that does returns it.
I wondered if HTTP could be used instead of MPI. We always like to use the standard Python libraries when possible for portability and ease of maintenance. In the application I modeled for this, the key is the OCLC number and the data is the associated bibliographic record (currently we have [about] 60 million of these).
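To give a feel for the idea, here is a minimal sketch using only the standard Python libraries. It is not the code we run: the `/record/<key>` URL layout, the `RECORDS` dictionary, and the names `start_node` and `fetch_record` are all made up for illustration, and it asks the other nodes one at a time rather than broadcasting as the MPI version does.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for the records held by one node, keyed by OCLC number.
RECORDS = {"12345": "<record>example bibliographic record</record>"}


class RecordHandler(BaseHTTPRequestHandler):
    """Answers GET /record/<key> with the record, or 404 if this node lacks it."""

    def do_GET(self):
        key = self.path.rsplit("/", 1)[-1]
        record = RECORDS.get(key)
        if record is None:
            self.send_error(404, "no such record on this node")
            return
        body = record.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/xml; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real node would log requests


def start_node(port=0):
    """Start a listener in a background thread (port 0 picks a free port)."""
    server = HTTPServer(("127.0.0.1", port), RecordHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Talk to the other nodes directly, ignoring any proxy settings.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_record(key, local, node_urls):
    """Check the local records first, then ask each node in turn."""
    if key in local:
        return local[key]
    for base in node_urls:
        try:
            with _opener.open(f"{base}/record/{key}", timeout=5) as resp:
                return resp.read().decode("utf-8")
        except urllib.error.URLError:
            continue  # a 404 or an unreachable node: try the next one
    return None
```

A caller would start a listener with `server = start_node()`, build the node's base URL from `server.server_address`, and then call `fetch_record("12345", {}, [url])`. Asking nodes in sequence is the simplest thing to write, but it costs up to one round trip per node, which is exactly the trade-off against an MPI broadcast that would need measuring.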
Continue reading "Mapreduce and web services" »
Sunday, June 26th, I'll be a speaker at the LITA-ALCTS CCS Authority Control in the Online Environment Interest Group's program XML and Authority Control that Manon Theroux of Yale organized. It runs from 1:30-4:30 pm at the McCormick Place Convention Center, room S405.
I'll be talking primarily about a web service (more information here) that we've been running for a couple of years that gives access to the NACO authority file.
We developed this originally for the JISC-funded ePrints UK project. I don't think we ever really got hooked up during this project, but we took the same service and prototyped access from within DSpace's metadata creation form, which didn't take much code at all. The records come back ranked by how closely they match the authority records, and we've done some work on improving the ranking, but that's not exposed yet in the public service.
Continue reading "XML and authority control" »
What school granted the degree for each of the five million thesis and dissertation records in WorldCat? In MARC-21, this information is entered in the 502 field, e.g.
>502 Thesis--University of Illinois at Urbana-Champaign, 1976.
This is useful for people looking at a particular record, but because the institution name is not controlled and the whole field is fairly loosely defined, it isn't as much help as it should be for someone who wants to know how many theses were produced at UIUC in 1976, or for someone who wants to restrict their searches to theses from the University of Illinois.
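To make the problem concrete, here is a rough sketch of pulling the institution and year out of a 502 note shaped like the example above. The pattern and the name `parse_502` are illustrative assumptions, not how we actually process the records; because the field is loosely defined, a pattern like this only handles notes that follow the "Thesis--Institution, year." shape and returns nothing for the rest.

```python
import re

# Matches notes of the form "Thesis--<institution>, <4-digit year>."
# Anything that departs from that shape is left unparsed.
THESIS_NOTE = re.compile(
    r"^Thesis\s*--\s*(?P<institution>.+?),\s*(?P<year>\d{4})\.?\s*$"
)


def parse_502(note):
    """Return (institution, year) for a simple thesis note, else None."""
    match = THESIS_NOTE.match(note.strip())
    if match is None:
        return None
    return match.group("institution"), int(match.group("year"))


print(parse_502("Thesis--University of Illinois at Urbana-Champaign, 1976."))
```

Even when the pattern matches, the institution comes back exactly as it was typed, so "University of Illinois at Urbana-Champaign" and "UIUC" would still count as different schools. That is the uncontrolled-name problem described above, and a regular expression doesn't solve it.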
Continue reading "What school?" »
I've had more contact with Elsevier recently than I've had since the Tulip project. Two weeks ago I met with the content/marketing managers of Scirus. They've been using the NDLTD database we maintain of electronic thesis metadata to help them do full text indexing of theses. They had some issues with the data (some of which we've been able to correct) and we had a good discussion about the problems and opportunities here. Here's a search on 'dam safety' in the NDLTD subset of Scirus, and another showing retrieval on text from the PDF file.
I was very impressed with the Scirus team.
Continue reading "Scirus and Elsevier" »