Flat files

For our processing of bibliographic records with Python, one of the more useful classes we've developed is something we call flat files.  A flat file has a very simple structure: it is composed of lines separated by line feeds, and each line is a key-value pair separated by a tab.  The keys are required to be unique and sorted.  It is a very simple format but, like Python dictionaries, very useful.

We use this structure for loading our authority files into memory for processing.  The whole file is read in as a single string, and then a binary-chop is used to find the data associated with a particular key (well, actually we do a little more than that now).
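A minimal sketch of how such a lookup could work, not the actual class described here: the whole file is held as one string and a binary chop over character offsets finds the line for a key. The class name FlatFile, the method name get, and the sample data are all hypothetical, and the sketch assumes the keys are sorted in Python's default string order.

```python
class FlatFile:
    """Sorted, tab-separated key/value lines held in memory as one string."""

    def __init__(self, text):
        self.text = text

    def get(self, key, default=None):
        text = self.text
        lo, hi = 0, len(text)  # window of character offsets still to search
        while lo < hi:
            mid = (lo + hi) // 2
            # Find the boundaries of the line that contains offset `mid`.
            start = text.rfind("\n", 0, mid) + 1
            end = text.find("\n", start)
            if end == -1:
                end = len(text)
            line_key, _, value = text[start:end].partition("\t")
            if line_key == key:
                return value
            if line_key < key:
                lo = end + 1   # key, if present, is on a later line
            else:
                hi = start     # key, if present, is on an earlier line
        return default


# Hypothetical sample data: three sorted keys, one per line.
authorities = FlatFile("adams\tAdams, John\nbach\tBach, J. S.\nzola\tZola, Emile\n")
print(authorities.get("bach"))    # Bach, J. S.
print(authorities.get("mozart"))  # None
```

Keeping the data as a single string avoids building a dictionary entry per record, at the price of requiring the file to be sorted in the same order the comparison uses.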

Beowulf Cluster

I wrote a short piece for the OCLC Newsletter about the cluster we've been using over the last year.

--Th

ERRoLs

As this points out, Jeff Young's ERRoLs are a form of cool URLs, much as PURLs are; PURLs are another system my group here in OCLC Research maintains.  PURLs basically add the 'another level of indirection' that is one of the basic ways of solving problems in computer science.  You give someone a PURL, and when they use it the PURL system resolves it to another URL, giving you a central place to manage the resolution of your URLs.
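The indirection can be pictured as a lookup table that sits between the stable name and the current location. This is only an illustrative sketch, not how the PURL software is implemented: the table, the paths, the example.org URLs, and the function name resolve are all made up, and a real resolver would answer over HTTP with a redirect rather than return a string.

```python
# Hypothetical resolution table: stable PURL path -> current target URL.
PURL_TABLE = {
    "/example/report": "https://www.example.org/archive/2005/report.html",
}


def resolve(purl_path, table=PURL_TABLE):
    """Return the URL a PURL currently points to, or None if it is unknown."""
    return table.get(purl_path)


before = resolve("/example/report")

# When the document moves, only the central table changes;
# everyone holding the PURL keeps using the same name.
PURL_TABLE["/example/report"] = "https://docs.example.org/report.html"
after = resolve("/example/report")
```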

ERRoLs give you a bit more, however, in that the ERRoL system will do protocol transformations too.

MapReduce

Jeffrey Dean and Sanjay Ghemawat of Google have written a paper about a method of processing large data sets they call MapReduce.

Many will be familiar with the functional programming constructs map and reduce.  Map applies a function to each element of a list to get a transformed version of the list.  For example, in Python, map(chr, [97,98,99]) transforms a list of three numbers into a list containing the equivalent characters:

>>> list(map(chr, [97,98,99]))
['a', 'b', 'c']

It's as if you executed [chr(97),chr(98),chr(99)].

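Reduce takes a two-argument function and applies it cumulatively to the items in the list, from left to right, resulting in a single value:

>>> import operator
>>> from functools import reduce  # reduce is a builtin in Python 2; Python 3 needs this import
>>> reduce(operator.add, ['a','b','c'])
'abc'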

This is the string formed by the operations ('a'+'b')+'c'.  This programming style lends itself naturally to nesting:

>>> reduce(operator.add, map(chr, [97,98,99]))
'abc'
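The left-to-right grouping described above can be spelled out with an explicit loop. This is a sketch for illustration only; the name fold_left is made up, and it mirrors reduce only for the case where no initial value is supplied.

```python
import operator


def fold_left(func, items):
    """Combine items left to right: func(func(a, b), c) and so on."""
    iterator = iter(items)
    try:
        result = next(iterator)
    except StopIteration:
        raise TypeError("fold_left() of an empty sequence") from None
    for item in iterator:
        result = func(result, item)
    return result


letters = [chr(n) for n in [97, 98, 99]]     # the map step
combined = fold_left(operator.add, letters)  # the reduce step: ('a' + 'b') + 'c'
print(combined)  # abc
```

The order matters for functions that are not associative: fold_left(operator.sub, [10, 3, 2]) computes (10 - 3) - 2.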

"Bamboo and pear"

As I was fussing with some samples of flooring made out of bamboo, the phrase 'bamboo and pear' came to mind as one that couldn't be very common.  Maybe it's never been used?

I was right that it isn't common, but a search of Yahoo finds one or two occurrences, depending on how you count them (Google finds one or three).  Phrases like this were always out there before the Web, but now you can find them.  Something as relatively obscure as "bamboo flooring" gets a quarter-million hits in Yahoo.  "Pear flooring" is a singleton in both, though.

Of course it isn't too hard to come up with phrases you can't find in the search engines.  Changing the search to "bamboo and pearwood" doesn't find anything in either Yahoo or Google.

--Th

Sun Grid

We've been working with a Beowulf cluster for about a year now.  When I heard about the Sun Grid service it sounded pretty attractive.  The reported price was $1/cpu hour plus $1/gigabyte/month for storage.

Our Beowulf cluster cost us slightly more than $100,000 and has 48 cpus.  I estimate that we use around 48 cpu hours/day (that's about enough to run our FRBR algorithm on WorldCat once per day).  Rounding that up to 50 cpu hours/day, at 50 weeks/year and 5 days/week, that comes to 12,500 cpu hours, or $12,500/year.  WorldCat is around 50 gigabytes, but it takes close to 10 times that for comfortable storage with indexes, extra copies, etc.  So storage would be about $500/month, or $6,000/year.

Total cost would be less than $20,000/year ($12,500 for cpu time plus $6,000 for storage).  That's cheap!  Even if you throw in the occasional run-away process that burns up cpu time for a weekend, out-of-pocket costs should be under $30,000/year.  It looks to us as though, unless you're running your cluster flat-out, buying time would be cheaper, and research work tends to be very 'peaky'.  One week we can't get enough time, the next only a few processes get run.
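The arithmetic behind that estimate can be laid out in a few lines. The variable names are just for illustration; the figures are the ones quoted above, with usage rounded to 50 cpu hours/day.

```python
CPU_RATE = 1.00      # dollars per cpu hour (reported price)
STORAGE_RATE = 1.00  # dollars per gigabyte per month (reported price)

# 50 weeks/year * 5 days/week * 50 cpu hours/day
cpu_hours_per_year = 50 * 5 * 50
cpu_cost = cpu_hours_per_year * CPU_RATE

# WorldCat is about 50 GB; allow roughly 10 times that for indexes and copies.
storage_gb = 50 * 10
storage_cost = storage_gb * STORAGE_RATE * 12  # twelve months

total = cpu_cost + storage_cost
print(f"cpu ${cpu_cost:,.0f} + storage ${storage_cost:,.0f} = ${total:,.0f}/year")
# cpu $12,500 + storage $6,000 = $18,500/year
```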

Sparklines

Intense, Simple, Word-Sized Graphics: that's how Edward Tufte describes what he calls sparklines.  Tufte admires the tremendous amount of information letterforms convey.  Having actually worked on some fonts years ago, I find it really amazing how the smallest variations affect the look of a font.

Lorcan says I shouldn't keep talking about the DDC Browser when it's not available to everyone yet, but I couldn't resist a short post about some experiments in increasing the browser's information density.  I don't think I've solved this, but I do have a couple of interesting screen shots.

ETDMS

I mentioned the ETDMS (Electronic Theses and Dissertations Metadata Standard) in my last post.  Although here at OCLC we've been involved with it since the beginning, I was never optimistic about it getting widespread support.

Luckily I was wrong about that.  Of the 185,000 ETDs we harvest, over 130,000 (roughly 70%) have an ETDMS version of the record.  The main thing that ETDMS gives us over simple Dublin Core is more information about the degree being granted, such as the name of the degree, the discipline, and the grantor.  Since the NDLTD is looking into coming up with some standard ways of encoding those fields, I thought it would be interesting to run some quick statistics on what's actually in the fields now.

NDLTD

This week I spent a day at the World Bank (or the World Bank Group as they are now called) at a Networked Digital Library of Theses and Dissertations board meeting.  I've been attending these meetings for years now, and they continue to be interesting.  NDLTD holds a conference each year and ETD2005 will be in Australia in September.  Here at OCLC we harvest ETD metadata (via OAI-PMH) from around the world and make it available for reharvest.
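For readers who haven't seen OAI-PMH, a harvest is a series of plain HTTP requests. The sketch below only builds the request URLs and is not our harvester: the repository address is a placeholder, the function name is made up, and "oai_etdms" is assumed here as the metadata prefix (a repository advertises the prefixes it actually supports through the ListMetadataFormats verb).

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # Follow-up requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Placeholder repository address; substitute a real OAI-PMH base URL.
first_request = list_records_url("https://repository.example.org/oai", "oai_etdms")
print(first_request)
```

Each response may include a resumption token; passing it back in the next request continues the harvest where the previous batch left off.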

NDLTD has developed a standard for ETD metadata called ETDMS, which we plan on revising over the next year.  There are also versions for German, French, and Brazilian theses based on ETDMS.  The UK has slightly different ideas about metadata for ETDs, but we hope we can make all these work together.

3-level server architecture

As I mentioned in my last post about the DDC Browser, we've gone through a lot of versions.  Partly this was to make the interface work better, but much of it was to get the server processing right.  After an initial text-only prototype, all of the interfaces have used HTML (possibly generated from XML) for display.  This is really the only way to go.  Gmail has convinced me that there is precious little that can't be done now in standard Web browsers, and that there is little justification for sending out anything but XML from your HTTP servers for an application like this.

Here are the main architectural stages the interface has gone through:

  • Text-only
  • Simple HTML server
  • HTML server using XMLHttpRequest and JavaScript
  • XML server with XSLT in 4 iframes
  • Pseudo-SRU server in 30+ iframes
  • 3-level server (back to 4 iframes)
