
Presentation in Krakow

The U.S. State Department (the Office of International Information Programs Global Issues and Communications, or IIP/T/GIC) asked me to speak today to a group of information science students in Poland via a video hookup.  OCLC does this sort of thing between divisions around the world, but the State Department arranged for me to use the facilities at a little office near downtown Columbus.  This is the first time I've been on the delivery end of this sort of thing outside the more informal OCLC video meetings, and I'd say the technology worked pretty well.  The connection dropped once for a few seconds, but then came back by itself.  The sound quality was good and the video a bit slow, but adequate.  I had some PowerPoint slides, but wasn't able to show them.

The students seemed interested and probably better informed about the state of libraries in the U.S. than I am about those in Poland.  We talked about how changes in 'information architecture' are changing catalogs.  Because of the transmission delay it is hard to have a normal conversation, but it is possible to have a question and answer session, and being able to see the audience makes it much easier to interact with them.  We talked for about 40 minutes and I hope they enjoyed it.  I know I did.

I thought about putting a picture of Krakow in this posting (I think that is where the conference this session was part of is being held), but really there was very little feel of being in Poland, other than the people I was talking with.  As for the room they were in, it could have been anywhere, and my room was equally anonymous.

--Th

Compact MARC display

There has been some interest this week on the Code4Lib mailing list (312 subscribers) in displaying MARC-21 records in a 'compact' format.  Some of you will recognize the similarity of the accompanying screen shot to the format used on cards in the past.

While it is hard to defend card catalogs' retrieval capabilities, and the standard 75x125 mm cards (see note below) are a bit of a constraint for display, I always thought catalog cards, especially the typeset ones, were very readable.  They especially excel at presenting a quick view, which made it possible to flip through a group of cards very quickly (something you often needed to do in card catalogs).  I always suspected that card images got a bad reputation because people would set up trials asking users to 'identify the publisher' and then give them a card image with the publisher buried in it and a screen display with an explicit 'PUBLISHER' label in front of the publisher information.

I suspect it would be possible to construct a test where the more compact card image would do much better, and I still miss them after all these years.  In the interest of nostalgia, possibly better bibliographic displays, and a request on Code4Lib, I brushed up something in XSLT I did a few months ago.  Currently there are four files available:

  • compact.tgz--a compressed tar file of the next three files
  • compact.xsl--the XSLT that transforms the XML to HTML
  • compact.css--the CSS file used by the HTML for formatting
  • mudlumps.xml--a MARC-21 collection of 17 records from WorldCat

The mudlumps.xml file has a stylesheet call-out in it so that viewing it with most browsers should result in a formatted display, once you have the .xsl and .css files downloaded.  With IE you might be able to point directly at mudlumps.xml and see a formatted display.

This is, at best, a work-in-progress.  Many fields don't display and there are some formatting errors in the ones that do, but it does show what could be done with a bit of effort.
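For readers who would rather experiment outside XSLT, here is a minimal Python sketch of the same idea: pull a few fields out of a MARCXML record and lay them out roughly like a card.  It is not a port of compact.xsl.  The sample record is made up for illustration (it is not one of the records in mudlumps.xml), and the choice of fields (100, 245, 260, 300) and the helper names field_text and compact_card are my own assumptions.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# A made-up record for illustration only; it is not taken from mudlumps.xml.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Doe, Jane.</subfield>
    </datafield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Mud lumps :</subfield>
      <subfield code="b">a field guide /</subfield>
      <subfield code="c">Jane Doe.</subfield>
    </datafield>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="a">Columbus :</subfield>
      <subfield code="b">Example Press,</subfield>
      <subfield code="c">1999.</subfield>
    </datafield>
    <datafield tag="300" ind1=" " ind2=" ">
      <subfield code="a">120 p. ;</subfield>
      <subfield code="c">23 cm.</subfield>
    </datafield>
  </record>
</collection>"""


def field_text(record, tag):
    """Join the subfields of the first datafield with this tag, or return ''."""
    field = record.find(f"marc:datafield[@tag='{tag}']", MARC_NS)
    if field is None:
        return ""
    subfields = field.findall("marc:subfield", MARC_NS)
    return " ".join(sf.text.strip() for sf in subfields if sf.text)


def compact_card(record):
    """Card-like layout: heading on its own line, then an indented body."""
    heading = field_text(record, "100")
    parts = (field_text(record, "245"), field_text(record, "260"))
    body = "  ".join(p for p in parts if p)
    collation = field_text(record, "300")
    lines = [heading, "   " + body, "   " + collation]
    return "\n".join(line for line in lines if line.strip())


root = ET.fromstring(SAMPLE)
for rec in root.findall("marc:record", MARC_NS):
    print(compact_card(rec))
```

The ISBD punctuation already stored in the subfields does most of the work here, which is part of why card-style displays can stay so compact.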

--Th

Note:  Some people think library card catalogs use(d) 3x5 cards, but they've been 125x75 mm ever since our own Melvil Dewey persuaded ALA at its first annual meeting in 1877 to adopt a metric standard, which in turn set the U.S. up for the distribution of printed cards by the Library of Congress in 1898.  Thanks to Leslie Dillon for the information.

Related post: Adirondack Loj

Languages I never used

There are a few programming languages whose manuals I read, but that I never got a chance to try out.  Here are some of my favorites:

  • EL1--Extensible Language 1.  I seem to have misplaced this manual.  The idea was a language that let you extend its syntax.  Probably ahead of its time.
  • ICON--A follow-on to SNOBOL.  I once took a course from RBK Dewar, the author of SpitBol, which was a compiled version of SNOBOL.  Writing programs in SNOBOL changes how you think.  ICON was a more modern string-processing language, but it never caught on.
  • Algol-68--A successor to Algol-60.  I used to dream about being able to do 'slices' on arrays!  Plus, everything looked so neat and tidy, and the definition includes quotes from the Mikado.  I had a CS professor at UIUC who was on the committee that wrote the definition.
  • Metafont84--A successor to the original Metafont (Metafont-79).  We did a lot of font work here at OCLC in the early electronic publishing days of the 1980s when fonts were a hard problem for technical publishing (not that that's gone away entirely).  The later Metafont was a lot better (despite sticking to cubic splines), but incompatible with the earlier.  We never made the transition.  Incidentally, the version of Metafont we used was translated into Tandem's TAL language (by hand) from the original SAIL.  I'm still surprised the thing ran, since some of that SAIL code was difficult, to say the least, but of course you don't have to understand code to translate it; you just have to understand what it is doing.
  • DDL--Document Description Language.  Imagen Corporation was one of the early companies selling laser printers.  They made their own wet-toner printer (we obtained one by swapping some of our Metafont-79 fonts), and I remember seeing the original Canon engine hidden under a box at their headquarters in California before it was public, before they got their lunch eaten by HP.  Anyway, DDL was a successor to the much simpler imPRESS, trying to incorporate the functionality of PostScript while retaining more of a page description language.  I've still got a manual for it, but never saw it run.

Did anyone ever do any serious programming in these languages?

Then there are 'real' languages whose manuals I read, but never used (though it seems like everyone else has):

  • Lisp--I still don't get the attraction.
  • SmallTalk--The minute I saw a description of the Xerox Alto computers, I knew I wanted one, and SmallTalk was part of what made them so attractive.
  • PostScript--PostScript is a real language and you could do some serious image processing in it, but beyond trying to write an interpreter for it in Ada, I never wrote much PostScript code.  Looking at the language (which is loosely based on FORTH), it was obviously designed to be written, not just generated by word processors.
  • COBOL--I remember being in library school in 1970 reading a COBOL manual and writing pages and pages of COBOL code for an independent study course.  Unfortunately, the school didn't have any access to a computer, so the code never got close to executing.

I'm sure there are a few more lost in the mists of time, but that's enough for a Friday afternoon.

--Th

Normalizing &

As I'm sure most of you are aware, the ampersand is a stylized et, the Latin word for and.  This is more obvious in some fonts than others, such as Californian FB (which is based on a Goudy font), or even more explicit in French Script MT.

Wikipedia has an interesting article about Ampersand.

The NACO normalization rules mentioned previously treat the ampersand as a first-class character and it is retained unchanged.  This causes some problems when trying to bring together Pride & Prejudice and Pride and Prejudice.  The simple thing to do would be to change all examples of the word and into & during normalization (or vice-versa).  One of our questions was whether there are entries in the authority file which differentiate between headings solely on the basis of & vs. and.

There are at least 36 pairs of records that appear to do so.  Closer examination shows that almost all of these are in error:

n85-270719: =110  1 $aPhilippines.$bGarments & Textile Export Board.

no97-018657: =110  1 $aPhilippines.$bGarments and Textile Export Board

Often one of these will have a 410 cross reference to that alternative form of the name.

There is at least one pair that looks 'correct':

nr2004-018201: =110  2 $aRussell & Jones

nr2001-023662: =110  2 $aRussell and Jones

The first of these has a note:

=667    $aNot to be confused with Russell and Jones, printers in Hartford, Connecticut

In the interests of clarity it would seem that a qualification (possibly by city) would reduce confusion.

Of course there are additional complications in trying to replace & with and.  Sometimes & is used in place of et, e.g. &c. for etc. or & al for et al.  There are also many examples in WorldCat (which has more than seven million records with ampersands) of usage in languages other than Latin and English, such as Kapital & Karma vs. Kapital und Karma.
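The search for such pairs can be sketched in a few lines of Python.  This is only an illustration, not the code Jenny ran: simple_key is a simplified stand-in for normalization (it is not the NACO rules), and the function names are my own.  The headings are the ones quoted above with the MARC subfield codes removed.

```python
import re
from collections import defaultdict


def simple_key(heading):
    """Simplified stand-in for normalization: lowercase, keep '&', drop other punctuation."""
    text = re.sub(r"[^\w\s&]", " ", heading.lower())
    return " ".join(text.split())


def and_key(heading):
    """The same key, but with every '&' replaced by the word 'and'."""
    return " ".join(simple_key(heading).replace("&", " and ").split())


def ampersand_pairs(headings):
    """Return groups of keys that become identical once '&' is read as 'and'."""
    groups = defaultdict(set)
    for heading in headings:
        groups[and_key(heading)].add(simple_key(heading))
    return [sorted(keys) for keys in groups.values() if len(keys) > 1]


headings = [
    "Philippines. Garments & Textile Export Board.",
    "Philippines. Garments and Textile Export Board",
    "Russell & Jones",
    "Russell and Jones",
    "Pride and Prejudice",
]
for pair in ampersand_pairs(headings):
    print(pair)
```

Note that the blanket replacement in and_key is exactly the 'simple thing' discussed above, so it shares the same weaknesses: it would turn &c. into 'and c' and Kapital & Karma into 'Kapital and Karma'.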

Jenny Toves processed the authority files to find the heading pairs.

--Th

CORC and cool URLs

Lorcan Dempsey's post on Rank, recommend and relate reminded me of the 'Selection Description Access' tag line we used for the CORC project in 1999.  (CORC was an experiment in cataloging electronic resources which morphed into OCLC's Connexion cataloging service of today.)  Doing a Google search for "Selection Description Access" turned up a page created by Roger Brisson while at Penn State with links to pages at other libraries about their CORC experience.  Roger's page has 10 links on it:

  • One uses a PURL to a page at OCLC which now points to a page about Connexion
  • Three other links still work and go to the pages you might expect (good for Cornell, MIT, and U of Washington!)
  • The other six links fail

I suppose these are fairly typical numbers for this sort of page, but I thought it was interesting, and another example of how quickly even library-created pages and links can fail, and of how important cool URLs are.

--Th

NACO Normalization

Proper normalization is crucial to matching strings, and standard normalization is crucial to the interoperability of library authority files, such as NACO's.  Ed O'Neill, Jenny Toves, and I have written a paper (to appear in LRTS) about what we think should be done with the ageing NACO normalization specification.  Although the addition of non-Roman scripts to the normalization process is needed, we restrict ourselves to what can be done to improve the Roman script handling.  Ed identified several characteristics a normalization algorithm should possess:

  • Intuitive
  • Simple
  • Repeatable
  • Generalizable
  • Sortable

Update (2006 March 16): Andy Houghton in a comment below adds 'Compatible with XML' to that list, another problem with the current NACO scheme.

The current NACO normalization procedures fail at least the Repeatable and Generalizable criteria, and we offer some suggestions on how to fix them.  We also have a Web site where you can see our normalization routines and some files to help you check to see if your normalization code acts the way we think the current NACO normalization should.

--Th

Innovator's Dilemma

As I was trying to describe how we handle matches that are identified outside our standard algorithm for finding FRBR work-sets, I ran into an interesting example that highlights some of the problems involved in making this work reliably with a large, diverse database.  The work-set algorithm relies primarily on author-title keys extracted from the records, sometimes with help from the NACO authority files.  Clayton Christensen's The Innovator's Dilemma is an interesting example because it has a number of title variants and our original xISBN implementation was not commutative--that is, the ISBNs in a given group were not all guaranteed to return that same group.  The commutative problem (first pointed out by Jon Udell and given a name by Eric Hellman) is now fixed in xISBN, but we still have problems pulling together all the versions of a work in WorldCat.  Here's a list of the title variants I was able to find in WorldCat for Innovator's Dilemma (the $ indicates a MARC subfield delimiter):

  • El dilema de los innovadores
  • The innovator's dilemma
  • The innovator's dilemma :$when new technologies cause great firms to fail
  • The innovator's dilemma :$the revolutionary book that will change the way you do business
  • The innovator's dilemma :$the revolutionary book that will changed the way you do business
  • The innovator's dilemma :$the revolutionary book that changed the way we do business
  • The innovator's dilemma :$when new technologies cause great firms to fail
  • イノベーションのジレンマ :$技術革新が巨大企業を滅ぼすとき

It turns out that we actually pull all of these titles together, except for the last one, a Japanese translation.  The string Innovator's dilemma is in the OCLC MARC record for the Japanese translation, but in a 730 (added entry-uniform title).  We don't look at the 730 field when creating work-set keys because it's too general: we have no way of knowing what the relationship is between it and the main work the manifestation is part of.  Now, the record evidently originated at Waseda University in Tokyo.  If you go to the Open WorldCat page for its ISBN you can actually get into their OPAC, see their record and ask for a MARC display.  They put the Innovator's dilemma string in a 534$t (title statement in an original version note), which makes perfect sense.  We don't currently look at the 534$t, but we could.  Somewhere along the line (probably at Waseda) this got moved into the too-general 730$a field, which doesn't do us much good.
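To make the key idea concrete, here is a minimal Python sketch.  It is not our work-set algorithm: the record layout (plain dictionaries), the preference for a uniform title over the title proper, the dropping of a leading English article, and the assumption that the Spanish record carries a uniform title are all hypothetical simplifications of mine.  It does show why an English title in a field the key builder never reads leaves the Japanese translation in a set of its own.

```python
import re
from collections import defaultdict


def normalize(text):
    """Simplified normalization: lowercase, drop punctuation and a leading article."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    if words and words[0] in {"the", "a", "an"}:
        words = words[1:]
    return " ".join(words)


def work_key(record):
    """Author plus uniform title if present, else the title proper (text before ' :$')."""
    title = record.get("uniform_title") or record["title"].split(" :$")[0]
    return normalize(record["author"]) + "/" + normalize(title)


def group_work_sets(records):
    """Collect titles under their author-title key."""
    sets = defaultdict(list)
    for record in records:
        sets[work_key(record)].append(record["title"])
    return dict(sets)


AUTHOR = "Christensen, Clayton"
records = [
    {"author": AUTHOR, "title": "The innovator's dilemma"},
    {"author": AUTHOR,
     "title": "The innovator's dilemma :$when new technologies cause great firms to fail"},
    # Hypothetical: assumes the translation record has a uniform title.
    {"author": AUTHOR, "title": "El dilema de los innovadores",
     "uniform_title": "Innovator's dilemma"},
    # The English title sits in a field the key builder never looks at.
    {"author": AUTHOR, "title": "イノベーションのジレンマ :$技術革新が巨大企業を滅ぼすとき",
     "added_entry_730": "Innovator's dilemma"},
]

for key, titles in group_work_sets(records).items():
    print(key, len(titles))
```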

So, what to do?  If we got the original record from Waseda we'd be forced to understand their local practice well enough to translate it into MARC-21.  That's a difficult problem, especially since there are thousands of MARC variations in the world.  We might well look through the Japanese records for 730 fields that look like they might be helpful and see if we can do some additional matching.  Or we could wait until someone creates an authority record that pulls these together (something that's not likely anytime soon, since the NACO authority file is exclusively Roman-alphabet), or until someone modifies the record in OCLC to make it more obvious that this is a translation of the Innovator's Dilemma by Clayton Christensen.  For now, the two records in WorldCat for this translation will remain unlinked to the main work-set.

Thanks to Eric Childress and Ed O'Neill who offered advice on this.

The cover art came from Kinokuniya BookWeb.  Kinokuniya is OCLC's distributor in Japan.

--Th

Programming Languages--BLISS

A post on Tim Bray's ongoing blog got me thinking about programming languages.  I love programming languages--they're so much easier to pick up than natural languages!  Plus, they keep getting better in one way or another, so there's always a reason to switch once in a while.  And once in a while I do switch.  Over nearly four decades of programming I've written significant amounts (i.e. thousands of lines) of code in at least: Fortran, Snobol-4, PL/I, BLISS, SAIL, PUB, Sigma Assembler, Pascal, APL, Forth, TAL, C, C++, Ada, Awk, Perl, TeX, Metafont, TCL, Java, JavaScript, XSLT, and Python.  In addition I've implemented interpreters or compilers for Forth, Metafont, Lisp and PostScript, so over the years I've had quite a bit of exposure to several languages at a variety of levels.

Everyone talks about what they do or don't like about Ruby, Python, and Java, so instead I thought I'd occasionally write a bit about languages everyone seems to have forgotten.  I'll start with BLISS, which I used as a research assistant working for Martha Williams in her Information Retrieval Research Laboratory at the Coordinated Science Laboratory at the University of Illinois.

BLISS (Basic Language for Implementing System Software) was a 'high level assembly language' for the PDP-10.  People often say that about C, but that's because they've never seen the real thing.  In BLISS, the machine did what you told it to.  No more, no less.  An identifier represented an address.  If you wanted the contents of the address you put a dot in front of it, so assigning the value of 'b' to 'a' would look something like:

a ← .b

Don't type 'a ← b' unless you mean it; that would assign the address of b to a.  Very simple and straightforward, but there wasn't much (any?) context-sensitive 'syntactic sugar' in BLISS.  This used to drive some people crazy, and they'd give up after creating a mess they couldn't straighten out.  On the other hand, you could do things like .(x+.y) to extract the contents of an offset address, or ..x for indirection.  It was very easy to get your dots wrong.
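If the dots are hard to picture, a toy model in Python may help.  This is only a sketch of the semantics described above, not BLISS: memory is a list of words, each name is bound to its address, contents() plays the role of the dot and store() the role of the arrow.  The particular addresses and values are arbitrary choices of mine.

```python
# Toy memory: a list of words.  In BLISS a bare name IS its address,
# so each Python name below is simply bound to an index into memory.
memory = [0] * 16
a, b, x, y = 0, 1, 2, 3


def contents(addr):
    """The dot operator: fetch the word stored at an address."""
    return memory[addr]


def store(addr, value):
    """The assignment arrow: store a value at an address."""
    memory[addr] = value


store(b, 42)
store(a, contents(b))                     # a <- .b   copies the value 42
value_copy = contents(a)

store(a, b)                               # a <- b    stores the ADDRESS of b
address_copy = contents(a)

store(y, 3)                               # y holds an offset
store(x + contents(y), 99)                # put something at address x + 3
offset_fetch = contents(x + contents(y))  # .(x+.y)

store(x, 7)                               # x holds a pointer to address 7
store(7, 123)
indirect = contents(contents(x))          # ..x

print(value_copy, address_copy, offset_fetch, indirect)
```

Leaving out a single contents() call anywhere above still runs without complaint, which is the Python equivalent of getting your dots wrong.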

Since you knew exactly what sort of instructions BLISS was going to produce, it was easy to mix in assembler.  This was pretty neat, giving you access to all the bit and byte twiddling instructions the PDP-10 was famous for, and which most languages completely ignored and isolated you from (although BLISS had ways to do a lot of that sort of thing within the language too).

The BLISS I used (and I think the original) was later called 'BLISS36' or 'BLISS-10', since it ran on the 36-bit word PDP-10 (you had a choice on how many bytes were in that word).  There was evidently a 'Common BLISS', but I never had any contact with it.  BLISS-10 made no attempt to be machine independent.

I've still got an official decsystem10 BLISS-10 Programmer's Reference Manual for Version 4 published by DEC in 1974, although it doesn't look used, so I must have been working from an earlier edition most of the time.  I think page 1-1 is worth reproducing:

CHAPTER 1

LANGUAGE DEFINITION

1.1 AN EXPRESSION LANGUAGE

The programming language BLISS-10 enables programmers (persons who converse in a programming language) to construct text (programs) which evoke computations to transform input into a desired result.  Programs written in BLISS-10 consist of declarations, which establish structure, and expressions sequenced to compute results.  Expressions in BLISS-10 can assume remarkably complex forms built up from elementary forms; but regardless of their complexity every expression computes a value. This notion, that a BLISS-10 program consists solely of declarations and expressions, and that these expressions can become arbitrarily complex yet compute a single value during the execution of the program, represents a key concept the reader must understand to properly construct programs in BLISS-10 and fully exploit its power. The concept of a statement prevalent in many programming languages has no meaning in BLISS-10. In reading this manual the reader should strive to master the implications and meaning of the following statement:

BLISS-10 is an expression language.

Once in a while I miss that in Python.  A consequence of this was no go-to's in BLISS.  Instead of 'go-to', the original BLISS had 8 different 'escape' instructions to break different types of control structures, but by the time I used it you could label expressions and use that label to break out of them.

BLISS also had coroutines, which I tried out and remember thinking were neat.  I'm still waiting for just the right application for coroutines.

BLISS was developed by W. A. Wulf at CMU, who was involved in a number of programming language activities there.  You may have heard of HYDRA and Tartan Laboratories, which he led.  I still fondly recall Harbison & Steele's C: A Reference Manual, which came out of that work.  Work on BLISS started in 1969 or 1970, possibly slightly before C began at Bell Labs, so the idea of a language suitable for writing systems software was being discussed, but was far from an accepted idea at the time.

References:

W. A. Wulf, BLISS reference manual: a basic language for implementation of system software for the PDP-10, 1971.

"BLISS: A Language for Systems Programming", W. A. Wulf et al, CACM 14(12):780-790 (Dec 1971).

BLISS-10 Programmer's Reference Manual DEC-10-LBRMA-A-D, Digital Equipment Corporation, Maynard, MA. 1974.

--Th

OpenURL 1.0 for Reptilian Brains

"And now for something entirely different."  Jeff Young has been trying to convince people to use OpenURL 1.0 where it is appropriate (and maybe a few places where it's not so appropriate but that isn't the case here), with little success.  Here's his late-night-rant trying to explain it one-more-time to people-that-just-don't-get-it (lightly edited for family viewing):

OpenURL 1.0 for Reptilian Brains

According to conventional wisdom, OpenURL 1.0 is unnecessary and/or too complex for certain needs. I intend to refute both notions.

"After all, it's hard to argue against the reptilian brain"

If you read this article and respond with "OK, but what about ContextObject Representations, Referring Entities, and yadda yadda yadda", it is irrefutable proof of a reptilian brain at work. Start over. Forget everything you (think you) know about OpenURL 1.0. Everything. Kill the Buddha! If it's not mentioned here, chances are excellent that you shouldn't care.

What is OpenURL good for?

OpenURL provides a simple, common, URL convention for invoking web services. Services built using these conventions can be mixed and exploited by others in innovative ways that you can't possibly imagine. I'm serious. If you choose to invent a new protocol knowing that the conventions described here would work perfectly well, you are willfully denying the world of this wondrous potential for interoperability and will surely be ostracized from the "Plays Well With Others" club.

What are the significant limitations of OpenURL?

OpenURL is a machine-to-machine interface. As shown below, the URLs aren't difficult to construct, but their general dependence on URIs makes them unwieldy for human consumption. Actually, the human-to-machine factor isn't as significant as it might appear, but that's a different story.

How simple is "simple"?

Here is an example of a request that is 100% OpenURL 1.0 compliant:

Ignore the result produced by this particular example. As OpenURL service providers, you and your cohorts get to decide this for yourselves.

What does this mean?

Why does everyone do this differently? It's a full-time job for someone at OCLC to decipher these kinds of systems so we can use them in clever ways. Imagine the possibilities for interoperability if everyone encoded the ISBN in an rft_id parameter instead!

Why call this field "rft_id"? Why not call it "uri" or "foo" instead?

It's a small price to pay for all the benefits of OpenURL compliance. Someday you'll understand. In the meantime, <text deleted /> deal with it.

"rft_id" means "Referent Identifier". This tells the web service the identity of the resource your request refers to (e.g. an ISBN). This is an awkward way to say something that should be obvious. We can all agree that the label "rft_id" is as ugly as sin. Its aesthetics were obviously inspired by a carburetor. From a mechanic's perspective, however, the beauty is sublime. You bolt it on and off you go. Without it, you're parked. (updated 6 March 2006).
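For the mechanically inclined, here is a minimal Python sketch of bolting it on.  It builds a request carrying only the two parameters this article names, url_ver and rft_id, and then pulls the identifier back out the way a service would.  The resolver address and the ISBN are placeholders I made up, and the function names are hypothetical; only the parameter names and the Z39.88-2004 version string come from the standard.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical service address; substitute your own.
RESOLVER = "http://resolver.example.org/openurl"


def build_openurl(identifier):
    """Build a request carrying only url_ver and rft_id."""
    query = urlencode({"url_ver": "Z39.88-2004", "rft_id": identifier})
    return f"{RESOLVER}?{query}"


def referent_id(url):
    """What a service does on receipt: pull rft_id back out of the query string."""
    return parse_qs(urlparse(url).query)["rft_id"][0]


url = build_openurl("urn:isbn:0000000000")   # placeholder ISBN, not a real book
print(url)
print(referent_id(url))
```

What the service does with the identifier once it has it is, as the article says, up to you and your cohorts.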

But what about Context Objects, Transports, and yadda yadda yadda?

Reptilian Brain alert! Here is some pseudo-code for a 100% compliant OpenURL 1.0 resolver:

 servlet(args) {
   write 'Hello ' + args['rft_id']
 }

Now, go back to the beginning of this article and start over.

Call now!

For the low monthly fee of url_ver and rft_id, you will not only organize your own default services for known items but win the praise and patronage of friends, neighbors, and complete strangers too! But wait! That's not all! Call in the next 10 minutes, and receive free... FREE!... 12 great attachments to do the dishes, mow the lawn, take out the garbage, and much MUCH more! Our operators are standing by!

Ring Ring Ring

Hello?

Hi, I want to order OpenURL 1.0, but I don't want the free dishwashing attachment.

No problem.

Great. But that leads to another question. If you don't send me the attachment, how will it do my dishes?

IEEEEEEEEEEEE! <thump/><thump/><thump/>

Note by Th: Another page in a similar vein from what is now part of OCLC: Idiot's guide to Implementing OpenURL 1.0 for Journal Articles

Repository Comparison

'Repository' means different things to different people, even within the library community, but there are some common themes.  Last year Ralph LeVan put together a framework to help us think about and compare some repositories we were involved with at the time.  Since then Jeff Young has done a lot of work on WikiD which grew out of our work with OAI-PMH repositories.  Jeff and I occasionally have discussions about how to explain WikiD to people and it occurred to me that it would be interesting to add WikiD to the three repository systems that Ralph looked at and see how it compares.

The three other systems are CONTENTdm, DSpace, and Fedora.  As you are probably aware these are very different systems--so different that it may never have occurred to you to compare them on a feature-by-feature basis, but that's what Ralph did.  Here are three tables, each focusing on a set of features: Data Support, User Support, and Miscellaneous Infrastructure.  A solid bullet means good support, an open bullet partial support, and a blank implies the system doesn't have anything worth mentioning for that feature.

Data Support

CONTENTdm  DSpace  Fedora  WikiD
Arbitrary Bitstreams .
Arbitrarily Complex Objects
Versioning. .
Local Metadata Elements .
Preservation Metadata. .
Batch Input
Rich Metadata Searching
Full-text Searching .

WikiD can accept arbitrary XML, but doesn't handle binary objects yet.  Its batch input (the ability to read in a file of objects) is limited to MARC-21 or something in an OAI-PMH repository.  Ralph is working on giving Pears the ability to automatically generate descriptions of arbitrary XML databases, so that will extend WikiD's searching capabilities (right now something like full text search requires manual editing of Pears configuration files).

User Support

CONTENTdm  DSpace  Fedora  WikiD
User Roles with Privileges . .
Workflows. . .
Object Marshalling . .
Arbitrary Bitstream Retrieval .
Arbitrary Object Retrieval. . .
Web Interface .
Content Easily Integrated into Web Pages .

Object marshalling is the ability to compose a complex object (e.g. an album consisting of multiple mp3 files plus other information about the album) for submittal, and possibly manage the relationships between the component objects.  Other than the Web interface that comes naturally with WikiD, it doesn't do so well in this chart.

Miscellaneous Infrastructure

CONTENTdm  DSpace  Fedora  WikiD
OAI-PMH
Z39.50/SRW .
Open Source.
Open APIs .
Cross-Repository Searching . .

Miscellaneous infrastructure is where WikiD really shines.  In fact that's what it is built on!

I realize that not too many people can critique how WikiD is presented in the charts, but there are many people who have more experience with some of the others than we do.  Does anyone take issue with the marks?  Maybe our biases are showing, or our information is just dated.  What other sorts of features should have been included?

I suspect that most repository selection decisions are made with only a passing interest in the capabilities reflected in these charts, but as soon as you want your repository to do something new, these characteristics can become very important.

If there is enough interest, I'll consolidate the comments (both here and email) and do some new charts.

--Th
