Metasearch and metadata

Lorcan has a long posting about metasearch on his weblog, along with a comment by Judith Pearce of the National Library of Australia asking questions about the role of fielded searching.  An associated question that I've seen in some of NLA's papers is the role of controlled vocabularies.  Since these questions get to the heart of what OCLC's metadata services do, this is important to us.  In addition to a few opinions, I've actually been doing some experimentation that addresses some of the issues.

First some opinions.

  1. Everyone would rather not deal with fielded searching, if they can avoid it
  2. You can't avoid it, at least not always
  3. Controlled vocabularies help
  4. Centralized systems work better
  5. Retrieval systems need ranking
  6. Speed counts


AJAX and standards

The initial impetus for changing to AJAX for Web applications is to improve the user experience by avoiding screen refreshes and increasing the speed of interaction.  Another benefit, however, is that others can take the services your application is built on and build new ones.  I reported earlier on an experiment by Mike Teets with Google Maps.  His first prototype broke after two days because Google changed something in their implementation.  This is the sort of thing that an implementation based on standard protocols would help avoid.

Although one of the basic techniques in AJAX is to use a JavaScript object called XMLHttpRequest, the applications I've looked at don't use XML at all (contrary to what the current Wikipedia entry would have you believe).  I wanted to try something like Google Suggest for the VIAF project and wondered if it is possible to do that in a standards-based way.


Calculating suggestions

I've mentioned before experimenting with a feature similar to Google Suggest.  There are AJAX issues in getting the information displayed properly, but first you have to have the information ready to send.  The processing is fairly simple, but it turns out there are some tricks in calculating it that make both the calculation tractable and the resulting data structure small enough to load into main memory.

The target data structure is a hash table indexed by a string that returns a list of possibilities in ranked order that can be sent to the browser, pretty much instantly.  Since we're working with name authorities that have quite a bit of structure, we need to index them with a normalized form of the name, but return a properly formatted name (upper/lower case, diacritics, and even subfield delimiters for authority work).
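As a rough illustration of the kind of table described above, here is a minimal Python sketch.  The function names, the `max_prefix` and `top_n` limits, and the numeric scores are all assumptions made for the example, not details of the actual VIAF processing; capping the prefix length and the list size is just one plausible way to keep the table small enough for memory.

```python
import unicodedata
from collections import defaultdict


def normalize(name):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() else " " for ch in no_marks.lower())
    return " ".join(kept.split())


def build_suggestions(headings, max_prefix=8, top_n=3):
    """Map each normalized prefix to its top-ranked display forms.

    `headings` is an iterable of (display_form, score) pairs.  The score
    stands in for whatever ranking signal is available (an assumption).
    """
    table = defaultdict(list)
    # Visit headings from highest to lowest score, so every prefix list is
    # built in ranked order and can simply stop growing at top_n entries.
    for display, _score in sorted(headings, key=lambda h: (-h[1], h[0])):
        key = normalize(display)
        for end in range(1, min(len(key), max_prefix) + 1):
            prefix = key[:end]
            if prefix.endswith(" "):
                continue  # typed input is normalized, so it never ends in a space
            bucket = table[prefix]
            if len(bucket) < top_n and display not in bucket:
                bucket.append(display)
    return dict(table)


def suggest(table, typed, max_prefix=8):
    """Return properly formatted headings for what the user has typed so far."""
    key = normalize(typed)
    candidates = table.get(key[:max_prefix], [])
    # Input longer than max_prefix is checked against the full normalized name.
    return [c for c in candidates if normalize(c).startswith(key)]


# Made-up headings and scores, purely for illustration.
headings = [
    ("Twain, Mark, 1835-1910", 500),
    ("Tolstoy, Leo, graf, 1828-1910", 400),
    ("Brontë, Charlotte, 1816-1855", 300),
    ("Twain, Shania", 120),
]
table = build_suggestions(headings)
print(suggest(table, "tw"))  # ['Twain, Mark, 1835-1910', 'Twain, Shania']
```

The lookup key is the normalized form, while the values keep the display form with its case and diacritics, which is the split between indexing and formatting described above.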


WorldCat growth

The Watch WorldCat grow site is always fun to look at.  Right now it is showing 999,410,268 holdings.  Estimates are that we'll hit 1,000,000,000 this week.

--Th

Update August 11, 2005:  I guess it happened, but you can't tell because we turned off the site, evidently to avoid the possibility that someone would see 1,000,000,000 holdings go by and assume that the displayed bib record is the one that got the holding (and our Content Management System was melting down from all the watchers!).  I was disappointed not to watch.

Update August 12, 2005: It's official, it really did happen yesterday.  There's a page here about it, along with the log showing who just missed it (there were three other holdings added in the same second, about normal for that time of day).  Just in case that page changes, here's the interesting part.

--Th

Software licenses

Around three years ago we spent quite a bit of time coming up with a new license for our open source software and got it approved by OSI, the Open Source Initiative.  I was involved, and it wasn't easy.  Unfortunately, we've become disenchanted with it.  The closer I read it, the less I understand it, and most people who want to use our software come up with questions, most of which are hard to answer.

So what should we do?  From Research's perspective, we would like as wide use of our software as possible.  The cooperative, however, might like to get more back than just recognition, such as fees for commercial use, and it would certainly be nice if we had access to modifications people make to our code.  I don't really believe, however, that the GNU GPL would solve our problems, being fairly well persuaded by Eric Raymond's arguments that it actually inhibits the use of open source software.

I've looked at the MIT license, which is very close to the BSD license.  They are certainly short and sweet!  The Apache license is longer, but seems to accomplish much the same thing (you can do anything with the software, but include this notice with it).

Are there other directions we should be considering?  How should a not-for-profit research organization license its code?

--Th

Embedding XSLT in XML

The W3C describes how you can embed a stylesheet within an XML file and then refer to it, allowing the document to transform itself.  Unfortunately IE and Firefox don't seem to support this, but I did find a description of how to embed data within a stylesheet, giving a similar effect.  This doesn't solve my current problem (more about that below), but it does make it much easier to show examples of stylesheets.  In particular, here is the authorities stylesheet described earlier with an embedded SRU searchRetrieve response (looks best in Firefox).

The reason I was trying this was to see if embedding the stylesheet in the SRU response would speed up the displays.  For the VIAF prototype we're working on I was trying to get a Google Suggest-type service to show potential headings as the user typed in a query, and to do it using SRU.  (Google sends the source for JavaScript arrays down and runs 'eval' on it to get objects.)  Two things seem to be slowing this down.  The first is that the SRU responses (with full records) are huge (~150K), slowing both transmission and processing.  Ralph LeVan says that by using a different schema we could get around that.  The other problem is that by the time associated style sheets are pulled in, too many HTTP interactions take place.  Maybe different settings on my browser cache would help, but it would be nice to avoid those calls altogether.
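To make the schema idea concrete, here is a small Python sketch of building an SRU searchRetrieve request that asks for a particular record schema.  The parameter names (`operation`, `version`, `query`, `maximumRecords`, `recordSchema`) come from the SRU protocol; the base URL, the CQL index name, and the `brief` schema identifier are hypothetical placeholders, since which schemas are available depends on the server.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, record_schema, maximum_records=10):
    """Build an SRU searchRetrieve URL requesting a specific record schema."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",  # assumed protocol version for this sketch
        "query": cql_query,
        "maximumRecords": str(maximum_records),
        "recordSchema": record_schema,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server, index, and schema names, for illustration only.
url = sru_search_url(
    "http://example.org/viaf/sru",
    'local.name = "twain*"',
    record_schema="brief",
    maximum_records=5,
)
print(url)
```

Asking for a compact schema and only a handful of records attacks the first problem (response size); it does nothing about the extra HTTP round trips for the style sheets.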

Haven't solved this yet, although if you look at the example's source I have managed to embed the CSS stylesheet within the XSLT.

--Th

XML and CSS

One problem with using XSLT to format Web pages in the browser is that some browsers don't support it, Opera in particular.  Looking into this, I found that the Opera site claims the browser can parse and display XML documents.  The approach you need is to associate a CSS style sheet with the XML and use JavaScript/DOM to do any transformations needed.  Here's a good explanation of the technique.  The example they give, though, only worked correctly for me in Opera, not Firefox or IE.

From all I've read, it looks like Opera has done a great job supporting CSS compared to other browsers, and you can do some amazing things with CSS. So, we could probably make the Dewey Browser's client-side formatting work with Opera, but it would take quite a bit of work.  (The Dewey Browser does make extensive use of CSS, but it is applied to the HTML page that the XSLT generates, not on the raw XML page the server sends out.)  It would probably be easier for us just to detect Opera and do the transformation from XML to HTML on our server, although for the number of visits we see from Opera on our site for other things, we may not even do that right away.

I was wondering why this style of programming never caught on.  It's been available longer than XSLT support, but I wasn't even aware of the possibility, and had certainly never seen examples of it.  Maybe it's a case of a technology being ahead of its time, or maybe DOM transformations are just easier to do with XSLT than with JavaScript.  Or maybe by the time CSS support became reliable, XSLT was there too.  Whatever the reason, I doubt that avoiding XSLT is going to catch on.  Looking around the Web, this seems to have been a bigger topic a few years ago (XSL Considered Harmful), but the arguments get a bit muddy, since they worry about XSL Formatting Objects, which is a direct competitor to CSS and another technology I was only barely aware of, even though a book I sometimes use does talk about it.  As far as I can tell the major browsers don't support XSL-FO (XSL Formatting Objects).  W3C has an overview page of XSL's three parts: XSLT, XPath (the XML Path Language), and XSL-FO.

--Th

MARC authorities formatting

I'm not sure how many people ever worry about formatting authority records, but we've been working on an interface to the Virtual International Authority File (VIAF) which needs to do just that.  Since we're retrieving these records via SRU, they come back in MARC-XML and I put together an XSLT script for the display.  I like it.  It has improvements over Connexion's display, in that the dates and times are formatted for readability.  The records for the two images aren't the same, since the display from VIAF is the PND record from DDB and has extra information pulled in from associated bibliographic records to aid in matching.

If you are interested, here is the XSLT file and the associated CSS script.  I wasn't sure what the rules are for adding the century to the 008's 6-byte date-entered field, so I just guessed.  If anyone has suggestions on how these records should be formatted, please let me know.
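For what it's worth, here is one way such a guess could be written down, as a short Python sketch rather than XSLT.  The rule it uses (a record can't have been entered in the future, so any date that would land after today is pushed back to the 1900s) is purely an assumption for illustration, not a documented MARC convention and not necessarily what the stylesheet does; the function name is made up as well.

```python
from datetime import date


def expand_date_entered(yymmdd, today=None):
    """Turn the six-character yymmdd date-entered value into a full date.

    Assumed rule: try the 2000s first, and fall back to the 1900s if that
    would put the date in the future.  Only the 1900s and 2000s are handled.
    """
    today = today or date.today()
    yy, mm, dd = int(yymmdd[0:2]), int(yymmdd[2:4]), int(yymmdd[4:6])
    candidate = date(2000 + yy, mm, dd)
    if candidate > today:
        candidate = date(1900 + yy, mm, dd)
    return candidate


# Fixing "today" makes the examples reproducible.
now = date(2005, 8, 15)
print(expand_date_entered("050811", today=now).isoformat())  # 2005-08-11
print(expand_date_entered("871203", today=now).isoformat())  # 1987-12-03
```

A value that is not six digits, or that does not form a valid calendar date, raises `ValueError` here; a display stylesheet would more likely show the raw value instead.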

--Th


UK holdings

Lorcan has had some comments on his Book handles posting from people not finding copies of books in the UK via Open WorldCat.  It is true that WorldCat's coverage is much better in the US than in other areas of the world, but it is interesting to see that some items show quite a few UK holdings, and even a few in Ireland.  You can find this version via Yahoo or Google, but if you want to point to the large print edition, you'll have to use the Find in a Library link (no UK holdings showing for it today).

Looking at the 'related editions' and checking the records, there seems to be some disagreement about who the author of these is (and, for the Cliff Notes version, what the title is), which interfered with the edition linking.  It wouldn't be too hard to get the records consistent and then we'd have a better display.

--Th

Links into Open WorldCat

You can now link into Open WorldCat with ISBNs, ISSNs, and OCLC numbers.  Here's a page explaining how to do it.  So now when you want to link to one of your favorite books, you can use http://worldcatlibraries.org/wcpa/isbn/0679601686 instead of something like http://www.amazon.com/exec/obidos/ASIN/0679601686.  Or even link to a classic edition: http://worldcatlibraries.org/wcpa/oclc/855736.  And, I was pleased to note, the FRBR work we've done bringing together editions of works means that if you click on the 'Other editions of item' link in the Find in a library page, you'll see that the same list comes up for both.

The WorldCat pages seem to come up quicker than Amazon's too.
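If you generate these links from a script, a tiny helper along the following lines might do.  The `isbn` and `oclc` path segments are taken from the example URLs above; the `issn` segment is an assumption by analogy (check the explanatory page for the exact form), and the function name is made up for this sketch.

```python
def open_worldcat_link(kind, number):
    """Build an Open WorldCat link from an ISBN, ISSN, or OCLC number."""
    base = "http://worldcatlibraries.org/wcpa"
    # "isbn" and "oclc" match the example URLs; "issn" is assumed by analogy.
    if kind not in ("isbn", "issn", "oclc"):
        raise ValueError("unsupported identifier type: %r" % (kind,))
    cleaned = str(number).strip()
    if kind == "isbn":
        # ISBNs are often written with hyphens or spaces; drop them.
        cleaned = cleaned.replace("-", "").replace(" ", "")
    return "%s/%s/%s" % (base, kind, cleaned)


print(open_worldcat_link("isbn", "0-679-60168-6"))
# http://worldcatlibraries.org/wcpa/isbn/0679601686
print(open_worldcat_link("oclc", 855736))
# http://worldcatlibraries.org/wcpa/oclc/855736
```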

Update (February 2007): These links still work, but the form is deprecated.  Follow the page link for the latest information.

--Th

Software as art

There's an interesting article in the Communications of the ACM (Software as art by Gregory W. Bond, August 2005, vol. 48 #8, pp. 118-124; you might be able to get to it here) that relates a bit to my recent Tight code post.  I still think it's a bit of a stretch (I think the examples show more craft than art), but I enjoyed reading it because it reminded me of Don Knuth and all the work we did here with both TeX and Metafont.  Knuth was a big proponent of what he calls literate programming.  I heard him talk about it once when he had just started.  He likened it to the goodness of 'sliced bread', but really didn't think sliced bread was anywhere near as good.  After his first experience in being able to rearrange, format, and print out a program the way he wanted, he felt as if this should eliminate bugs, the program was so clear to him.  It didn't work out quite that way, but it does make programs easier to read.

