
FRBR Ranking

There has been an interesting discussion on the Code4Lib mailing list over the last couple of days about how to rank results in a FRBR environment.  I weighed in with the common opinion around here (at least in OR) that the major factor in ranking should be some sort of popularity score.  We typically use the total number of WorldCat holdings for the work, but circulation data could probably be used as well.  Other ranking criteria, such as the number of times a term occurs, are, I claimed, secondary at best.

Shortly after I posted that, we had a visitor who pointed out a weakness in ranking only by library counts.  Diane Vizine-Goetz was demonstrating a soon-to-be-released version of FictionFinder by searching for 'Don Quixote', and the second most highly ranked item was Henry Fielding's History of the Adventures of Joseph Andrews, described as "A Henry Fielding novel written to imitate the action of Cervantes' romantic-heroic character, Don Quixote."

Now obviously the Fielding novel is related to Don Quixote, but it doesn't seem as though it should be second in the list, especially because several other 'works' were listed that look as though they should have been included in the main Don Quixote group but were missed because of title variants (e.g. The Ingenious Gentleman Don Quixote de la Mancha).  It's even conceivable that Joseph Andrews could have come ahead of Don Quixote in the list if it had more library holdings.  (Actually, it isn't even close: 4,866 holdings versus Don Quixote's 40,257.)
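
To make the holdings-only ordering concrete, here is a minimal Python sketch. It is not FictionFinder's code: the `rank_by_holdings` function and the record layout are assumptions made for illustration, and the result set is cut down to the two work-sets whose holdings counts are quoted above.

```python
# Holdings counts are the two figures quoted in the post.
# The dictionary layout and function name are hypothetical.
work_sets = [
    {"title": "The History of the Adventures of Joseph Andrews", "holdings": 4866},
    {"title": "Don Quixote", "holdings": 40257},
]


def rank_by_holdings(matches):
    """Order every matching work-set by library holdings, highest first.

    Note that nothing here looks at where or how the query matched.
    """
    return sorted(matches, key=lambda work: work["holdings"], reverse=True)


ranked = rank_by_holdings(work_sets)
for position, work in enumerate(ranked, start=1):
    print(position, work["title"], work["holdings"])
```

Because the sort key is the holdings count alone, any work-set that matches the query at all competes on equal terms with the work the searcher most likely meant.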

So, I think it is clear that a simple library count isn't the best possible way to rank FRBR work-sets.  What should be done to fix it is less clear.  In the above example the string 'Cervantes, author of Don Quixote' actually appears in the subtitle of many manifestations of Joseph Andrews.  Right now ranking by library holdings is fairly understandable, and in our experience works very well.

--Th

Listed below are links to weblogs that reference FRBR Ranking:

» Hickey on ranking from The FRBR Blog
As I mentioned, there was a thread on the code4lib mailing list about how to rank FRBRized search results. Thom Hickey, of OCLC, posted a few times, and on Thursday he followed up on his blog, with FRBR Ranking. ... [Read More]

Comments

It seems to me that some more direct measure of current interest by the general public would be ideal to include in the formula. That could mean anything from recent search rankings to items that have advanced requests/holds on them. Numbers on items in current circulation get close, but it still seems like there would be a lag before you'd see what's truly of interest to people right now.

I like Mickey's points, particularly about reserve lists. I suspect that "type library" will make a difference here as well. Thom, could you provide some examples of what you consider optimal search results, with explanations of how they work (instructed liturgy, as it were)?

Amazon now provides citation counts for lots of books. Older works tend to have higher accumulated citation counts, and it is possible that combining citation counts with other factors such as circulation, sales rank and library holdings and subjective "review ratings" (both absolute number of reviews and average rating) could give useful ranking data. But ranking isn't enough on its own: all that MARC classification goodness can be used to generate clusters (and a lot "cheaper" than using Latent Semantic Analysis).

You guys are talking about ways to improve a ranking of 'popularity' or 'general interest level' that the OCLC holdings ranking is already intended to do. There might be a way to get a better measure of 'public interest level'.

But the problem Thom identifies isn't about that at all. It's about the fact that the most popular item in the result set isn't necessarily the one that has anything to do with what the user wants. Should a record where the search query happens to match once in a 500 field outrank items where the search query matches in an access field, merely because the first item is more popular? More popular, but probably not as relevant to the user's query. Fine-tuning measurements of popularity by using hold requests isn't going to help with this issue.

But yeah, the key question is identifying what we consider 'optimal' results/ordering of results. Of course, different searchers will consider different things optimal. But we still have faith that some orderings are better than others.

I still think that in a general library catalog, simply measuring popularity and nothing else for rankings isn't going to cut it. WorldCat, on the other hand, as opposed to a single library catalog--I suspect that at the moment, its most frequent use is for known item searching. Simple popularity ranking of one sort or another may very well be the most useful order for a known item search.

Okay, I went and found a concrete example for you. Go and search for 'Orwell' as a keyword search in WorldCat. First hit? "Eats shoots & leaves." Because it's got an abstract/publisher's advert in the record, which just happens to mention Orwell in passing, and it's held by more libraries than any other item with Orwell in the record (including books actually by or about Orwell).

Is "Eats Shoots & Leaves" likely to be what a user searching for "Orwell" is looking for? Only a minority of the book is even about Orwell.

The 2nd and 3rd items in the list are works by Orwell. The 4th is again an item with a tenuous connection to Orwell. The 5th is "Eyewitness to History", which includes one essay (among many dozens) by Orwell--but it's held by lots of libraries. 6-8 are all actually by or primarily about Orwell. 9 is another collection which includes one essay by Orwell among many not.

Is this really putting items most likely to be of interest to the user entering 'Orwell' as a query first? Doubtful.
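
One way to picture the alternative being argued for here is to let the field where the query matched decide the tier, and use holdings only to order records within a tier. The Python sketch below is an illustration under stated assumptions, not how WorldCat ranks: the weights in `FIELD_WEIGHT`, the `rank` function, the two Orwell titles chosen to stand in for "works by Orwell", the note and summary text, and every holdings figure are all made up. Only the MARC tag meanings (100 author, 245 title, 600 personal-name subject, 500 general note, 520 summary) are standard.

```python
# Hypothetical weights per MARC tag: access points outrank notes and summaries.
FIELD_WEIGHT = {
    "100": 3,  # main entry, personal name (author)
    "600": 3,  # subject added entry, personal name
    "245": 2,  # title statement
    "500": 1,  # general note
    "520": 1,  # summary / abstract
}

# Invented records; holdings numbers are placeholders, not WorldCat data.
records = [
    {"title": "Eats, Shoots & Leaves", "holdings": 5000,
     "fields": [("245", "Eats, shoots & leaves"),
                ("520", "A guide to punctuation that mentions Orwell in passing.")]},
    {"title": "Nineteen Eighty-Four", "holdings": 4000,
     "fields": [("100", "Orwell, George"), ("245", "Nineteen eighty-four")]},
    {"title": "Animal Farm", "holdings": 3500,
     "fields": [("100", "Orwell, George"), ("245", "Animal farm")]},
    {"title": "Eyewitness to History", "holdings": 3000,
     "fields": [("245", "Eyewitness to history"),
                ("500", "Includes an essay by George Orwell.")]},
]


def rank(records, term):
    """Sort matches by the best-weighted matching field, then by holdings."""
    term = term.lower()
    scored = []
    for rec in records:
        weights = [FIELD_WEIGHT.get(tag, 1)
                   for tag, text in rec["fields"] if term in text.lower()]
        if weights:  # skip records that do not match at all
            scored.append((max(weights), rec["holdings"], rec["title"]))
    scored.sort(key=lambda s: (-s[0], -s[1]))
    return [title for _, _, title in scored]


print(rank(records, "Orwell"))
```

With these invented numbers, the two records that match in the author field come first, and the records that only mention Orwell in a note or summary follow, still ordered by holdings among themselves. A strict tiering like this is only one option; a blended score would be another.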

This doesn't address popularity, but couldn't the presence and maybe position of the words in the subject heading help to some extent in determining relevance?
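
As a rough sketch of that idea, the Python function below scores a record by whether the search term appears in any subject heading and how early it appears. The function name `subject_position_score`, the 1/(1 + position) formula, and the sample headings are all assumptions made for illustration.

```python
import re


def subject_position_score(term, subject_headings):
    """Return 1.0 if the term is the first word of some heading,
    a smaller value the later it appears, and 0.0 if it is absent."""
    term = term.lower()
    best = 0.0
    for heading in subject_headings:
        words = re.findall(r"\w+", heading.lower())
        if term in words:
            # words.index gives the position of the first occurrence
            best = max(best, 1.0 / (1 + words.index(term)))
    return best


# Illustrative headings, not copied from any particular record.
about_orwell = ["Orwell, George, 1903-1950 -- Criticism and interpretation"]
about_punctuation = ["English language -- Punctuation"]

print(subject_position_score("Orwell", about_orwell))
print(subject_position_score("Orwell", about_punctuation))
```

A score like this could be multiplied into, or used as a tie-breaker alongside, a popularity measure; which combination works best would need testing against real queries.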
