FRBR Ranking
There has been an interesting discussion on the Code4Lib mailing list the last couple of days about how to rank results in a FRBR environment. I weighed in with the common opinion around here (at least in OR) that the major factor in ranking should be some sort of popularity score. We typically use the total number of WorldCat holdings for the work, but it would seem as though circulation data could be used as well. Other ranking criteria, such as the number of times a term occurs, I claimed are secondary at best.
Shortly after posting that, we had a visitor who pointed out a weakness in ranking only by library counts. Diane Vizine-Goetz was demonstrating a soon-to-be-released version of FictionFinder by searching for 'Don Quixote', and the second most highly ranked item was Henry Fielding's History of the Adventures of Joseph Andrews, "A Henry Fielding novel written to imitate the action of Cervantes' romantic-heroic character, Don Quixote."
Now obviously the Fielding novel is related to Don Quixote, but it doesn't seem as though it should be second in the list, especially because there were several other 'works' listed that look as though they should have been included in the main Don Quixote group, but were missed because of title variants (e.g. The Ingenious Gentleman Don Quixote de la Mancha). It's even conceivable that Joseph Andrews could have come ahead of Don Quixote in the list if it had more library holdings. (Actually, it isn't even close, at 4,866 holdings versus Don Quixote's 40,257.)
So, I think it is clear that a simple library count isn't the best possible way to rank FRBR work-sets. What should be done to fix it is less clear. In the above example the string 'Cervantes, author of Don Quixote' actually appears in the subtitle of many manifestations of Joseph Andrews. Right now ranking by library holdings is fairly understandable, and in our experience works very well.
--Th
It seems to me that some more direct measure of current interest by the general public would be ideal to include in the formula. That could mean anything from recent search rankings to items that have advanced requests/holds on them. Numbers on items in current circulation get close, but it still seems like there would be a lag before you'd see what's truly of interest to people right now.
Posted by: Mickey Hawk | April 17, 2006 at 08:58
I like Mickey's points, particularly about reserve lists. I suspect that "type of library" will make a difference here as well. Thom, could you provide some examples of what you consider optimal search results, with explanations of how they work (instructed liturgy, as it were)?
Posted by: K.G. Schneider | April 17, 2006 at 19:06
Amazon now provides citation counts for lots of books. Older works tend to have higher accumulated citation counts, and it is possible that combining citation counts with other factors such as circulation, sales rank and library holdings and subjective "review ratings" (both absolute number of reviews and average rating) could give useful ranking data. But ranking isn't enough on its own: all that MARC classification goodness can be used to generate clusters (and a lot "cheaper" than using Latent Semantic Analysis).
Posted by: Kent Fitch | April 26, 2006 at 19:45
You guys are talking about ways to improve a ranking of 'popularity' or 'general interest level' that the OCLC holdings ranking is already intended to do. There might be a way to get a better measure of 'public interest level'.
But the problem Thom identifies isn't about that at all. It's about the fact that the most popular item in the result set isn't necessarily the one that has anything to do with what the user wants. Should a record where the search query happens to match once in a 500 field outrank items where the search query matches in an access field, merely because the first item is more popular? More popular, but probably not as relevant to the user's query. Fine-tuning measurements of popularity by using hold requests isn't going to help this issue.
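To make the 500-field question concrete, here is a minimal sketch of one possible approach: weight each match by the field it occurs in, then scale by a damped holdings count. Everything in it is an assumption made for illustration: the field names, the weights, the log damping, and the two made-up records. It is not how WorldCat or FictionFinder actually ranks.

```python
import math

# Hypothetical weights: a match in an access point (title, author, subject)
# counts far more than a match in a general note (MARC 500).
FIELD_WEIGHTS = {"title": 10.0, "author": 10.0, "subject": 6.0, "note": 1.0}


def score(record, query):
    """Field-weighted match score, scaled by a damped holdings count."""
    q = query.lower()
    relevance = sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if q in record.get(field, "").lower()
    )
    # log10 damps popularity so that it cannot swamp the field weights.
    return relevance * math.log10(1 + record.get("holdings", 0))


def rank(records, query):
    """Return the records that match the query, best score first."""
    matches = [r for r in records if score(r, query) > 0]
    return sorted(matches, key=lambda r: score(r, query), reverse=True)


# Invented records, not real WorldCat data.
records = [
    {"title": "Popular style guide",
     "note": "Mentions Orwell in passing.", "holdings": 5000},
    {"title": "Collected essays",
     "author": "Orwell, George", "holdings": 1500},
]

for r in rank(records, "orwell"):
    print(r["title"], round(score(r, "orwell"), 2))
```

With these invented numbers the author match comes first even though the other record has more holdings. Sorting by holdings alone would reverse the order.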
But yeah, the key question is identifying what we consider 'optimal' results/ordering of results. Of course, different searchers will consider different things optimal. But we still have faith that some orderings are better than others.
I still think in a general library catalog, simply measuring popularity and nothing else for rankings isn't going to cut it. WorldCat, on the other hand, as opposed to a single library catalog--I suspect that at the moment, its most frequent use is for known item searching. Simple popularity ranking of one sort or another may very well be the most useful order for a known item search.
Posted by: Jonathan | April 30, 2006 at 17:09
Okay, I went and found a concrete example for you. Go and search for 'Orwell' as a keyword search in WorldCat. First hit? "Eats shoots & leaves." Because it's got an abstract/publisher's advert in the record, which just happens to mention Orwell in passing, and it's held by more libraries than any other item with orwell in the record (including books actually by or about Orwell).
Is "Eats Shoots & Leaves" likely to be what a user searching for "Orwell" is looking for? Only a minority of the book is even about Orwell.
The 2nd and 3rd items in the list are works by Orwell. The 4th is again an item with a tenuous connection to Orwell. The 5th is "Eyewitness to History", which includes one essay (among many dozens) by Orwell--but it's held by lots of libraries. 6-8 are all actually by or primarily about Orwell. 9 is another collection which includes one essay by Orwell among many not.
Is this really putting items most likely to be of interest to the user entering 'Orwell' as a query first? Doubtful.
Posted by: Jonathan | April 30, 2006 at 17:19
This doesn't address popularity, but couldn't the presence and maybe position of the words in the subject heading help to some extent in determining relevance?
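A rough sketch of what that might look like, reading "position" as how early the query word appears within a subject heading. That reading, the 1/position scoring, the single-word query, and the example headings are all assumptions made for illustration.

```python
import re


def subject_score(subject_headings, query):
    """Score a single-word query against a record's subject headings.

    Assumed scheme: a hit on the first word of a heading scores 1.0,
    the second word 1/2, the third 1/3, and so on. Only the best
    heading counts, and no match at all scores 0.0.
    """
    q = query.lower()
    best = 0.0
    for heading in subject_headings:
        words = re.findall(r"\w+", heading.lower())
        if q in words:
            best = max(best, 1.0 / (words.index(q) + 1))
    return best


# Illustrative headings only.
print(subject_score(["Orwell, George, 1903-1950 -- Criticism and interpretation"], "Orwell"))
print(subject_score(["English language -- Punctuation"], "Orwell"))
```

A score like this could be one input among several, alongside holdings, and is not meant as a ranking on its own.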
Posted by: Helen Anderson | June 09, 2006 at 10:58