Melvyl Recommender Project
The FRBR Blog shows an excerpt from the Full Text Extension Supplementary Report of the Melvyl Recommender Project describing their current procedure for deciding whether two items should be considered the same work. They calculate a score based on how well authors, titles, dates, and identifiers match. All in all, their procedure probably does a fairly good job of bringing together similar items, but I've never been a fan of assigning scores and then adding them up. They mention 'twiddling of knobs' to adjust the scores. My experience is that you never finish twiddling, and that changes and additions to the scoring are very difficult to get right.
My preference is to use a decision table. Here's one for the Melvyl matching:
| Column | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Titles | E | E | E | - | P | P | P |
| Authors | P | E | - | E | E | P | P |
| Idents | - | - | E | E | - | - | P |
| Dates | P | - | E | E | P | E | - |
Here's how to use the table. For each of the rows, you decide whether the two records have an Exact match, a Partial match, or no match. These levels are ordered, so a P in the table means that value has to be at least a partial match, and an exact match satisfies it too. The first column then says that if you have an Exact title match, and at least a Partial author and date match, then your records match. The hyphen in the Idents row means that for this column it doesn't matter how well the identifiers match. The last column shows that partial matches on all but dates result in a match, whether dates match or not. Two records match if they satisfy at least one column.
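To make the lookup concrete, here is a minimal Python sketch of the table above. The names (`Match`, `TABLE`, `satisfied_columns`, `same_work`) are made up for illustration and are not taken from the Melvyl report, and the sketch assumes the per-field match levels have already been computed by some other step.

```python
from enum import IntEnum


class Match(IntEnum):
    """Match levels, ordered so that EXACT also satisfies a PARTIAL requirement."""
    NONE = 0
    PARTIAL = 1
    EXACT = 2


# Minimum level each table symbol requires; '-' means "don't care".
REQUIRED = {"-": Match.NONE, "P": Match.PARTIAL, "E": Match.EXACT}

# One string per row of the decision table, one character per column.
TABLE = {
    "titles":  "EEE-PPP",
    "authors": "PE-EEPP",
    "idents":  "--EE--P",
    "dates":   "P-EEPE-",
}


def satisfied_columns(levels):
    """Return the 1-based numbers of every column the given levels satisfy."""
    width = len(next(iter(TABLE.values())))
    hits = []
    for col in range(width):
        if all(levels[field] >= REQUIRED[row[col]] for field, row in TABLE.items()):
            hits.append(col + 1)
    return hits


def same_work(levels):
    """Two records match if they satisfy at least one column."""
    return bool(satisfied_columns(levels))


if __name__ == "__main__":
    example = {
        "titles": Match.EXACT,
        "authors": Match.PARTIAL,
        "idents": Match.NONE,
        "dates": Match.PARTIAL,
    }
    print(same_work(example), satisfied_columns(example))
```

Returning the list of satisfied columns, rather than just a yes or no, is one way to see which rule was responsible for a given match. Changing the matching rules then amounts to editing the strings in `TABLE`.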
We've found this sort of table easier to understand, extend, and modify. The Melvyl scoring fit very nicely into this because each of their criteria had three levels. In our experience that's the right granularity.
--Th
I like the idea of the table, and of three levels of granularity. I also find it easier to understand than adding up scores, but it had never occurred to me to try it out. Much easier to debug problems with a table-based approach, and also easier to explain to people.
Posted by: Martin Haye | October 23, 2006 at 14:53