At ALA I participated in the PCC Participants Meeting, giving my thoughts on a discussion paper about what to do with the undifferentiated personal name headings in the LC/NAF authority file. This made me think a bit about how we see authority identifiers given our experience with VIAF and WorldCat Identities.
Our experience includes:
- Ingesting 30 million authority records from two dozen different authority files
- Aligning them with 120 million bibliographic records
- Matching, deduplicating and disambiguating the 30 million authorities
- Linking to other descriptions of the entities on the Web
- Extracting and trying to make sense of the 40 million different names found in the nearly 300 million bibliographic records in WorldCat
Some simple observations:
- Matching is easier and more reliable when a stable identifier is involved
- There is a trend towards multiple 'established forms', esp. across scripts. The use of non-Latin scripts is especially important when trying to differentiate Chinese and some other Asian names
- Many files make no attempt to make the established forms unique across the whole file, much less the alternate forms
- A string with differentiating hints is useful for selecting/identifying an entity
- That string will almost inevitably be less stable than the ID assigned to the record. Cool URIs is another reflection of this
- The identifiers should have a URI representation that is actionable (linked data)
- In a linked world, knowingly grouping multiple people together under one identifier because their names are similar is hard to justify
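To make the contrast between name strings and identifiers concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `Authority` class, the `id-00x` identifiers, the sample names and hints, and the `https://example.org/authority/` base URI are assumptions, not anything taken from VIAF or LC/NAF. It only shows that grouping on the established form merges people who share a name, while grouping on an identifier keeps them apart.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Authority:
    ident: str        # hypothetical opaque identifier
    established: str  # established form of the name
    hint: str         # differentiating hint, e.g. an occupation


# Invented sample data: two different people share one established form.
records = [
    Authority("id-001", "Smith, John", "Pianist"),
    Authority("id-002", "Smith, John", "Author of A History of Bees"),
    Authority("id-003", "Lee, Ann", "Cartographer"),
]


def group_by(items, key):
    """Group records under whatever key the caller chooses."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


def display_label(record):
    """A string with a differentiating hint, for humans to pick from."""
    return f"{record.established} ({record.hint})"


def uri(record, base="https://example.org/authority/"):
    """An actionable URI built from the identifier (base is a placeholder)."""
    return base + record.ident


by_name = group_by(records, lambda r: r.established)
by_id = group_by(records, lambda r: r.ident)

print(len(by_name), "groups by name;", len(by_id), "groups by identifier")
for r in records:
    print(uri(r), "->", display_label(r))
```

Grouping by name yields an 'undifferentiated' bucket holding both John Smiths; grouping by identifier does not, and the label with its hint remains available for display.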
If the issue with undifferentiated names is truly the difficulty in constructing a unique string for a name under current rules, my suggestions are straightforward:
1. Relax the rules (or their interpretation) to allow easier construction of unique strings (e.g. Pianist, Author of ...)
2. Move to relying on identifiers, not the established form. This is the equivalent of preferring 'opaque' identifiers over identifiers with semantic baggage. The semantics change or the baggage changes, and your identifier is not as stable as you would wish
3. Given that 2) is easier said than done, go with 1) for now
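The stability argument behind relying on identifiers can be sketched in a few lines of Python. The record numbers, identifiers, and headings below are invented for illustration, and the lookup functions are hypothetical helpers rather than any real catalog API. The point is only that a link stored as an identifier survives a change to the established form, while a link stored as the heading string does not.

```python
# Hypothetical authority file: identifier -> current established form.
headings = {"id-001": "Smith, John"}

# The same bibliographic record linked two ways (both invented).
links_by_id = {"bib-42": "id-001"}
links_by_string = {"bib-42": "Smith, John"}


def resolve_by_id(bib):
    """Follow the identifier to whatever the heading currently is."""
    return headings.get(links_by_id[bib])


def resolve_by_string(bib):
    """Return the stored heading only if it still exists in the file."""
    target = links_by_string[bib]
    return target if target in headings.values() else None


# The heading is later differentiated by adding a qualifier.
headings["id-001"] = "Smith, John (Pianist)"

print(resolve_by_id("bib-42"))      # follows the identifier to the new form
print(resolve_by_string("bib-42"))  # the old string no longer matches
```

In practice a string-based link would need to be rewritten in every record that uses it each time the heading changes, which is part of why 2) is the better long-term direction even if 1) is the practical first step.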
My impression from the discussion afterwards is that PCC will probably decide the 60,000 undifferentiated names should be differentiated. The details on how that might be accomplished are much less clear. There were multiple views on several of the issues involved.
--Th