We have had an on-again off-again interest here in trying to characterize databases. Ralph LeVan once had a project called Automatic Collection Description based on centroids found when indexing databases. A few years ago I did some work looking at which OCLC members had the most holdings for headings in WorldCat (and conversely what each OCLC member library had the strongest collections in by heading). For want of a better term I called this work Centers, as in Centers of Excellence. It looked interesting, but we never pursued it very far.
Recently, however, we have been doing a lot of work organizing WorldCat. We are doing automatic categorization into at least the WEM in FRBR WEMI (Works, Expressions, Manifestations and Items), controlling names to VIAF and adding FAST subject headings when we can. While we do much of this work on a separate copy of WorldCat, some of the data is starting to become visible at http://www.worldcat.org in the Linked Data section of the pages. As an add-on to our work reimplementing WorldCat Identities, I revived the old Centers code, and found that the world had changed more than I realized.
WorldCat is much larger (approaching 300 million records with 2 billion holdings), but new hardware and software (i.e. Hadoop) and a better organized WorldCat (works, VIAF and FAST IDs) make the whole process much easier and simpler.
But even more critically, we now have staff that has been able to take rather raw data, see possibilities in it and relate it to other work (see Constance Malpas's blog post on Concentration, Diffusion, Centers & Flows) and make suggestions.
Our latest attempt is a work-based categorization of headings and holdings of OCLC libraries (actually we are doing the work using OCLC symbols, which are sometimes more closely aligned to collections than libraries).
This is new work, not even formally a project yet, but I thought it might be of interest.
Here is how we did the latest calculation we are tentatively calling coverage, as in who has the best coverage of a topic:
- Start with OCLC Research's copy of WorldCat that has been enhanced with VIAF and FAST IDs
- For each holding attached to each record write out each controlled heading (all 4 billion!) along with its Work ID
- Sort those by heading ID
- For each heading find which works are associated with which symbols
- Write out the list of symbols that have the most works associated with each heading
That gives us the list of OCLC institution symbols that have the best coverage of each of the tens of millions of different headings in FAST and VIAF. It is then fairly straightforward to invert that to show what headings each symbol has the best coverage of.
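To make the steps above concrete, here is a minimal in-memory sketch of the same shape of calculation. It is not our production code: the real job runs as Hadoop streaming map-reduce over billions of heading/holding pairs, while this version uses a handful of made-up holdings. The function names (`map_holdings`, `reduce_by_heading`, `invert`), the tuple layout of the input, and the choice to invert only the truncated top lists are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical toy input: one (symbol, work_id, heading_ids) tuple per holding.
HOLDINGS = [
    ("OLA", "w1", ["fst01009447", "fst01175820"]),
    ("OLA", "w2", ["fst01009447"]),
    ("DLC", "w1", ["fst01009447", "fst01175820"]),
    ("DLC", "w2", ["fst01009447"]),
    ("DLC", "w3", ["fst01009447"]),
    ("DLC", "w3", ["fst01009447"]),  # a second holding of the same work
    ("CUS", "w4", ["fst01175820"]),
]

def map_holdings(holdings):
    """Map step: emit (heading, work, symbol) for each heading on each holding."""
    for symbol, work_id, headings in holdings:
        for heading in headings:
            yield heading, work_id, symbol

def reduce_by_heading(triples, top_n=25):
    """Sort by heading, then rank symbols by the share of the heading's works they hold."""
    coverage = {}
    for heading, group in groupby(sorted(triples), key=itemgetter(0)):
        works_by_symbol = defaultdict(set)
        all_works = set()
        for _, work_id, symbol in group:
            works_by_symbol[symbol].add(work_id)  # sets count each work once
            all_works.add(work_id)
        ranked = sorted(
            ((sym, len(works) / len(all_works))
             for sym, works in works_by_symbol.items()),
            key=lambda pair: (-pair[1], pair[0]),  # best coverage first
        )
        coverage[heading] = ranked[:top_n]
    return coverage

def invert(coverage):
    """Turn heading -> [(symbol, share)] into symbol -> [(heading, share)]."""
    by_symbol = defaultdict(list)
    for heading, ranked in coverage.items():
        for symbol, share in ranked:
            by_symbol[symbol].append((heading, share))
    for entries in by_symbol.values():
        entries.sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(by_symbol)

coverage = reduce_by_heading(map_holdings(HOLDINGS))
by_symbol = invert(coverage)
```

In this toy data DLC holds all three works under fst01009447 and so comes first for that heading, while OLA holds two of the three. Note that the denominator here is the set of works that appear in the holdings stream for the heading, which is an assumption about how total works is counted.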
All written in a couple of hundred lines of Python running in Hadoop streaming map-reduce, taking a little more than an hour to complete on a quiet machine. Processing this amount of data (think about how many holdings from how many institutions are associated with Fiction) can be a little tricky, but all-in-all easy enough to write. However, it took years of low-level interest along with years of work enhancing WorldCat to get to the point where we found we could and should do it!
Here are a couple of examples. First, the beginning of the FAST heading Marine biology (http://id.worldcat.org/fast/1009447) showing the top 25 OCLC symbols with the best work-level coverage:
fst01009447 Marine biology total works=9,538
DLC 17.2% LIBRARY OF CONGRESS
CUS 17.0% UNIV OF CALIFORNIA, SAN DIEGO
HATHI 16.3% HATHITRUST DIGITAL LIBR
WAU 14.2% UNIV OF WASHINGTON LIBR
CDLER 13.0% UC MASS DIGITIZATION
HUH 11.2% UNIV OF HAWAII AT MANOA LIBR
CUY 9.8% UNIV OF CALIFORNIA, BERKELEY
OLA 9.8% NATIONAL OCEANIC & ATMOSPHERIC ADMIN
SMI 9.4% SMITHSONIAN INSTITUTION
TXA 9.1% TEXAS A&M UNIV
STF 9.0% STANFORD UNIV LIBR
RIN 9.0% UNIV OF RHODE ISLAND, NARRAGANSETT
MBW 8.8% MARINE BIOLOGICAL LAB/WOODS HOLE OCEANOG
ORE 8.5% OREGON STATE UNIV, CORVALLIS
YUS 8.5% YALE UNIV LIBR
COO 8.3% CORNELL UNIV
CSL 8.2% UNIV OF SOUTHERN CALIFORNIA
UIU 8.1% UNIV OF ILLINOIS
VIM 7.9% VIRGINIA INST OF MARINE SCI
PAY 7.9% UNIV OF TEXAS, MARINE SCI LIBR
YAM 7.9% AMERICAN MUS OF NATURAL HIST
LGG 7.3% MCGILL UNIV
NDD 7.1% DUKE UNIV LIBR
EYM 7.0% UNIV OF MICHIGAN LIBR
HMZ 6.9% HARVARD UNIV, ERNST MAYR LIBR-MCZ
And from a library point of view, the top headings found at OLA (National Oceanic & Atmospheric Administration), showing both FAST and VIAF headings:
OLA NATIONAL OCEANIC & ATMOSPHERIC ADMIN
fst00590006 90.0% National Sea Grant Program (U.S.) count=1,150
fst00529308 54.5% United States.--National Oceanic and Atmospheric Administration count=588
fst01043561 31.3% Ocean temperature count=1,067
fst00820520 30.6% Atmospheric temperature count=1,124
fst00853069 29.8% Chemical oceanography count=681
fst01009724 27.9% Marine meteorology count=659
fst01074923 23.7% Precipitation (Meteorology) count=1,107
fst01043612 21.7% Ocean-atmosphere interaction count=709
fst01043704 20.6% Oceanography--Research count=1,050
fst01239980 20.4% Gulf of Mexico count=1,573
fst00865756 19.8% Coastal zone management count=2,961
fst01240719 19.7% United States--Atlantic Coast count=700
fst01423822 18.8% Observations count=4,502
fst01175820 18.8% Winds count=980
fst01009907 18.7% Marine resources conservation count=705
fst01104000 17.4% Salinity count=700
fst01173142 16.8% Weather forecasting count=918
fst01242497 16.8% North Pacific Ocean--w1700000--n0300000 count=853
fst01043489 16.5% Ocean currents count=672
fst01043671 15.6% Oceanography count=3,223
fst00940371 15.2% Geodesy count=934
fst01018441 15.2% Meteorology count=3,779
fst01310474 13.9% Atlantic Ocean--Chesapeake Bay--w0760251--n0370009 count=615
fst01242477 13.6% North Atlantic Ocean--w0400000--n0400000 count=985
fst00865723 13.1% Coastal ecology count=642
fst00864281 12.7% Climatology count=3,740
fst01089410 11.2% Rain and rainfall count=1,073
fst00964360 11.2% Hurricanes count=597
fst01009826 11.1% Marine pollution count=791
fst00926335 10.1% Fishery resources count=588
fst01009447 9.8% Marine biology count=931
fst01009513 9.7% Marine ecology count=730
fst00926228 9.0% Fishery management count=1,777
fst01240722 8.5% Atlantic Ocean--w0250000--n0100000 count=660
fst01243528 8.5% Pacific Ocean--w1321115--n0033048 count=1,250
fst00864229 6.5% Climatic changes count=1,041
fst01240563 5.9% Great Lakes count=593
fst00926051 5.3% Fisheries count=1,974
fst01239992 4.9% Antarctica count=884
fst01423732 4.2% Charts, diagrams, etc. count=625
fst00926361 3.3% Fishes count=774
VIAF|145220374 70.8% National Ocean Survey. Office of Coastal Zone Management. count=613
VIAF|148953552 69.1% United States. Office of Ocean and Coastal Resource Management. count=418
VIAF|131207352 62.2% Environmental Research Laboratories (U.S.) count=653
VIAF|139539362 42.3% United States. National Weather Service. count=529
VIAF|138297002 38.4% United States National Oceanic and Atmospheric Administration count=3,358
VIAF|123679918 31.0% United States. National Marine Fisheries Service. count=1,356
VIAF|137910886 26.6% Intergovernmental Oceanographic Commission. count=541
VIAF|125202651 21.0% Scripps Institution of Oceanography. count=381
VIAF|123076486 20.9% United States. Weather Bureau. count=1,063
VIAF|149549513 18.9% U.S. Coast and Geodetic Survey. count=1,459
VIAF|141399695 17.6% World Meteorological Organization. count=1,105
VIAF|154122553 15.2% United States Coast Survey. count=445
VIAF|153533725 15.2% United States. National Ocean Service. count=666
VIAF|129608645 14.2% Cold Regions Research and Engineering Laboratory (U.S.) count=478
VIAF|147081574 13.5% American Meteorological Society. count=421
VIAF|139421251 10.7% Woods Hole Oceanographic Institution. count=485
VIAF|139500419 5.1% United States. Office of Naval Research. count=496
VIAF|157704448 2.8% United States. Army. Corps of Engineers. count=857
The percentages refer to coverage in terms of works. For instance, OLA has at least one holding in 18.8% of the different WorldCat works that have the FAST heading for Winds in them, according to our latest FRBR work-set algorithm.
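As a quick check on how the two listings relate, the Marine biology figures can be worked through by hand. This assumes that count in the by-library view is the number of works OLA holds under the heading, which the listing itself does not label; the variable names are just for illustration.

```python
# Figures taken from the two listings above.
total_works = 9538  # "total works" for fst01009447 Marine biology
ola_works = 931     # "count" for Marine biology in the OLA view (assumed: works held)

# Coverage = works held by the symbol / all works with the heading.
coverage_pct = round(100 * ola_works / total_works, 1)
print(f"{coverage_pct}%")  # prints 9.8%
```

That matches the 9.8% shown for OLA in both views, so under this reading the two listings are consistent with each other.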
--Th
Updated 2013.05.07 The 'by library' view had incorrect percentages, which also affected what headings were shown.