I mentioned the ETDMS (Electronic Theses and Dissertations Metadata Standard) in my last post. Although here at OCLC we've been involved with it since the beginning, I was never optimistic about it getting widespread support.
Luckily I was wrong about that. For the 185,000 ETDs we harvest, we're finding over 130,000 of the records have an ETDMS version. The main thing that ETDMS gives us over simple Dublin Core is more information about the degree being granted, such as the name of the degree, the discipline, and the grantor. Since the NDLTD is looking into coming up with some standard ways of encoding those fields, I thought it would be interesting to run some quick statistics on what's actually in the fields now.
Tom Dehn extracted the germane fields from the ETDMS records we have, and I wrote a little Python script to categorize the degree Grantor, Name, and Discipline fields. I simply took the text of each field, applied simple NACO normalization to it, and sorted the results by count.
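The script itself isn't reproduced here, but the approach can be sketched roughly as follows. This is a minimal approximation, not the actual script: `normalize` only imitates a few NACO rules (case folding, dropping diacritics and apostrophes, turning other punctuation into spaces), and both function names are made up for illustration.

```python
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Rough, NACO-like normalization (an approximation, not the full rules)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.lower()
    # Apostrophes are deleted outright rather than replaced.
    lowered = lowered.replace("'", "")
    # Anything that is not an ASCII letter or digit becomes a single space.
    lowered = re.sub(r"[^a-z0-9]+", " ", lowered)
    return lowered.strip()


def tally(values):
    """Count normalized field values, most frequent first."""
    normalized = (normalize(v) for v in values if v)
    counts = Counter(n for n in normalized if n)
    return counts.most_common()


# Hypothetical field values, just to show the effect of normalization.
print(tally(["Ph.D.", "PhD", "phd", "M.S."]))
# → [('phd', 2), ('ph d', 1), ('m s', 1)]
```

Note that normalization alone does not merge 'Ph.D.' with 'PhD': the period becomes a space, so the two end up as different strings.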
Degree Names
Of the 130,598 ETDMS records, 83,705 had a degree name, with 254 distinct names among them.
Here are the top degree names:
| Count | Text |
|---|---|
| 55,080 | thesis |
| 6,102 | phd |
| 5,682 | master |
| 4,234 | ms |
| 3,062 | mestre |
| 1,242 | doutor |
| 969 | ph d |
Obviously there are some problems here. Apart from the nonsensical 'thesis' as a degree name, this top list alone has multiple ways to enter Ph.D. and master's, and I assure you it doesn't look any prettier farther down the list. Of course, within a particular record these strings make sense. It's only when they get aggregated, or when we start to think about how to search for them, that the problems emerge. This, of course, is the metadata problem. At least with ETDMS records we can do something about it.
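To give a feel for what "doing something about it" might look like, here is a small sketch that folds the variants from the table above into common labels. The lookup table and the labels 'doctoral' and 'masters' are purely hypothetical (no such list has been agreed on), and grouping by degree level is just one possible choice.

```python
# Counts copied from the top degree names table above.
top_degree_names = {
    "thesis": 55080,
    "phd": 6102,
    "master": 5682,
    "ms": 4234,
    "mestre": 3062,
    "doutor": 1242,
    "ph d": 969,
}

# Hypothetical lookup: normalized variant -> illustrative canonical label.
CANONICAL_LABEL = {
    "phd": "doctoral",
    "ph d": "doctoral",
    "doutor": "doctoral",
    "master": "masters",
    "ms": "masters",
    "mestre": "masters",
}


def merge_variants(counts, lookup=CANONICAL_LABEL):
    """Sum counts per canonical label; return (merged, unmapped)."""
    merged = {}
    unmapped = {}
    for text, n in counts.items():
        label = lookup.get(text)
        if label is None:
            # Leave unknown strings alone so they can be reviewed by hand.
            unmapped[text] = n
        else:
            merged[label] = merged.get(label, 0) + n
    return merged, unmapped


merged, unmapped = merge_variants(top_degree_names)
print(merged)    # → {'doctoral': 8313, 'masters': 12978}
print(unmapped)  # → {'thesis': 55080}
```

Strings like 'thesis' stay unmapped, which is deliberate: they carry no degree information, so no lookup table can rescue them.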
Degree Disciplines
I found 969 different Degree Disciplines in 24,108 records. Here are the top six:
| Count | Discipline |
|---|---|
| 1,041 | mechanical engineering |
| 976 | electrical engineering |
| 861 | electrical and computer engineering |
| 825 | chemistry |
| 537 | physics |
| 524 | civil engineering |
Again, this shows some problems, in that a simple search for the phrase 'electrical engineering' is going to miss the 'electrical and computer engineering' theses. Just for fun, I took all 969 disciplines and looked for 'comput' in them:
| Count | Discipline |
|---|---|
| 861 | computer science |
| 214 | school of computer science |
| 69 | computing |
| 24 | computer networking |
| 20 | computation and neural systems |
| 3 | departament denginyeria i cincia dels computadors |
| 1 | genetics bioinformatics and computational biology |
| 1 | electrical and computer engineering ecpe |
| 1 | department of computer science queen mary university of london |
| 1 | computerlinguistik universitaet duisburg essen |
| 1 | computational mathematics |
| 1 | applied computational mathematics |
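The difference between a phrase search and a fragment search like the 'comput' one above can be sketched in a few lines. The helper name `filter_counts` is made up for this example, and the counts are the top six disciplines from the earlier table, not the full set of 969.

```python
# Counts copied from the top six disciplines table above.
top_disciplines = {
    "mechanical engineering": 1041,
    "electrical engineering": 976,
    "electrical and computer engineering": 861,
    "chemistry": 825,
    "physics": 537,
    "civil engineering": 524,
}


def filter_counts(counts, fragment):
    """Return (text, count) pairs whose text contains fragment, largest first."""
    hits = [(text, n) for text, n in counts.items() if fragment in text]
    # Sort by descending count, then alphabetically to break ties.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# The exact phrase misses 'electrical and computer engineering'.
print(filter_counts(top_disciplines, "electrical engineering"))
# → [('electrical engineering', 976)]

# A word fragment casts a wider net.
print(filter_counts(top_disciplines, "comput"))
# → [('electrical and computer engineering', 861)]
```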
The 'comput' results are not really too bad. One conclusion I've come to is that here at OCLC we need to put more effort into our MARC21 to ETDMS conversion. Some of the unusual data in these tables is coming from converted WorldCat records. We did that conversion early on and haven't revisited it. I think we can do a better job.
--Th
Acknowledgment: The photo at the top of the posting was taken by Justin Hickey.
I believe there is a language problem. The names of degrees as well as disciplines are written in the local languages of the institutions.
Since the element thesis.degree in ETD-ms is repeatable, maybe it should be recommended that the degrees at least be written in English in addition to the original language. The desirable situation would be to have all the information in this element in both languages.
Posted by: Ana Pavani | April 13, 2005 at 12:55
Thanks for the acknowledgment!
Posted by: Justin | April 13, 2005 at 23:52
Even if we encourage an English version (in addition to the local language) of the level and degree, there will still be quite a bit of variation. I think we need to agree on a standard list to choose from. In lieu of that, we might be able to do some clean-up in the union catalog.
--Th
Posted by: Thom | April 14, 2005 at 12:59