Someone reading the last post about UNIMARC might wonder why we aren't just doing everything in XML. Ignoring the problem of transforming the UNIMARC into XML (I'm sure it's been done), my group has tended to avoid using XML for most of our bibliographic processing. One reason is space. Standard MARC21/slim XML is 3x the size of MARC Communications Format (MCF) records. A copy of WorldCat would go from under 60 gigabytes to nearly 180 gigabytes. An even more compelling reason is speed. We do a lot in Python using the standard Python tools, and processing XML took about 6x as long as our routines that parse the same records in MCF.
So I was interested to notice that Python 2.5 (currently in beta-3) includes the core of ElementTree. Here's what effbot.org says about ElementTree:
The Element type is a simple but flexible container object, designed to store hierarchical data structures, such as simplified XML infosets, in memory. The element type can be described as a cross between a Python list and a Python dictionary.
Even more interesting is that the Python distribution includes a C implementation of it (cElementTree). Would that be fast enough to make XML worth the space problems? What if we just compressed our XML files? XML files are nearly 3x the size of MCF, but compressed they are about 3x smaller than MCF. I timed three possibilities, each reading in 1.4 million bibliographic records and extracting the title from each record. Here are the results, using cElementTree to parse the XML and our standard Python routines for MCF:
| Format | Size in bytes | Run time (seconds) | Recs/sec |
|---|---|---|---|
| MCF | 1,372,366,461 | 455 | 3,081 |
| XML | 3,617,800,493 | 302 | 4,641 |
| Zipped XML | 401,668,287 | 343 | 4,087 |
cElementTree has a 10x speed advantage over ElementTree and supports the iterparse method, which lets you step through the file without reading it all in, in a sort of SAX-like way. In addition, ElementTree seems to be a more Pythonic way of manipulating XML than the standard SAX and DOM approaches.
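For anyone who hasn't seen iterparse, here is a rough sketch of the title-extraction loop. It is not the code I timed: it assumes a current Python, where `xml.etree.ElementTree` uses the C accelerator on its own, and it assumes the title is subfield `a` of field 245 in MARC21/slim XML. The `iter_titles` name and the sample records are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# MARC21/slim namespace; ElementTree spells qualified tags as {uri}name.
MARC_NS = "http://www.loc.gov/MARC21/slim"
RECORD = "{%s}record" % MARC_NS
DATAFIELD = "{%s}datafield" % MARC_NS
SUBFIELD = "{%s}subfield" % MARC_NS


def iter_titles(source):
    """Yield the 245$a title (or None) for each record, one record at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != RECORD:
            continue
        title = None
        for field in elem.findall(DATAFIELD):
            if field.get("tag") == "245":
                for sub in field.findall(SUBFIELD):
                    if sub.get("code") == "a":
                        title = sub.text
                        break
                break
        yield title
        # Drop the record's children so memory stays roughly flat.
        elem.clear()


# Hypothetical two-record sample, standing in for a real file.
SAMPLE = b"""<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">First title</subfield></datafield>
</record>
<record>
  <datafield tag="100" ind1="1" ind2=" "><subfield code="a">Some author</subfield></datafield>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">Second title</subfield></datafield>
</record>
</collection>"""

titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

The `elem.clear()` call is what keeps this SAX-like in memory terms; the emptied record elements still hang off the root, which is usually tolerable but worth knowing about for very large files.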
It looks to me as though cElementTree is a winner, especially if we go to the trouble of compressing our larger files. Now if only it supported full XPath...
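Reading the compressed files doesn't require unpacking them first. A minimal sketch, assuming the files are gzip-compressed (the compression tool is an assumption here) and a current Python; `count_records` and the generated sample data are hypothetical:

```python
import gzip
import io
import xml.etree.ElementTree as ET

RECORD_TAG = "{http://www.loc.gov/MARC21/slim}record"


def count_records(gz_source):
    """Count <record> elements in gzip-compressed MARCXML.

    gz_source may be a filename or a binary file object; the data is
    decompressed as it is parsed, so nothing is unpacked to disk.
    """
    count = 0
    with gzip.open(gz_source, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == RECORD_TAG:
                count += 1
                elem.clear()  # discard the record once it has been counted
    return count


# Made-up, highly repetitive sample data, compressed in memory.
xml_bytes = (
    b'<collection xmlns="http://www.loc.gov/MARC21/slim">'
    + b"<record><leader>placeholder</leader></record>" * 1000
    + b"</collection>"
)
gz_bytes = gzip.compress(xml_bytes)

record_count = count_records(io.BytesIO(gz_bytes))
print(record_count, len(xml_bytes), len(gz_bytes))
```

The sample compresses far better than real bibliographic records would, so don't read anything into its ratio.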
--Th
Thom,
Check out lxml -- it's a wrapper around libxml2/libxslt and as such it supports full XPath, XSLT, etc., while remaining as true to the ElementTree API as is feasible.
http://codespeak.net/lxml/
The lxml folks did some benchmarking and found it to be at least as fast as cElementTree in most circumstances (full details should be on the lxml site)
-- Azaroth
Posted by: Rob Sanderson | August 17, 2006 at 07:36
About XML and UNIMARC, see BiblioML at
http://90plan.ovh.net/~adnx/biblioml/doku.php?id=en:introduction
Posted by: Nicolas | August 22, 2006 at 03:45
Hi Thom,
It was great meeting you last week in Bloomington. I just checked your blog, and found this gem. At Knewco, we use Python for most of our stuff, and Zope for our web applications. For XML parsing we also use cElementTree (for Medline XML), and are pleasantly surprised by the speed. My previous experience was Perl/expat, but Python/cElementTree has decreased both development and processing time.
Posted by: Marc Weeber | September 04, 2006 at 15:31
FWIW, Thom:
I ran a quick benchmark for my input XML file
[size:4303244746Bytes::uncompressed]
[numberOfRecords:1691425]
Benchmark:
[Seconds:566.2::RecordsPerSec:2988]
Environment:
[Ruby,libxml::Full XPath support] on my [MacBookPro:2.4 GHz Intel Core 2 Duo::RAM:4GB]
[Development time:marginal]
I would imagine that dealing with compressed XML would help increase the speed. However, I've thrown 200 XPath expressions at the XML file and have seen a marked degradation in performance (up to 25 minutes for the run). While Ruby typically gets bashed for its speed - and justifiably so in a few contexts - it excels at being able to gather linked data from multiple sources, an activity I see myself increasingly doing, to enrich the database...
Posted by: Shailen Karur | November 14, 2008 at 17:15