Here is yet another blog post giving first impressions of a new language and comparing it with one the writer is familiar with. In this case, comparing Google's Go language with Python.
A few of us at OCLC have been using Python fairly extensively for the last decade. In fact, I have the feeling that I used to know it better than I do now, as there has been a steady influx of features into the language, not to mention the move to Python 3.
Go is relatively new, but it caught my eye because some groups have been moving their code from Python to Go. Looking at what they were trying to do in Python, one wonders why they thought Python was a good fit, but maybe you could say the same thing about how we use Python. All of the data processing for VIAF is done in Python, as is much of the batch processing that FRBRizes WorldCat. We routinely push 1.5 billion MARC-ish records through it, processing hundreds or even thousands of gigabytes of data.
We use Python because of the ease and speed of writing and testing. But it's always nicer if things run faster, and Go has a reputation as a reasonably efficient language. The first thing I tried was to write a simple filter that does nothing at all: it just reads from standard input and writes to standard output one line at a time. This is basic to most of our map-reduce processing, and I've had more than one language fail this test (Clojure comes to mind). In Python it's simple and efficient:
import sys

for line in sys.stdin:
    sys.stdout.write(line)
The Python script reads a million authority records (averaging 5,300 bytes each) in just under 10 seconds, or about 100,000 records/second.
Go takes a few more lines, but gets the job done fairly efficiently:
package main

import (
	"bufio"
	"io"
	"os"
)

func main() {
	ifile := bufio.NewReader(os.Stdin)
	for {
		line, err := ifile.ReadBytes('\n')
		// Write before checking for EOF, so a final line with no
		// trailing newline isn't silently dropped.
		os.Stdout.Write(line)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
	}
}
The Go filter takes at least 16 seconds to read the same file, about 62,000 records/second. Not super impressive, and maybe there is a faster way to do it, but it's fast enough that it won't slow down most jobs appreciably.
My sample application was to read in a file of MARC-21 and UNIMARC XML records, parse each into an internal data structure, then write it out as JSON.
I had already done this in Python, but it took some effort to do it in Go. The standard way of parsing XML in Go is very elegant (you annotate structures to show how the XML parser should interpret them), but it turned out both to burn memory and to run very slowly. A more basic approach, processing the elements as a stream that the XML package is glad to send you, was both more similar to how we were doing it in Python and much more efficient in Go. Although there have been numerous JSON interpretations of MARC (and I've done my own), I came up with one more: a simple list (array) of uniform structures (dictionaries in Python) that works well in both Python and Go.
Overall, the Go code was slightly more verbose, mostly because of its insistence on checking return codes (rather than Python's tendency to rely on exceptions), but very readable. Single-threaded Go code turns MARC XML into JSON at about a thousand records/second.
Which sounds pretty good until you do the same thing in Python and it transforms them at about 1,700 records/second. I profiled the Go code (nice profiler) and found that at least 2/3 of the time was in Go's XML parser, so no easy speedups there. Rather than give up, I decided to try out goroutines, Go's rather elegant way of running routines concurrently.
Go's concurrency options seem well thought out. I set up a pool of goroutines to do the XML parsing and managed to get better than a 5x speedup (on a 12-core machine). That would be worthwhile, but we do most of our processing in Hadoop's map-reduce framework, so I tested it there. The task was to read 47 million MARC-21/UNIMARC XML authority records stored in 400 files and write the resultant JSON to 120 files.
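The worker-pool pattern looks something like the sketch below. The transform function here is just a stand-in that wraps each raw record in a one-field JSON object, not the real MARC conversion, and the channel sizes and worker count are arbitrary:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// transform stands in for the XML-to-JSON conversion; here it just
// wraps the raw record in a one-field JSON object.
func transform(rec string) string {
	b, _ := json.Marshal(map[string]string{"raw": rec})
	return string(b)
}

// transformAll fans records out to a pool of goroutines and collects
// the results (output order is not preserved).
func transformAll(recs []string, workers int) []string {
	jobs := make(chan string)
	results := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for rec := range jobs {
				results <- transform(rec)
			}
		}()
	}

	// Close results once every worker is done, so the range below ends.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Feed the jobs channel from its own goroutine to avoid deadlock
	// on the unbuffered channels.
	go func() {
		for _, r := range recs {
			jobs <- r
		}
		close(jobs)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	out := transformAll([]string{"rec1", "rec2", "rec3"}, 4)
	fmt.Println("transformed", len(out), "records")
	// → transformed 3 records
}
```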
Across our Hadoop cluster we typically run a maximum of 195 mappers and 195 reducers (running out of memory on Linux is something to avoid!). The concurrent Go program was able to do the transform of the records in about 7 minutes (at least a couple of minutes of that is pushing the output through a simple reducer), and the machines were very busy, substantially busier than when running the equivalent single-threaded Python code. Somewhat to my surprise, the Python code did the task in 6.5 minutes. Possibly a single-threaded Go program could be sped up a bit, but my conclusion is that Go offers minimal speed advantages over Python for the work we are doing. The fairly easy concurrency is nice, but map-reduce is already providing that for us.
I enjoyed working with it, though. The type safety sometimes got in the way (I especially missed Python dictionaries/maps, which are indifferent to the types of the keys and values), but at other times the type checking caught errors at compile time. The standalone executables are convenient to move around, the profiler was easy to use, and I really liked that it has a standard source file formatter. I didn't try the built-in documentation generator, but it looked simple and useful, as does the testing facility. The libraries aren't as mature as Python's, but they are very good. We never drop down into C for our Python work, but we do depend on libraries written in C, such as the XML parser cElementTree. It would be nice to have a system where everything could be done at the same level (Julia or PyPy?), but right now we're still happy with straight Python and feel that its speed seldom gets in the way.
If nothing else, I learned a bit about Go and came up with a simple JSON MARC format that works quite a bit faster in Python (and Go) than my old one did.
--Th
The machine I used for standalone timings is a dual 6-core 3.1 GHz AMD Opteron box running Linux 2.6.18 (which precluded loading Go 1.3, so I used 1.1). I got similar (but slower) timings with Go 1.3 on my 64-bit quad-core 2.66 GHz Intel PC running Windows 7, so I don't think that using Go 1.3 would have made much of a difference. Both the Go and Python programs were executed as streaming map-reduce jobs across 39 dual quad-core 2.6 GHz AMD machines running Cloudera 4.7.
Take a look at:
https://github.com/vistarmedia/gossamr
http://labs.vistarmedia.com/2014/07/29/hadoop-and-go.html
Can you run your test with this lib again?
Posted by: Gerald | August 04, 2014 at 19:55
Please also use go1.3 for your test!
Posted by: Gerald | August 04, 2014 at 19:56