Small Python, Java comparison

Fri Jul 11 16:03:01 EDT 2003

Below is some information I collected from a *small* project in which I wrote 
a Python version of a Java application. I share this info only as a data 
point (rather than trying to say this data "proves" something) to consider 
the next time the "Python makes developers more productive" thread starts up 
again.

Background
==========
An employee who left our company had written a log processor we use to read 
records from text files (1 record per line), do a bunch of computations on 
them, and load them into various tables in our database. Typically each day 
there are 1000 to 2000 files with a total number of records being around 100 
million. The nature of the processor's work (some of it is summarizing 
data) results in about 1 database insert or update for every 20 or so "raw" 
log records.

After this employee left I inherited this chunk of code and the first thing I 
did was create a fairly comprehensive test suite for it (in Python) because 
there was no test suite and in the past we had had many problems with bugs 
and flakiness in the code. As I did this I got pretty familiar with how the 
code worked and noticed lots of places where the language seemed to "get in 
the way" of what needed to be done. During this time I also began hearing 
rumblings about how this processor would need to be extended to add more 
functionality, so I thought I'd start rewriting it in Python on the side just 
to see how it would turn out (the idea being that if I got it done and it was 
good enough then I could simply implement the new functionality in the Python 
version :) ).

Data
====
The Java version is for the Java 1.4 platform (uses the Sun JVM), and it 
connects to an Oracle 9 database on our LAN via the thin JDBC driver provided 
by Oracle. As far as complexity goes, I'd rate the problem itself as 
non-trivial but not rocket science. :) Implementing it in Java turned out to 
be fairly involved; a lot of the complexity there revolved around efficiently 
maintaining data structures in memory that didn't get committed to the 
database until the end of each separate file. The developer for the Java 
version has about as much experience as a developer as I do, and he has about 
as much experience with Java as I do with Python (IOW, I don't consider that 
to be much of a differentiating factor - although the Java developer has 
*way* more database experience than I do).

For the Java version, getting from the start of the design to the end of the 
implementation and initial round of bug fixing took 3 weeks. An additional 
2-3 weeks were spent 
optimizing the code and fixing more bugs. The source code had 4700 lines in 9 
files. When running, it would process an average of 1050 records per second 
(where "process" = parse the file, do the calculations, and insert the 
results into the database).

The Python version is for Python 2.1.3 and it connects to the same Oracle 
database using DCOracle2. The implementation is very straightforward, nothing 
fancy. It's also still pretty "pristine" as I haven't had any time to try and 
optimize it yet. :) One of the biggest simplifying factors in the code was 
how easy it was to have dictionaries map dynamically-defined and 
arbitrarily-complex keys (made up of object tuples) to other objects. For 
whatever reason this seemed to be a huge factor in making data management go 
from a difficult problem to almost a no-brainer. If the Python version had 
come first and the Java one second, then some of that approach could have 
made it into the Java version, but in Java I don't find myself thinking about 
problems quite the same way - IOW the Python version's approach wouldn't be 
as obvious in Java or would seem to be too much up-front work (whereas the 
approach that was used was easier up front but in the end became quite 
complex and a limiting factor to adding new functionality IMO).
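
To illustrate the tuple-keyed dictionary idea, here is a minimal sketch. The 
field names (customer, url, hour, nbytes) and the Summary class are 
hypothetical and not taken from the real code, and it is written for a 
current Python rather than 2.1:

```python
class Summary:
    """Running totals for one group of log records (hypothetical fields)."""

    def __init__(self):
        self.hits = 0
        self.bytes = 0

    def add(self, nbytes):
        self.hits += 1
        self.bytes += nbytes


def summarize(records):
    """Group (customer, url, hour, nbytes) records under a tuple key."""
    summaries = {}
    for customer, url, hour, nbytes in records:
        key = (customer, url, hour)  # any tuple of hashable objects works
        summary = summaries.get(key)
        if summary is None:
            summary = summaries[key] = Summary()
        summary.add(nbytes)
    return summaries


# Made-up sample data, just to show the grouping.
records = [
    ("acme", "/index.html", 14, 512),
    ("acme", "/index.html", 14, 256),
    ("acme", "/about.html", 15, 100),
]
totals = summarize(records)
```

The point is that the key is built on the fly from whatever fields define a 
group, so no dedicated key class or nested map structure is needed.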

Because I wasn't working on this full-time, the development was spread out 
over the course of two weeks (10 working days) at an average of just over 2 
hours per day (for a total of not quite 3 full days of work). The source code 
was less than 700 lines in 4 files. Most surprising to me was that it 
processes an average of 1200 records per second! I had assumed that after I 
got it working I'd need to spend time optimizing things to make up for Java 
having a JIT compiler to speed things up, but for now I won't bother.

Both versions could be improved by splitting the log parsing/summarizing and 
database work into two separate threads (watching the processes I see periods 
of high CPU usage followed by near-idle time while the database churns away). 
Currently the Java version averages 47% CPU utilization and the Python 
version averages 51%.
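
The two-thread split could look roughly like the sketch below. Neither 
version actually does this; the function names, the batch size, and the 
placeholder "computation" are all assumptions for illustration, and the 
database call is replaced by appending to a list:

```python
import queue
import threading

_DONE = object()  # sentinel telling the writer thread to stop


def parse_and_summarize(lines, out_queue, batch_size=20):
    """Producer: stand-in for the CPU-heavy parsing/summarizing step."""
    batch = []
    for line in lines:
        batch.append(line.strip().upper())  # placeholder "computation"
        if len(batch) == batch_size:
            out_queue.put(batch)
            batch = []
    if batch:
        out_queue.put(batch)  # flush the final partial batch
    out_queue.put(_DONE)


def write_to_database(in_queue, written):
    """Consumer: stand-in for the database inserts/updates."""
    while True:
        batch = in_queue.get()
        if batch is _DONE:
            break
        written.extend(batch)  # a real version would call the DB driver here


def process(lines):
    """Run parsing in this thread and 'database' writes in a second one."""
    q = queue.Queue(maxsize=10)  # bounded so parsing can't run far ahead
    written = []
    writer = threading.Thread(target=write_to_database, args=(q, written))
    writer.start()
    parse_and_summarize(lines, q)
    writer.join()
    return written


result = process(["a\n", "b\n", "c\n"])
```

Whether this helps in the Python version would depend on the database driver 
releasing the interpreter lock while it waits on the database, which I 
haven't checked.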

Caveats
=======
There are a million reasons why this isn't an apples-to-apples comparison or 
why somebody might read this and cry "foul!"; here are a few off the top of 
my head:

- The second time you write something you can do it better - since development 
is often part exploratory, once you're done you usually have a good idea of 
how to do it better were you to do it again. I didn't write the first 
version, but writing the test suite (and fixing the bugs it uncovered) made 
me familiar with the weaknesses in the initial version.
- It's proprietary code; nobody else can see the two versions of source - Too 
bad. ;-)
- I haven't bothered to do a "true" LOC count for both versions - I just did 
"wc -l *.py" and "wc -l *.java" to get line counts - so comments and other 
junk are included in the line totals.
- My development time didn't include much design time because my design was 
mostly a reaction to the Java implementation.
- The Java version uses the thin JDBC driver (written in pure Java) and the 
Python version uses DCOracle2 (which uses a native driver on Linux) - this 
may affect performance some.
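
On the line-count caveat above: a somewhat "truer" count would skip blank 
lines and whole-line comments. The helper below is a rough sketch with a 
made-up name; it was not run against either codebase, and it does not handle 
docstrings or /* ... */ block comments:

```python
def count_code_lines(text, comment_prefix="#"):
    """Count lines that are neither blank nor whole-line comments."""
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefix):
            count += 1
    return count


sample = """\
# a comment

x = 1   # trailing comments still count as code
y = 2
"""
n = count_code_lines(sample)
```

For Java source the same function could be called with comment_prefix="//".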

Conclusion
==========
Like I said before, I'd hesitate to read *too* much into this experience, 
although I will say it's a more concrete and confirming example of what I 
(and many others) have experienced before - that there are some big 
productivity gains to be had by using Python over some other languages. It's 
generally 
hard to measure this sort of thing because it's rare to do a rewrite without 
adding functionality, and the larger the project the more rare this becomes, 
so even though this is a very small example it's still useful. I certainly 
wouldn't have gotten "approval" to do the simple rewrite, but doing it in my 
spare time and having a comprehensive test suite made it possible.

Anyway, for a very small development cost I ended up with a codebase 15% of 
the size of the original, and an implementation that will be far easier to 
extend (and easier to pass off to somebody else!). In order to get smaller 
and cleaner code I had been willing to take a modest performance hit, but in 
the end the new version was slightly faster too - icing on the cake!

-Dave