efficient data loading with Python, is that possible?

igor.tatarinov at gmail.com
Wed Dec 12 17:48:03 EST 2007


Hi, I am pretty new to Python and trying to use it for a relatively
simple problem: loading a 5-million-line text file and converting it
into a few binary files. The text file has a fixed-width format (like a
punch card). The columns contain integer, real, and date values. The
output files hold the same values in binary. I have to parse the values
and write the binary tuples out to the correct file based on a given
column. It's a little more involved than that, but the details aren't important.
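
For concreteness, here is a rough sketch of my naive first version
(the column offsets and record layout below are made up, not my real
ones):

    import struct, time

    rec = struct.Struct("=qdq")  # int64 key, float64 value, int64 epoch seconds

    def convert(in_path):
        outs = {}  # one output file per value of the key column
        for line in open(in_path):
            key = int(line[0:10])
            val = float(line[10:22])
            ts = int(time.mktime(time.strptime(line[22:34], "%m%d%y%H%M%S")))
            out = outs.get(key)
            if out is None:
                out = outs[key] = open("part_%d.bin" % key, "wb")
            out.write(rec.pack(key, val, ts))
        for out in outs.values():
            out.close()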

I have a C++ prototype of the parsing code, and it loads a 5-million-line
file in about a minute. I was expecting the Python version to be 3-4 times
slower, and I could live with that. Unfortunately, it's 20 times slower,
and I don't see how I can fix that.

The fundamental difference is that in C++ I create a single object (a
line buffer) that is reused for each input line, and column values are
extracted straight from that buffer without creating new string
objects. In Python, new objects must be created and destroyed by the
million, which must incur serious memory-management overhead.
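
The closest thing I can see in Python is reading fixed-length records
into one preallocated, reusable buffer; a sketch, assuming every record
(newline included) is exactly RECLEN bytes (and I am not sure readinto
and bytearray are available in every Python version):

    import struct

    RECLEN = 35                        # made-up fixed record length, '\n' included
    rec = struct.Struct("10s12s12sx")  # made-up widths; x skips the newline

    def scan(path):
        buf = bytearray(RECLEN)  # one reusable buffer, like my C++ line buffer
        f = open(path, "rb")
        n = 0
        while f.readinto(buf) == RECLEN:
            key_s, val_s, ts_s = rec.unpack_from(buf)  # still creates 3 strings
            n += 1
        f.close()
        return n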

Correct me if I am wrong, but (see the sketch after this list):

1) for line in file: ...
will create a new string object for every input line

2) line[start:end]
will create a new string object as well

3) int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))
will create about a dozen objects (struct_time alone has 9 fields)

4) a simple test like (line[i:j] + line[m:n]) in some_dict
creates 3 temporary strings, and there is no way to avoid that.
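
The best workaround I have found so far is to do all the slicing in a
single C-level unpack call and to parse the date by hand instead of
going through strptime; a sketch (column widths made up again). It
still creates one string per field, but it drops strptime and its
struct_time, which I suspect is the biggest cost:

    import struct, time

    rec = struct.Struct("10s12s12sx")  # made-up widths; x skips the newline

    def parse_ts(s):
        # hand-rolled "%m%d%y%H%M%S" -> epoch seconds; assumes the
        # POSIX %y pivot (69-99 -> 19xx, 00-68 -> 20xx)
        yy = int(s[4:6])
        year = yy + (2000 if yy < 69 else 1900)
        return int(time.mktime((year, int(s[0:2]), int(s[2:4]),
                                int(s[6:8]), int(s[8:10]), int(s[10:12]),
                                0, 0, -1)))  # -1 lets mktime guess DST

    def parse_line(line):
        key_s, val_s, ts_s = rec.unpack(line)  # one call instead of 3 slices
        return int(key_s), float(val_s), parse_ts(ts_s)

    def load(path):
        for line in open(path, "rb"):  # binary mode: unpack wants raw bytes
            yield parse_line(line)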

I thought the array module would help, but I can't load an array without
creating a string first: something like ar(line, start, end) is not supported.
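
On the output side, at least, array.array does seem to help: the parsed
values can be accumulated in a typed array and dumped in one bulk write
instead of packing row by row. A sketch:

    import array

    def write_column(values, path):
        # values: any iterable of Python floats
        col = array.array("d")  # machine-native doubles, 8 bytes each
        col.extend(values)      # no struct.pack per row
        f = open(path, "wb")
        col.tofile(f)           # one bulk write of the raw doubles
        f.close()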

I hope I am missing something. I really like Python but if there is no
way to process data efficiently, that seems to be a problem.

Thanks,
igor



