efficient data loading with Python, is that possible?
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Wed Dec 12 20:52:53 EST 2007
On Wed, 12 Dec 2007 14:48:03 -0800, igor.tatarinov wrote:
> Hi, I am pretty new to Python and trying to use it for a relatively
> simple problem of loading a 5 million line text file and converting it
> into a few binary files. The text file has a fixed format (like a
> punchcard). The columns contain integer, real, and date values. The
> output files are the same values in binary. I have to parse the values
> and write the binary tuples out into the correct file based on a given
> column. It's a little more involved but that's not important.
I suspect that this actually is important, and that your slowdown has
everything to do with the stuff you dismiss and nothing to do with
Python's object model or execution speed.
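To make the baseline concrete, here is a minimal sketch of the bare parsing step as described: slice fixed-width columns, convert them, and pack them into a binary record. The column layout (8-character integer, 10-character real, 10-character YYYY-MM-DD date), the record format, and the names parse_line and pack_line are all assumptions for illustration, since the real format was not posted. It is written for Python 3.

```python
import struct
from datetime import date

# Hypothetical layout (not from the original post):
# columns 0-7 integer, 8-17 real, 18-27 date as YYYY-MM-DD.
FIELDS = ((0, 8), (8, 18), (18, 28))

# Assumed binary record: int32, float64, int32 (date stored as an ordinal).
RECORD = struct.Struct('<idi')


def parse_line(line):
    """Slice one fixed-width line into (int, float, date ordinal)."""
    int_text, real_text, date_text = [line[a:b] for a, b in FIELDS]
    year, month, day = (int(part) for part in date_text.split('-'))
    return int(int_text), float(real_text), date(year, month, day).toordinal()


def pack_line(line):
    """Convert one text line into its packed binary record."""
    return RECORD.pack(*parse_line(line))


if __name__ == '__main__':
    sample = '%8d%10.2f%s' % (42, 3.5, '2007-12-12')
    print(repr(pack_line(sample)))
```

If a loop over something like this still takes twenty minutes for 5M lines, the extra time is most likely in the parts not shown.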
> I have a C++ prototype of the parsing code and it loads a 5 Mline file
> in about a minute. I was expecting the Python version to be 3-4 times
> slower and I can live with that. Unfortunately, it's 20 times slower and
> I don't see how I can fix that.
I've run a quick test on my machine with a mere 1GB of RAM, reading the
entire file into memory at once, and then doing some quick processing on
each line:
>>> def make_big_file(name, size=5000000):
...     fp = open(name, 'w')
...     for i in xrange(size):
...         fp.write('here is a bunch of text with a newline\n')
...     fp.close()
...
>>> make_big_file('BIG')
>>>
>>> def test(name):
...     import time
...     start = time.time()
...     fp = open(name, 'r')
...     for line in fp.readlines():
...         line = line.strip()
...         words = line.split()
...     fp.close()
...     return time.time() - start
...
>>> test('BIG')
22.53150200843811
Twenty-two seconds to read five million lines and split them into words.
I suggest the other nineteen minutes and forty-odd seconds your code is
taking has something to do with your code and not Python's execution
speed.
Of course, I wouldn't normally read all 5M lines into memory in one big
chunk. Replace the code
    for line in fp.readlines():
with
    for line in fp:
and the time drops from 22 seconds to 16.
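To repeat the comparison yourself, something along these lines should do. It is only a sketch, written for Python 3 rather than the Python 2 session above; the function names and the smaller default file size are my own choices, and the timings will differ from machine to machine.

```python
import os
import tempfile
import time


def make_file(path, size=100000):
    # Same test line as in the session above, with a smaller default size
    # so the sketch runs quickly.
    with open(path, 'w') as fp:
        for _ in range(size):
            fp.write('here is a bunch of text with a newline\n')


def time_readlines(path):
    # readlines() builds the complete list of lines before the loop starts.
    start = time.time()
    count = 0
    with open(path, 'r') as fp:
        for line in fp.readlines():
            words = line.strip().split()
            count += 1
    return count, time.time() - start


def time_iteration(path):
    # Iterating over the file object yields one line at a time.
    start = time.time()
    count = 0
    with open(path, 'r') as fp:
        for line in fp:
            words = line.strip().split()
            count += 1
    return count, time.time() - start


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'BIG')
    make_file(path)
    print('readlines: %d lines in %.3f seconds' % time_readlines(path))
    print('iteration: %d lines in %.3f seconds' % time_iteration(path))
```

Both functions do the same work per line; the only difference is whether the whole file is held in memory as a list first.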
--
Steven