efficient data loading with Python, is that possible?
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Wed Dec 12 23:11:19 EST 2007
On Wed, 12 Dec 2007 16:44:01 -0800, igor.tatarinov wrote:
> Here is some of my code. Tell me what's wrong with it :)
>
> def loadFile(inputFile, loader):
>     # .zip files don't work with zlib
Pardon?
>     f = popen('zcat ' + inputFile)
>     for line in f:
>         loader.handleLine(line)
Do you really need to compress the file? Five million lines isn't a lot.
It depends on the length of each line, naturally, but I'd be surprised if
it were more than 100MB.
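As an aside, if the input really is gzip-compressed you don't need zcat at
all: the gzip module reads .gz files directly. Here's a minimal sketch,
assuming the file is a .gz and not a .zip archive (those are handled by
the zipfile module, not zlib):

import gzip

def loadFile(inputFile, loader):
    # Decompress in-process instead of shelling out to zcat.
    f = gzip.open(inputFile, 'r')
    try:
        line = f.readline()
        while line:
            loader.handleLine(line)
            line = f.readline()
    finally:
        f.close()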
> ...
>
> In Loader class:
> def handleLine(self, line):
>     # filter out 'wrong' lines
>     if not self._dataFormat(line): return
Who knows what the _dataFormat() method does? How complicated is it? Why
is it a private method?
>     # add a new output record
>     rec = self.result.addRecord()
Who knows what this does? How complicated is it?
>     for col in self._dataFormat.colFormats:
Hmmm... a moment ago, _dataFormat seemed to be a method, or at least a
callable. Now it has grown a colFormats attribute. Complicated and
confusing.
>         value = parseValue(line, col)
>         rec[col.attr] = value
>
> And here is parseValue (will using a hash-based dispatch make it much
> faster?):
Possibly, but not enough to reduce 20 minutes to one or two.
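If you want to try it anyway, "hash-based dispatch" in Python is just a
dict mapping some key -- say, a column type code -- to a converter
function. A rough sketch; the typecode, start and end attributes here are
invented for illustration, not taken from your code:

def parseBool(s):
    # Treat the usual spellings as true, everything else as false.
    return s.strip().upper() in ('TRUE', 'YES', '1', 'Y', 'ON')

# One converter per column type, found with a single dict lookup.
DISPATCH = {
    'int': int,
    'float': float,
    'bool': parseBool,
}

def parseValue(line, col):
    # A dict lookup replaces a chain of if/elif type tests.
    return DISPATCH[col.typecode](line[col.start:col.end])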
But you know something? Your code looks like a bad case of over-
generalisation. I assume it's a translation of your C++ code -- no wonder
it takes an entire minute to process the file! (Oh lord, did I just say
that???) Object-oriented programming is a useful tool, but sometimes you
don't need a HyperDispatcherLoaderManagerCreator, you just need a hammer.
In your earlier post, you gave the data specification:
"The text file has a fixed format (like a punchcard). The columns contain
integer, real, and date values. The output files are the same values in
binary."
Easy-peasy. First, some test data:
fp = open('BIG', 'w')
for i in xrange(5000000):
    anInt = i % 3000
    aBool = ['TRUE', 'YES', '1', 'Y', 'ON',
             'FALSE', 'NO', '0', 'N', 'OFF'][i % 10]
    aFloat = ['1.12', '-3.14', '0.0', '7.42'][i % 4]
    fp.write('%s %s %s\n' % (anInt, aBool, aFloat))
    if i % 45000 == 0:
        # Write a comment and a blank line.
        fp.write('# this is a comment\n \n')
fp.close()
Now let's process it:
import struct

# Define converters for each type of value to binary.

def fromBool(s):
    """String to boolean byte."""
    s = s.upper()
    if s in ('TRUE', 'YES', '1', 'Y', 'ON'):
        return struct.pack('b', True)
    elif s in ('FALSE', 'NO', '0', 'N', 'OFF'):
        return struct.pack('b', False)
    else:
        raise ValueError('not a valid boolean')

def fromInt(s):
    """String to integer bytes."""
    return struct.pack('l', int(s))

def fromFloat(s):
    """String to floating point bytes."""
    return struct.pack('f', float(s))

# Assume three fields...
DEFAULT_FORMAT = [fromInt, fromBool, fromFloat]
# And three files...
OUTPUT_FILES = ['ints.out', 'bools.out', 'floats.out']

def process_line(s, format=DEFAULT_FORMAT):
    s = s.strip()
    fields = s.split()  # I assume the fields are whitespace separated
    assert len(fields) == len(format)
    return [f(x) for (x, f) in zip(fields, format)]

def process_file(infile, outfiles=OUTPUT_FILES):
    out = [open(f, 'wb') for f in outfiles]
    for line in file(infile, 'r'):
        # ignore leading/trailing whitespace and comments
        line = line.strip()
        if line and not line.startswith('#'):
            fields = process_line(line)
            # now write the fields to the files
            for x, fp in zip(fields, out):
                fp.write(x)
    for f in out:
        f.close()
And now let's use it and see how long it takes:
>>> import time
>>> s = time.time(); process_file('BIG'); time.time() - s
129.58465385437012
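If you want to convince yourself the output is sane, struct.unpack will
read the records straight back. For instance, to eyeball the first few
integers (using the same 'l' format code as above):

import struct

def peek_ints(filename='ints.out', count=5):
    # Unpack and print the first few packed longs.
    fp = open(filename, 'rb')
    size = struct.calcsize('l')
    for i in xrange(count):
        print struct.unpack('l', fp.read(size))[0]
    fp.close()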
Naturally if your converters are more complex (e.g. date-time), or if you
have more fields, it will take longer to process, but then I've made no
effort at all to optimize the code.
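For what it's worth, a date converter needn't add much. Supposing -- and
this is my assumption, not your spec -- that the dates look like
'2007-12-12':

import struct, time

def fromDate(s):
    """String 'YYYY-MM-DD' to packed year, month and day bytes."""
    t = time.strptime(s, '%Y-%m-%d')
    return struct.pack('hbb', t.tm_year, t.tm_mon, t.tm_mday)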
--
Steven.