efficient data loading with Python, is that possible?
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Wed Dec 12 23:11:19 EST 2007
On Wed, 12 Dec 2007 16:44:01 -0800, igor.tatarinov wrote:
> Here is some of my code. Tell me what's wrong with it :)
>
> def loadFile(inputFile, loader):
>     # .zip files don't work with zlib
Pardon?
>     f = popen('zcat ' + inputFile)
>     for line in f:
>         loader.handleLine(line)
Do you really need to compress the file? Five million lines isn't a lot.
It depends on the length of each line, naturally, but I'd be surprised if
it were more than 100MB.
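As an aside, if the input really is gzip-compressed you don't need zcat at
all: the gzip module reads .gz files directly. Here's a minimal sketch,
assuming the file is a .gz and not a .zip archive (those are handled by
the zipfile module, not zlib):

import gzip

def loadFile(inputFile, loader):
    # Decompress in-process instead of shelling out to zcat.
    f = gzip.open(inputFile, 'r')
    try:
        line = f.readline()
        while line:
            loader.handleLine(line)
            line = f.readline()
    finally:
        f.close()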
> ...
>
> In Loader class:
> def handleLine(self, line):
>     # filter out 'wrong' lines
>     if not self._dataFormat(line): return
Who knows what the _dataFormat() method does? How complicated is it? Why
is it a private method?
>     # add a new output record
>     rec = self.result.addRecord()
Who knows what this does? How complicated is it?
>     for col in self._dataFormat.colFormats:
Hmmm... a moment ago, _dataFormat seemed to be a method, or at least a
callable. Now it has grown a colFormats attribute. Complicated and
confusing.
>         value = parseValue(line, col)
>         rec[col.attr] = value
>
> And here is parseValue (will using a hash-based dispatch make it much
> faster?):
Possibly, but not enough to reduce 20 minutes to one or two.
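If you want to try it anyway, "hash-based dispatch" in Python is just a
dict mapping some key -- say, a column type code -- to a converter
function. A rough sketch; the typecode, start and end attributes here are
invented for illustration, not taken from your code:

def parseBool(s):
    # Treat the usual spellings as true, everything else as false.
    return s.strip().upper() in ('TRUE', 'YES', '1', 'Y', 'ON')

# One converter per column type, found with a single dict lookup.
DISPATCH = {
    'int': int,
    'float': float,
    'bool': parseBool,
}

def parseValue(line, col):
    # A dict lookup replaces a chain of if/elif type tests.
    return DISPATCH[col.typecode](line[col.start:col.end])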
But you know something? Your code looks like a bad case of over-
generalisation. I assume it's a translation of your C++ code -- no wonder
it takes an entire minute to process the file! (Oh lord, did I just say
that???) Object-oriented programming is a useful tool, but sometimes you
don't need a HyperDispatcherLoaderManagerCreator, you just need a hammer.
In your earlier post, you gave the data specification:
"The text file has a fixed format (like a punchcard). The columns contain
integer, real, and date values. The output files are the same values in
binary."
Easy-peasy. First, some test data:
fp = open('BIG', 'w')
for i in xrange(5000000):
    anInt = i % 3000
    aBool = ['TRUE', 'YES', '1', 'Y', 'ON',
             'FALSE', 'NO', '0', 'N', 'OFF'][i % 10]
    aFloat = ['1.12', '-3.14', '0.0', '7.42'][i % 4]
    fp.write('%s %s %s\n' % (anInt, aBool, aFloat))
    if i % 45000 == 0:
        # Write a comment and a blank line.
        fp.write('# this is a comment\n \n')
fp.close()
Now let's process it:
import struct

# Define converters for each type of value to binary.

def fromBool(s):
    """String to boolean byte."""
    s = s.upper()
    if s in ('TRUE', 'YES', '1', 'Y', 'ON'):
        return struct.pack('b', True)
    elif s in ('FALSE', 'NO', '0', 'N', 'OFF'):
        return struct.pack('b', False)
    else:
        raise ValueError('not a valid boolean')

def fromInt(s):
    """String to integer bytes."""
    return struct.pack('l', int(s))

def fromFloat(s):
    """String to floating point bytes."""
    return struct.pack('f', float(s))

# Assume three fields...
DEFAULT_FORMAT = [fromInt, fromBool, fromFloat]
# And three files...
OUTPUT_FILES = ['ints.out', 'bools.out', 'floats.out']

def process_line(s, format=DEFAULT_FORMAT):
    s = s.strip()
    fields = s.split()  # I assume the fields are whitespace separated
    assert len(fields) == len(format)
    return [f(x) for (x, f) in zip(fields, format)]

def process_file(infile, outfiles=OUTPUT_FILES):
    out = [open(f, 'wb') for f in outfiles]
    for line in file(infile, 'r'):
        # ignore leading/trailing whitespace and comments
        line = line.strip()
        if line and not line.startswith('#'):
            fields = process_line(line)
            # now write the fields to the files
            for x, fp in zip(fields, out):
                fp.write(x)
    for f in out:
        f.close()
And now let's use it and see how long it takes:
>>> import time
>>> s = time.time(); process_file('BIG'); time.time() - s
129.58465385437012
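If you want to convince yourself the output is sane, struct.unpack will
read the records straight back. For instance, to eyeball the first few
integers (using the same 'l' format code as above):

import struct

def peek_ints(filename='ints.out', count=5):
    # Unpack and print the first few packed longs.
    fp = open(filename, 'rb')
    size = struct.calcsize('l')
    for i in xrange(count):
        print struct.unpack('l', fp.read(size))[0]
    fp.close()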
Naturally if your converters are more complex (e.g. date-time), or if you
have more fields, it will take longer to process, but then I've made no
effort at all to optimize the code.
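For what it's worth, a date converter needn't add much. Supposing -- and
this is my assumption, not your spec -- that the dates look like
'2007-12-12':

import struct, time

def fromDate(s):
    """String 'YYYY-MM-DD' to packed year, month and day bytes."""
    t = time.strptime(s, '%Y-%m-%d')
    return struct.pack('hbb', t.tm_year, t.tm_mon, t.tm_mday)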
--
Steven.