Python "compiler" is too slow for processing large data files???

Wed Aug 28 13:35:17 EDT 2002

Hi all:

I'm writing a fairly simple app that loads (text) data files, lets you edit
the data, and then saves them back out.  Rather than parsing the data myself
when loading a new file, and building data structures for the program's data
that way, I decided to try to 'import' (or 'exec', actually) the data into
the app.  The idea was to be able to format my data files as python code
(see below), and then let the python compiler do the parsing.  In the
future, I could actually put 'def' and 'class' statements right into the
data file to capture some behavior along with the data.

Simple example - I can import or exec this file to load my data (my real app
has int, float, and string data):
------ try5a3.py --------
list1 = [
    (323, 870, 46, ),
    (810, 336, 271, ),
    (572, 55, 596, ),
    (337, 256, 629, ),
    (31, 702, 16, ),
]
print len(list1)
---------------------------

Anyway, as my data files went from just a few lines up to about 8000 lines
(with 10 values in each line for a total of about 450KB of text), the 'exec'
of the file became too slow (e.g. 15 seconds) and used too much memory
(e.g. 50MB) (using ms-windows, python 2.2.1).  The slow part is the "compile"
phase, because if I re-run and there is a *.pyc file available, the import
goes very fast (no compilation required).
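
One way to check that the time really goes into compilation rather than
execution is to time the two steps separately.  This is only a rough sketch:
the helper name time_compile_and_exec is made up for illustration, it uses
just the built-in compile() and exec plus time.time(), and it is written in
the function-call style of newer Pythons rather than the 2.2 'exec' statement.

```python
import time

def time_compile_and_exec(source, filename="<data>"):
    """Return (compile_seconds, exec_seconds, namespace) for a source string.

    Hypothetical helper, not part of the app described in this post.
    """
    start = time.time()
    code = compile(source, filename, "exec")   # parse + bytecode generation
    compiled = time.time()

    namespace = {}
    exec(code, namespace)                      # run the already-compiled code
    finished = time.time()

    return compiled - start, finished - compiled, namespace

# Small stand-in for a generated data file (the trailing print is left out).
source = "list1 = [\n    (323, 870, 46, ),\n    (810, 336, 271, ),\n]\n"
compile_secs, exec_secs, data = time_compile_and_exec(source)
print("compile: %.4fs  exec: %.4fs  rows: %d"
      % (compile_secs, exec_secs, len(data["list1"])))
```

If the first number dominates for a large file, that would match what the
*.pyc behavior suggests.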

Since I'm using this method to load data files into an object-editor style
application, I don't want my files to take this long to load.  I can work
around this issue by doing repeated 'exec' (or even 'eval') statements, instead
of one big 'import' or 'exec', but that's more work and more limiting than
just loading the data in one statement.  I can also parse the data file
myself and build the data structures, which does run quickly, and is the
obvious way to load data into an editor app, but I was hoping to use python
itself as the language (file format) for my data files.
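
For reference, the repeated-'eval' workaround could look roughly like the
sketch below.  It assumes the layout of the example file above, i.e. exactly
one tuple per line, each starting with '(' and followed by a comma; the
function name load_rows is made up for illustration.  Passing an empty
__builtins__ dictionary to eval() only limits what an expression can reach,
so it should not be taken as real protection against hostile input.

```python
def load_rows(lines):
    """Evaluate each tuple line separately instead of compiling the whole file."""
    rows = []
    for line in lines:
        text = line.strip()
        # Skip the 'list1 = [' header, the closing ']', and any other
        # line that is not a tuple row.
        if not text.startswith("("):
            continue
        # Drop the trailing comma that separates the list items.
        if text.endswith(","):
            text = text[:-1]
        # Each row is a small expression, so each compile stays cheap.
        rows.append(eval(text, {"__builtins__": {}}))
    return rows

sample = """list1 = [
    (323, 870, 46, ),
    (810, 336, 271, ),
]
print len(list1)
"""
list1 = load_rows(sample.splitlines())
print(len(list1))
```

The limitation is the one mentioned above: this only handles data rows, so
'def' and 'class' statements in the data file would need separate handling.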

My tool is part of a chain of data processing, which is why I prefer to use
text data files rather than binary files:  the upstream processes can easily
export data as text files that provide initial input to my tool.  Also, it's
just nice to have human-readable data files, if at all possible.

Any ideas?  It's kind of unusual to compile big chunks of data like
this, but I was hoping it would be possible to standardize on using python
as the file format for new tools I develop.  Has anyone else tried to do
this and found trouble compiling large amounts of data?  Any work-arounds
that would allow the python compiler to handle bigger chunks of data like
this?

Ron Horn
rchii at lycos.com

---------------------------------------------------
For Fun: here is a simple program that I used to generate test cases for
testing the import of large amounts of data
---------------------------------------------------
# gen.py
# create a script to generate modules that will stress
# out the python compiler a bit?

import sys, random

def GenLine(numArgs, outfile):
    outfile.write("    (")
    for i in range(numArgs):
        outfile.write("%d, " % random.randint(0,1000))
    outfile.write("),\n")

def Generate(numLines, numArgs, outfile):
    outfile.write("list1 = [\n")
    for i in range(numLines):
        GenLine(numArgs, outfile)
    outfile.write("]\n")
    outfile.write("print len(list1)\n")

def main():
    outfile = None
    try:
        if len(sys.argv) != 4:
            print "Usage: gen <lines> <args> <outfile>"
            return

        numLines = int(sys.argv[1])
        numArgs = int(sys.argv[2])
        outfile = open(sys.argv[3], 'wt')

        Generate(numLines, numArgs, outfile)

    finally:
        if outfile: outfile.close()

if __name__ == "__main__":
    main()

#----------------------------------------------------------