Building arrays from text file data.

Wed May 1 04:16:43 EDT 2002

Joe Woodward wrote:

> Does anyone know of a faster way to build an array out of text file
> with formatted data. What I have been doing works well, but now my
> files are 200MB and up. The following way builds a Numeric.array
> without knowing the size to start with. I can then resize it after the
> fact.
> 
> 
> yldpos=Numeric.array(
>   map(float,re.split('\s+',string.strip(open('datafile.txt').read()))))

There are two fundamentally different approaches: either you read the
whole file into memory first and then process that huge string, or you
loop over the file a piece at a time.  The former tends to be quite
fast as long as all of your data fits comfortably into physical memory,
but its performance is "fragile": once the data grows too large you
risk 'thrashing' your VM system, and performance degrades terribly.

The little-at-a-time approach may therefore be preferable for files
that are large enough.  It takes some care about HOW you grow the
array you're preparing as new data comes in: if each resize copies
everything stored so far and you grow by too little at a time, you
may fall into quadratic behavior.
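
To see why the growth policy matters, here is a minimal sketch that
only counts element copies; it does not touch Numeric at all.  The
name copies_needed and the two growth rules are made up for
illustration, and the model assumes that every resize copies all the
elements stored so far.

```python
def copies_needed(n_items, grow):
    """Count element copies made while appending n_items one at a time.

    `grow` maps the current capacity to the new capacity whenever the
    buffer is full (hypothetical helper, for illustration only).
    """
    capacity = 1
    size = 0
    copied = 0
    for _ in range(n_items):
        if size == capacity:
            # Assumption: a resize copies every element stored so far.
            copied += size
            capacity = grow(capacity)
        size += 1
    return copied

N = 100000
fixed = copies_needed(N, lambda cap: cap + 100)   # add 100 slots per resize
doubling = copies_needed(N, lambda cap: cap * 2)  # double on every resize

print("fixed increment:", fixed, "copies")
print("doubling:       ", doubling, "copies")
```

With a fixed increment the copy count grows roughly with the square of
the number of items, while doubling keeps it below twice the number of
items.  That is the kind of difference to watch for if you resize the
array by hand.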

Python is pretty clever about how it grows its own lists, so you might
want to try an iterator-based approach (this needs Python 2.2) and see how
it fares as a compromise, e.g.:

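from __future__ import generators   # must come first; needed in Python 2.2
import Numeric

def allwords(fileobj):
    # yield each whitespace-separated word, reading one line at a time
    for line in fileobj:
        for word in line.split():
            yield word

yldpos=Numeric.array(map(float,allwords(open('datafile.txt'))))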

compared to an entirely list-based approach:

yldpos=Numeric.array( [ float(x) for line in open('datafile.txt')
    for x in line.split() ] )

These are both based on iterating over the file line by line rather
than gulping the whole file down at once.  The equivalent based on
the whole-gulp idea might be:

yldpos=Numeric.array(map(float,open('datafile.txt').read().split()))

Try each, and see how the performance goes...
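
A rough way to run that comparison is sketched below.  It is written
for a current Python 3 rather than 2.2, and it uses the standard
library's array module as a stand-in for Numeric.array so that it runs
without extra packages; the function names, the sample file and its
size are all assumptions made for illustration.

```python
import os
import tempfile
import time
from array import array  # stand-in for Numeric.array (assumption)

def by_generator(path):
    # Generator approach: yield one word at a time, line by line.
    def allwords(fileobj):
        for line in fileobj:
            for word in line.split():
                yield word
    with open(path) as f:
        return array('d', map(float, allwords(f)))

def by_listcomp(path):
    # List-based approach: build the full list of floats, then convert.
    with open(path) as f:
        return array('d', [float(x) for line in f for x in line.split()])

def by_gulp(path):
    # Whole-gulp approach: read the entire file into one string first.
    with open(path) as f:
        return array('d', map(float, f.read().split()))

def time_it(func, path):
    # Return (elapsed seconds, result) for one call of func(path).
    start = time.perf_counter()
    result = func(path)
    return time.perf_counter() - start, result

# Write a small sample file: 20000 lines of three numbers each.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    for i in range(20000):
        f.write('%d.5 %d.25 %d.0\n' % (i, i, i))

timings = {}
results = {}
for func in (by_generator, by_listcomp, by_gulp):
    elapsed, values = time_it(func, sample_path)
    timings[func.__name__] = elapsed
    results[func.__name__] = values
    print('%-13s %.4f s  (%d values)' % (func.__name__, elapsed, len(values)))

os.remove(sample_path)
```

A sample this small fits easily in memory, so it says nothing about
the thrashing case; to learn anything about 200MB files you would have
to point the three functions at a file of that size on your own machine.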

Alex