[Numpy-discussion] reading *big* inhomogenous text matrices *fast*?

Daniel Lenski dlenski at gmail.com
Wed Aug 13 21:44:13 EDT 2008


On Wed, 13 Aug 2008 16:57:32 -0400, Zachary Pincus wrote:
> Your approach generates numerous large temporary arrays and lists. If
> the files are large, the slowdown could be because all that memory
> allocation is causing some VM thrashing. I've run into that at times
> parsing large text files.

Thanks, Zach.  I do think you have the right explanation for what was 
wrong with my code.  

I thought the slowdown was due to the overhead of interpreted code, so I 
tried to do everything with list comprehensions and whole-array operations 
rather than explicit Python loops.  But you were definitely right: the 
slowdown was due to memory use, not interpreted code.

> Perhaps better would be to iterate through the file, building up your
> cells dictionary  incrementally. Finally, once the file is read in
> fully, you could convert what you can to arrays...
> 
> import numpy
>
> f = open('big_file')
> header = f.readline()
> cells = {'tet':[], 'hex':[], 'quad':[]}
> for line in f:
>    vals = line.split()
>    index_property = vals[:2]
>    type = vals[2]
>    nodes = vals[3:]
>    cells[type].append(index_property+nodes)
> for type, vals in cells.items():
>    cells[type] = numpy.array(vals, dtype=int)

This is similar to what I tried originally!  Unfortunately, repeatedly 
appending to a list seems to be very slow... I guess Python keeps 
reallocating and copying the list as it grows.  (It would be nice to be 
able to tune the increments by which the list size increases.)
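
For the record, CPython does over-allocate the list in chunks rather than 
reallocating on every append, though the increments aren't tunable.  A 
little sketch to watch the growth pattern (needs Python >= 2.6 for 
sys.getsizeof):

    import sys

    # Watch CPython's list over-allocation: sys.getsizeof(lst) only changes
    # when the underlying pointer array is reallocated, not on every append.
    lst = []
    last = sys.getsizeof(lst)
    for i in xrange(1000):
        lst.append(None)
        size = sys.getsizeof(lst)
        if size != last:
            print len(lst), size    # the capacity grew at this length
            last = size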

> I'm not sure if this is exactly what you want, but you get the idea...
> Anyhow, the above only uses about twice as much RAM as the size of the
> file. Your approach looks like it uses several times more than that.
> 
> Also you could see if:
>    cells[type].append(numpy.array([index, property]+nodes, dtype=int))
> 
> is faster than what's above... it's worth testing.

Repeatedly concatenating arrays with numpy.append or numpy.concatenate is 
also quite slow, unfortunately: each call allocates a fresh array and copies 
all of the existing data into it, so building an array up row by row does 
O(n^2) work. :-(
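
For concreteness, this is the shape of the pattern that gets slow (just a 
sketch; the sizes are arbitrary):

    import numpy as N

    # Anti-pattern: every append allocates a brand-new array and copies all
    # the existing rows into it, so the total work grows quadratically.
    a = N.empty((0, 8), dtype=int)
    for i in xrange(10000):
        row = N.arange(8)
        a = N.append(a, row[N.newaxis, :], axis=0)   # full copy each time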

> If even that's too slow, maybe you'll need to do this in C? That
> shouldn't be too hard, really.

Yeah, I eventually came up with a decent pure-Python solution: 
preallocate the arrays to the maximum size that might be needed.  Trim 
them down afterwards.  This is very wasteful of memory when there may be 
many cell types (less so if the OS does lazy allocation), but in the 
typical case of only a few cell types it works great:

    def _read_cells(self, f, n, debug=False):
        cells = dict()
        count = dict()
        curtype = None

        for i in xrange(n):
            cell = f.readline().split()
            celltype = cell[2]

            if celltype != curtype:
                curtype = celltype
                if curtype not in cells:
                    # allocate as big an array as might possibly be needed
                    cells[curtype] = N.empty((n-i, len(cell)-1),
                                             dtype=int)
                    count[curtype] = 0
                block = cells[curtype]

            # put the line just read into the preallocated array
            block[count[curtype]] = cell[:2]+cell[3:]
            count[curtype] += 1

        # trim the arrays down to size actually used
        for k in cells:
            cells[k] = cells[k][:count[k]].T

        return cells
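
The method never actually uses self, so pasted out as a plain function it 
can be driven along these lines (the file name and the way n is obtained 
here are just placeholders):

    f = open('mesh.txt')              # placeholder file name
    f.readline()                      # skip the header line, if any
    n = 100000                        # number of cell lines, however your format records it
    cells = _read_cells(None, f, n)   # self is unused, so None will do for a quick test
    for celltype, block in cells.items():
        print celltype, block.shape   # each block comes back transposed: (ncols, count)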

I hope this recipe may prove useful to others.  It would be nice if NumPy 
had a built-in facility for arrays that intelligently expand their 
allocation as they grow.  But I suppose that reading from badly-designed 
file formats would be one of the only applications for it :-(
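
Something along these lines is roughly what I mean (just a sketch, with an 
arbitrary doubling policy and made-up names):

    import numpy as N

    class GrowableArray(object):
        # Sketch of a 2-D array that over-allocates as it grows.
        def __init__(self, ncols, capacity=1024, dtype=int):
            self._data = N.empty((capacity, ncols), dtype=dtype)
            self._n = 0

        def append(self, row):
            if self._n == len(self._data):
                # out of room: double the allocation so copies are amortized
                bigger = N.empty((2*len(self._data), self._data.shape[1]),
                                 dtype=self._data.dtype)
                bigger[:self._n] = self._data
                self._data = bigger
            self._data[self._n] = row
            self._n += 1

        def finish(self):
            # trim down to the rows actually appended, same trick as above
            return self._data[:self._n]

With that, each append is an amortized O(1) copy into preallocated space, 
and finish() is the same trim step that _read_cells does at the end.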

Dan



