[Numpy-discussion] reading *big* inhomogenous text matrices *fast*?
Daniel Lenski
dlenski at gmail.com
Wed Aug 13 21:44:13 EDT 2008
On Wed, 13 Aug 2008 16:57:32 -0400, Zachary Pincus wrote:
> Your approach generates numerous large temporary arrays and lists. If
> the files are large, the slowdown could be because all that memory
> allocation is causing some VM thrashing. I've run into that at times
> parsing large text files.
Thanks, Zach. I do think you have the right explanation for what was
wrong with my code.
I thought the slowdown was due to the overhead of interpreted code, so I
tried to do everything in list comprehensions and array statements rather
than explicit Python loops. But you were definitely right: the slowdown
was due to memory use, not interpreted code.
> Perhaps better would be to iterate through the file, building up your
> cells dictionary incrementally. Finally, once the file is read in
> fully, you could convert what you can to arrays...
>
> f = open('big_file')
> header = f.readline()
> cells = {'tet':[], 'hex':[], 'quad':[]}
> for line in f:
>     vals = line.split()
>     index_property = vals[:2]
>     type = vals[2]
>     nodes = vals[3:]
>     cells[type].append(index_property+nodes)
> for type, vals in cells.items():
>     cells[type] = numpy.array(vals, dtype=int)
This is similar to what I tried originally! Unfortunately, repeatedly
appending to a list seems to be very slow... I guess Python keeps
reallocating and copying the list as it grows. (It would be nice to be
able to tune the increments by which the list size increases.)
> I'm not sure if this is exactly what you want, but you get the idea...
> Anyhow, the above only uses about twice as much RAM as the size of the
> file. Your approach looks like it uses several times more than that.
>
> Also you could see if:
> cells[type].append(numpy.array([index, property]+nodes, dtype=int))
>
> is faster than what's above... it's worth testing.
Repeatedly concatenating arrays with numpy.append or numpy.concatenate is
also quite slow, unfortunately. :-(
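To illustrate why repeated concatenation is slow: numpy.append never works in place, so growing an array row by row copies all previously stored rows on every call. The following is a minimal sketch with made-up data (the `rows` values are hypothetical, not taken from the real file) that contrasts this pattern with a single conversion at the end. Both produce the same array; only the amount of copying differs.

```python
import numpy as np

# Hypothetical sample rows standing in for parsed lines of the file.
rows = [[1, 10, 4, 5, 6], [2, 10, 7, 8, 9], [3, 20, 1, 2, 3]]

# Pattern 1: grow the array one row at a time.
# Each numpy.append call allocates a new array and copies everything
# accumulated so far, so the total copying grows quadratically.
grown = np.empty((0, 5), dtype=int)
for row in rows:
    grown = np.append(grown, [row], axis=0)

# Pattern 2: accumulate plain Python lists and convert once at the end.
converted = np.array(rows, dtype=int)
```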
> If even that's too slow, maybe you'll need to do this in C? That
> shouldn't be too hard, really.
Yeah, I eventually came up with a decent Python solution:
preallocate the arrays to the maximum size that might be needed. Trim
them down afterwards. This is very wasteful of memory when there may be
many cell types (less so if the OS does lazy allocation), but in the
typical case of only a few cell types it works great:
def _read_cells(self, f, n, debug=False):
    cells = dict()
    count = dict()
    curtype = None
    for i in xrange(n):
        cell = f.readline().split()
        celltype = cell[2]
        if celltype!=curtype:
            curtype = celltype
            if curtype not in cells:
                # allocate as big an array as might possibly be needed
                cells[curtype] = N.empty((n-i, len(cell)-1),
                                         dtype=int)
                count[curtype] = 0
            block = cells[curtype]
        # put the line just read into the preallocated array
        block[count[curtype]] = cell[:2]+cell[3:]
        count[curtype] += 1
    # trim the arrays down to size actually used
    for k in cells:
        cells[k] = cells[k][:count[k]].T
    return cells
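For readers on Python 3, the same preallocate-and-trim idea can be sketched as a standalone function. This is an illustrative adaptation, not the code from the post: the name `read_cells`, the sample data, and the explicit `int()` conversion are assumptions, and it looks up the array for every line instead of caching the current block.

```python
import io
import numpy as np

def read_cells(f, n):
    """Read n cell lines from f, grouping rows by the type in the third column."""
    cells = {}
    count = {}
    for i in range(n):
        fields = f.readline().split()
        celltype = fields[2]
        if celltype not in cells:
            # Worst case: every remaining line has this cell type.
            cells[celltype] = np.empty((n - i, len(fields) - 1), dtype=int)
            count[celltype] = 0
        # Store every field except the type column, converted to int.
        row = fields[:2] + fields[3:]
        cells[celltype][count[celltype]] = [int(v) for v in row]
        count[celltype] += 1
    # Trim each array to the rows actually used, then transpose as in the post.
    return {k: cells[k][:count[k]].T for k in cells}

# Hypothetical sample: index, property, type, then node numbers.
sample = io.StringIO(
    "1 10 tet 1 2 3 4\n"
    "2 10 tet 2 3 4 5\n"
    "3 20 quad 1 2 3 4\n"
)
result = read_cells(sample, 3)
```

Note that the trimmed slices are views onto the oversized buffers, so the full allocation stays alive as long as the result does. Appending `.copy()` to the slice would let the large buffer be released, at the cost of one extra copy per cell type.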
I hope this recipe may prove useful to others. It would be nice if NumPy
had a built-in facility for arrays that intelligently expand their
allocation as they grow. But I suppose that reading from badly-designed
file formats would be one of the only applications for it :-(
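One way such a facility could look is a small wrapper that doubles its buffer whenever it fills up, so that the total copying stays proportional to the number of rows appended. The sketch below is only an illustration of that idea: `GrowableRows` is a hypothetical helper, not part of NumPy, and the starting capacity is an arbitrary choice.

```python
import numpy as np

class GrowableRows:
    """Hypothetical helper: a 2-D int array that doubles its capacity when full."""

    def __init__(self, ncols, capacity=16):
        self._data = np.empty((capacity, ncols), dtype=int)
        self._size = 0

    def append(self, row):
        if self._size == self._data.shape[0]:
            # Double the capacity so that n appends copy O(n) rows in total.
            bigger = np.empty((2 * self._data.shape[0], self._data.shape[1]),
                              dtype=int)
            bigger[:self._size] = self._data
            self._data = bigger
        self._data[self._size] = row
        self._size += 1

    def to_array(self):
        # Return a copy so the oversized buffer can be freed afterwards.
        return self._data[:self._size].copy()

buf = GrowableRows(ncols=3, capacity=2)
for r in range(5):
    buf.append([r, r + 1, r + 2])
out = buf.to_array()
```

Compared with preallocating for the worst case, this never reserves more than about twice the memory actually used, and it does not need to know the number of rows in advance.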
Dan