[Numpy-discussion] reading *big* inhomogenous text matrices *fast*?
Zachary Pincus
zachary.pincus at yale.edu
Wed Aug 13 16:57:32 EDT 2008
Hi Dan,
Your approach generates numerous large temporary arrays and lists. If
the files are large, the slowdown could be because all that memory
allocation is causing some VM thrashing. I've run into that at times
parsing large text files.
Perhaps it would be better to iterate through the file, building up your
cells dictionary incrementally. Then, once the file has been fully read,
you can convert each list to an array:
import numpy

f = open('big_file')
header = f.readline()

cells = {'tet': [], 'hex': [], 'quad': []}
for line in f:
    vals = line.split()
    index_property = vals[:2]  # index and property columns
    type = vals[2]             # cell type is the third field
    nodes = vals[3:]           # remaining fields are the node ids
    cells[type].append(index_property + nodes)

for type, vals in cells.items():
    cells[type] = numpy.array(vals, dtype=int)
I'm not sure if this is exactly what you want, but you get the idea...
Anyhow, the above only uses about twice as much RAM as the size of the
file; your approach looks like it uses several times more than that.
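As a quick sanity check, here's a self-contained version of that loop run
on a few sample lines from your data (the in-memory list standing in for
the file is just for illustration):

```python
import numpy

# A few sample lines in the UCD cell-section format: index, property, type, nodes.
sample = [
    "1 1 tet 620 583 1578 1792",
    "2 1 tet 656 551 553 566",
    "7 1 quad 296 1568 1535 1604",
]

cells = {'tet': [], 'hex': [], 'quad': []}
for line in sample:
    vals = line.split()
    index_property = vals[:2]   # index and property columns
    cell_type = vals[2]         # 'tet', 'hex', or 'quad'
    nodes = vals[3:]            # variable number of node ids
    cells[cell_type].append(index_property + nodes)

# Rows of the same type all have the same length, so each list
# converts cleanly to a rectangular integer array.
for cell_type, rows in cells.items():
    cells[cell_type] = numpy.array(rows, dtype=int)

print(cells['tet'].shape)   # (2, 6): two tet rows, 2 leading cols + 4 nodes
print(cells['quad'].shape)  # (1, 6)
```

Within a type every row has the same width, which is what makes the final
numpy.array() conversion possible.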
Also, you could see if:
    cells[type].append(numpy.array(index_property + nodes, dtype=int))
is faster than appending plain lists as above... it's worth testing.
If even that's too slow, maybe you'll need to do this in C? That
shouldn't be too hard, really.
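Before dropping to C, it might also be worth trying to keep the per-line
number parsing out of Python entirely: group the raw lines by type, then
parse each group with a single numpy.fromstring() call. This is just a
sketch I haven't benchmarked on files of your size:

```python
import numpy

# Sample lines standing in for the file's cell section.
sample = [
    "1 1 tet 620 583 1578 1792",
    "2 1 tet 656 551 553 566",
    "7 1 quad 296 1568 1535 1604",
]

# Group raw text by type; defer all numeric parsing until later.
groups = {}
for line in sample:
    idx, prop, cell_type, nodes = line.split(None, 3)
    groups.setdefault(cell_type, []).append(idx + ' ' + prop + ' ' + nodes)

cells = {}
for cell_type, rows in groups.items():
    # One big parse per type: fromstring with sep=' ' reads all the
    # numbers at C speed; reshape recovers the rows afterward.
    flat = numpy.fromstring(' '.join(rows), dtype=int, sep=' ')
    cells[cell_type] = flat.reshape(len(rows), -1)
```

The idea is that Python-level work is then proportional to the number of
lines, while the expensive string-to-int conversion happens in one numpy
call per type.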
Zach
On Aug 13, 2008, at 3:56 PM, Dan Lenski wrote:
> Hi all,
> I'm using NumPy to read and process data from ASCII UCD files. This is a
> file format for describing unstructured finite-element meshes.
>
> Most of the file consists of rectangular, numerical text matrices, easily
> and efficiently read with loadtxt(). But there is one particularly nasty
> section that consists of matrices with variable numbers of columns, like
> this:
>
> # index property type nodes
> 1 1 tet 620 583 1578 1792
> 2 1 tet 656 551 553 566
> 3 1 tet 1565 766 1600 1646
> 4 1 tet 1545 631 1566 1665
> 5 1 hex 1531 1512 1559 1647 1648 1732
> 6 1 hex 777 1536 1556 1599 1601 1701
> 7 1 quad 296 1568 1535 1604
> 8 1 quad 54 711 285 666
>
> As you might guess, the "type" label in the third column indicates
> the number of columns that follow.
>
> Some of my files contain sections like this of *more than 1 million
> lines*, so I need to be able to read them fast. I have not yet come up
> with a good way to do this. What I do right now is split them up into
> separate arrays based on the "type" label:
>
> lines = [f.next() for i in range(n)]
> lines = [l.split(None, 3) for l in lines]
> id, prop, types, nodes = apply(zip, lines) # THIS TAKES /FOREVER/
>
> id = array(id, dtype=uint)
> prop = array(prop, dtype=uint)
> types = array(types)
>
> cells = {}
> for t in N.unique(types):
>     these = N.nonzero(types==t)[0]
>     # THIS NEXT LINE TAKES FOREVER
>     these_nodes = array([nodes[ii].split() for ii in these], dtype=uint).T
>     cells[t] = N.row_stack(( id[these], prop[these], these_nodes ))
>
> This is really pretty slow and sub-optimal. Has anyone developed a more
> efficient way to read arrays with variable numbers of columns???
>
> Dan
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion