[Numpy-discussion] reading *big* inhomogeneous text matrices *fast*?

Zachary Pincus zachary.pincus at yale.edu
Wed Aug 13 16:57:32 EDT 2008


Hi Dan,

Your approach generates numerous large temporary arrays and lists. If  
the files are large, the slowdown could be because all that memory  
allocation is causing some VM thrashing. I've run into that at times  
parsing large text files.

Perhaps better would be to iterate through the file, building up your
cells dictionary incrementally. Then, once the file has been read in
fully, you can convert each per-type list to an array...

import numpy

f = open('big_file')
header = f.readline()
cells = {'tet': [], 'hex': [], 'quad': []}
for line in f:
   vals = line.split()
   index_property = vals[:2]   # first two columns: index and property
   cell_type = vals[2]         # third column: 'tet', 'hex', or 'quad'
   nodes = vals[3:]            # remaining columns: the node list
   cells[cell_type].append(index_property + nodes)
for cell_type, rows in cells.items():
   cells[cell_type] = numpy.array(rows, dtype=int)
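
For what it's worth, on just the eight-line sample you quoted below, I'd
expect the per-type arrays to come out like so (a sketch of the expected
shapes, not an actual run):

for cell_type, arr in cells.items():
   print cell_type, arr.shape
# expected: tet (4, 6), hex (2, 8), quad (2, 6)
# i.e. index, property, then the node columns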

I'm not sure if this is exactly what you want, but you get the idea...  
Anyhow, the above only uses about twice as much RAM as the size of the  
file. Your approach looks like it uses several times more than that.

Also, you could see whether:
   cells[cell_type].append(numpy.array(index_property + nodes, dtype=int))

is faster than appending plain lists as above... it's worth testing.
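
Something like this would do for a rough timing comparison (just a
sketch: 'big_file' stands in for your real path, and parse() is a toy
wrapper, not your actual code):

import time
import numpy

def parse(filename, convert_each_row):
   # Read the variable-column section, optionally converting each row
   # to an array as it is appended instead of all at once at the end.
   f = open(filename)
   f.readline()                        # skip the header line
   cells = {'tet': [], 'hex': [], 'quad': []}
   for line in f:
      vals = line.split()
      row = vals[:2] + vals[3:]        # index, property, nodes
      if convert_each_row:
         row = numpy.array(row, dtype=int)
      cells[vals[2]].append(row)
   return cells

for convert in (False, True):
   start = time.time()
   parse('big_file', convert)
   print convert, time.time() - start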

If even that's too slow, maybe you'll need to do this in C? That  
shouldn't be too hard, really.

Zach





On Aug 13, 2008, at 3:56 PM, Dan Lenski wrote:

> Hi all,
> I'm using NumPy to read and process data from ASCII UCD files.  This is a
> file format for describing unstructured finite-element meshes.
>
> Most of the file consists of rectangular, numerical text matrices, easily
> and efficiently read with loadtxt().  But there is one particularly nasty
> section that consists of matrices with variable numbers of columns, like
> this:
>
> # index property type nodes
> 1       1        tet  620 583 1578 1792
> 2       1        tet  656 551 553 566
> 3       1        tet  1565 766 1600 1646
> 4       1        tet  1545 631 1566 1665
> 5       1        hex  1531 1512 1559 1647 1648 1732
> 6       1        hex  777 1536 1556 1599 1601 1701
> 7       1        quad 296 1568 1535 1604
> 8       1        quad 54 711 285 666
>
> As you might guess, the "type" label in the third column does indicate
> the number of following columns.
>
> Some of my files contain sections like this of *more than 1 million
> lines*, so I need to be able to read them fast.  I have not yet come up
> with a good way to do this.  What I do right now is split them up into
> separate arrays based on the "type" label:
>
> lines = [f.next() for i in range(n)]
> lines = [l.split(None, 3) for l in lines]
> id, prop, types, nodes = apply(zip, lines) # THIS TAKES /FOREVER/
>
> id = array(id, dtype=uint)
> prop = array(prop, dtype=uint)
> types = array(types)
>
> cells = {}
> for t in N.unique(types):
>     these = N.nonzero(types == t)[0]
>     # THIS NEXT LINE TAKES FOREVER
>     these_nodes = array([nodes[ii].split() for ii in these], dtype=uint).T
>     cells[t] = N.row_stack(( id[these], prop[these], these_nodes ))
>
> This is really pretty slow and sub-optimal.  Has anyone developed a more
> efficient way to read arrays with variable numbers of columns???
>
> Dan



