Reading large multidimensional arrays off file
Fernando PĂ©rez
fperez528 at yahoo.com
Sat Apr 6 15:18:17 EST 2002
> The previous process takes ages to read the file in (of the
> order of an hour for one 13 Mb file), so I was wondering why I don't
> just read the whole array in, after all, I know that it is NxMx10
> bytes, and I know the order. My question is fairly basic: How do I
> specify the order of the data? I would like to read it in the order
> shown above, into an 3-dimensional array size (10,N,M), but I am not
> sure what sizes come first or second in Python.
See if this helps as a sample start:
    def read(self,fname):
        """Read in a topological density file.

        Does NOT do byte-reversal."""

        # MILC code:
        TOPO_VERSION = 66051
        # number of bytes to skip in file *from the beginning* before data starts
        offset = 108
        topo = open(fname)
        topo_version,nx,ny,nz,nt = struct.unpack('i'*5,topo.read(5*4))
        if topo_version != TOPO_VERSION:
            warn('Wrong topological density version number (or byte-reversal)!',3)
        topo.seek(offset)
        data_size = nx*ny*nz*nt*4  # these files use C float, not double
        topo_flat = fromstring(topo.read(data_size),Float32)
        topo_flat.shape = (nt,nz,ny,nx)
        # with() is a small helper defined elsewhere that stores these
        # values as attributes on self
        with(self,nx=nx,ny=ny,nz=nz,nt=nt,data=topo_flat)
        print 'Read',data_size/4,'data points. Shape:',(nx,ny,nz,nt)
It reads binary files representing 4-dimensional C float arrays. The key
calls are fromstring() (there's a from Numeric import * somewhere in the
code) and the .shape assignment afterwards: you read the data 'flat' first,
then assign the shape.
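To see why the shape tuple lists nt first and nx last, here's a tiny
self-contained sketch of the flat-read-then-reshape idea using only the
stdlib (no Numeric); the 2x3 "file" contents are made up for illustration:

```python
import struct

# Hypothetical tiny file: a 2x3 C float array written in C (row-major) order.
nt, nx = 2, 3
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
raw = struct.pack('f' * (nt * nx), *[v for row in values for v in row])

# Read it back 'flat' first, exactly as the reader above does...
flat = struct.unpack('f' * (nt * nx), raw)

# ...then index with the slowest-varying dimension first: flat[t*nx + x].
# C arrays store the LAST index contiguously, which is why the shape tuple
# must be (nt, nz, ny, nx), with nx last.
for t in range(nt):
    for x in range(nx):
        assert flat[t * nx + x] == values[t][x]
print('shape order (nt, ..., nx) recovers the original layout')
```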
My files aren't as big as yours (under a megabyte), but the read time is
negligible for me, so I'm sure it would take far less than an hour for a
13 MB file.
Here's another similar reader but for big text files (~16 MB) where I need to
scan for a tag to extract the data. The trick there is to have awk
(super-fast) do the first pass:
    def read(self,fname,mode_num):
        """Read in a text density file.

        Assumes nx=ny=nz (but not necessarily =nt)."""

        # read nx,nt first so we know the size of the data to load
        for line in file(fname):
            if line.startswith('nx'):
                nx = ny = nz = int(line.strip().split()[1])
            if line.startswith('nt'):
                nt = int(line.strip().split()[1])
                break
        cmd = "awk '/%sPG5P/ { print $6}' %s" % (mode_num,fname)
        tfile = os.popen(cmd)
        data_a_flat = array(map(float,tfile))
        data_a_flat.shape = (nt,nz,ny,nx)
        with(self,nx=nx,ny=ny,nz=nz,nt=nt,data_a=data_a_flat)
        data_size = nx*ny*nz*nt
        print 'Read',data_size,'data points. Shape:',(nx,ny,nz,nt)
This takes 2-3 seconds to scan a 16 MB text file and extract about 1 MB of
data from it. Let awk do the fast first-pass scanning, then map float() over
the lines it prints and presto, you have your array.
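The same scan-and-convert pattern can be sketched in pure Python (no awk
needed); the tag '5PG5P' and the column number here are hypothetical,
chosen only to mirror the awk pattern above:

```python
# Scan lines for a (hypothetical) tag, then map float() over the matching
# column -- the pure-Python equivalent of awk '/5PG5P/ { print $6 }'.
lines = [
    "header junk",
    "5PG5P a b c d 0.25",
    "noise line",
    "5PG5P a b c d -1.5",
]
# $6 in awk is the sixth whitespace-separated field, i.e. index 5 here.
data = [float(line.split()[5]) for line in lines if '5PG5P' in line]
print(data)
```

Shelling out to awk mainly wins when the file is huge and only a small
fraction of its lines match, since awk's scan loop is much faster than an
interpreted Python loop was at the time.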
Hope this helps to get you started.
Cheers,
f.