Reading large multidimensional arrays off file

Fernando Pérez fperez528 at yahoo.com
Sat Apr 6 15:18:17 EST 2002


 
>         The previous process takes ages to read the file in (of the
> order of an hour for one 13 Mb file), so I was wondering why I don't
> just read the whole array in, after all, I know that it is NxMx10
> bytes, and I know the order. My question is fairly basic: How do I
> specify the order of the data? I would like to read it in the order
> shown above, into an 3-dimensional array size (10,N,M), but I am not
> sure what sizes come first or second in Python.

See if this helps as a sample start:

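    # Assumed context for the method below: "import struct" and
    # "from Numeric import *" at module level, plus the author's warn() helper.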
    def read(self,fname):
        """Read in a topological density file.

        Does NOT do byte-reversal.
        """

        # MILC code:
        TOPO_VERSION = 66051
        # Number of bytes to skip in the file *from the beginning* before data starts
        offset = 108

        topo = open(fname,'rb')  # binary mode, to be safe across platforms
        topo_version,nx,ny,nz,nt = struct.unpack('i'*5,topo.read(5*4))
        if topo_version != TOPO_VERSION:
            warn('Wrong topological density version number (or byte-reversal)!',3)

        topo.seek(offset)
        data_size = nx*ny*nz*nt*4  # these files use C float, not double
        topo_flat = fromstring(topo.read(data_size),Float32)
        topo_flat.shape = (nt,nz,ny,nx)
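        # Note: with() here is the author's own helper (this predates Python's
        # 'with' statement); presumably it just stores the keyword arguments
        # as attributes on self.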
        with(self,nx=nx,ny=ny,nz=nz,nt=nt,data=topo_flat)
        print 'Read',data_size/4,'data points. Shape:',(nx,ny,nz,nt)

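The docstring notes that byte-reversal is NOT handled. If a file might come
from a machine with the opposite endianness, one way to check is to try the
header with both of struct's explicit byte-order prefixes. This is only a
sketch of that idea (detect_byte_order is a made-up name; TOPO_VERSION is the
constant from the code above):

    import struct

    def detect_byte_order(fname, TOPO_VERSION=66051):
        """Return the struct prefix ('<' or '>') for which the version
        field in the header matches TOPO_VERSION (sketch only)."""
        header = open(fname,'rb').read(4)
        for order in ('<', '>'):   # little-endian, big-endian
            if struct.unpack(order + 'i', header)[0] == TOPO_VERSION:
                return order
        # Neither byte order gives the expected version number.
        raise ValueError('not a topological density file?')

If the detected order isn't the native one, the Float32 data would need
byteswapping after the fromstring() call as well (Numeric arrays have a
byteswapped() method for that, if I remember correctly).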

It reads binary files which represent 4-dimensional C float arrays. The key
calls are fromstring() (there's a from Numeric import * somewhere in the
code) and the .shape assignment afterwards: you read the data 'flat' first,
then assign the shape you want.
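
About which size comes first: Numeric arrays use C ordering, so the *last*
index varies fastest in memory. The rule is to put the slowest-varying
dimension in the file first in the shape and the fastest-varying one last;
that's why the reshape above is (nt,nz,ny,nx), since in my files the x
coordinate varies fastest. A tiny sketch to convince yourself (the sizes are
made up):

    from Numeric import arange

    nt, nz, ny, nx = 2, 3, 4, 5
    a = arange(nt*nz*ny*nx)      # 0,1,2,... in the order they'd sit in a file
    a.shape = (nt, nz, ny, nx)   # last axis (x) varies fastest

    t, z, y, x = 1, 2, 3, 4
    # element [t,z,y,x] lives at flat offset ((t*nz + z)*ny + y)*nx + x
    print a[t, z, y, x] == ((t*nz + z)*ny + y)*nx + x   # prints 1 (true)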

My files aren't as big as yours (under a megabyte), but the read time is
negligible for me, so I'm sure it would take far less than an hour for a
13 MB file.

Here's a similar reader, but for big text files (~16 MB) where I need to
scan for a tag to extract the data. The trick there is to have awk (which is
super-fast) do the first pass:

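    # Assumes "import os" at module level, plus the same Numeric imports as above.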
    def read(self,fname,mode_num):
        """Read in a text density file.

        Assumes nx=ny=nz (but not necessarily =nt)."""

        # Read nx,nt first so we know the size of the data to load
        for line in file(fname):
            if line.startswith('nx'):
                nx = ny = nz = int(line.strip().split()[1])
            if line.startswith('nt'):
                nt = int(line.strip().split()[1])
                break

        cmd = "awk '/%sPG5P/ { print $6}' %s" % (mode_num,fname)
        tfile = os.popen(cmd)
        data_a_flat = array(map(float,tfile))
        data_a_flat.shape = (nt,nz,ny,nx)

        with(self,nx=nx,ny=ny,nz=nz,nt=nt,data_a=data_a_flat)
        data_size = nx*ny*nz*nt
        print 'Read',data_size,'data points. Shape:',(nx,ny,nz,nt)

This takes 2-3 seconds to scan a 16 MB text file and extract about 1 MB of
data from it. Let awk do the main scanning (it's very fast at that), then map
float() over the lines it returns and presto, you have your array.
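
If awk isn't available, here's roughly what that first pass does in pure
Python (slower, of course). scan_tag is just a name for illustration, and it
assumes the same whitespace-separated format with the value in the sixth
field:

    from Numeric import array

    def scan_tag(fname, mode_num):
        """Pure-Python stand-in for the awk pass: collect field 6 of every
        line containing the '<mode_num>PG5P' tag (sketch only)."""
        tag = '%sPG5P' % mode_num
        vals = []
        for line in file(fname):
            if line.find(tag) != -1:                 # like awk's /tag/ match
                vals.append(float(line.split()[5]))  # awk's $6
        return array(vals)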

Hope this helps to get you started.

Cheers,

f.


