How to speed up reading bytes from file?

Brad Clements bkc at Murkworks.com
Fri Dec 6 10:39:00 EST 2002


Hi,

I have a binary file that consists of variable length "elements". I need to
scan over the file, reading in elements, seeking over some elements, etc.

I do not want to load the entire file into memory.

My "unpack" algorithm consists of reading a "byte", then optionally 1 to 15
"chars", then some number of bytes or words (2-byte, network order).

My first attempt used a locally scoped function:

def __loadIndexLevel(fHandle):

    def getbyte():
        return ord(fHandle.read(1))

    x = getbyte()
    v = fHandle.read(x)
    # ... some more getbyte() calls ...

But I had to call subroutines, so I was passing the getbyte function and
fHandle as arguments. To reduce the argument count, I created a new class:


class binaryFile(file):
    """subclass file object"""
    def getbyte(self):
        return ord(self.read(1))

    def getword(self):
        """get 2 byte word in network order"""

        v = self.read(2)
        return ord(v[0]) * 256 + ord(v[1])
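For comparison with the ord() arithmetic above, the standard struct module can
decode a 2-byte network-order word in a single call. This is only a minimal
sketch, assuming a Python version where read() on a binary file returns bytes.
The helper names read_byte and read_word are hypothetical, and whether this is
actually faster than the ord() version would have to be measured.

```python
import io
import struct

# Precompiled formats: "!B" is one unsigned byte, "!H" is an unsigned
# 16-bit integer in network (big-endian) byte order.
_BYTE = struct.Struct("!B")
_WORD = struct.Struct("!H")


def read_byte(fh):
    """Read one unsigned byte from a binary file-like object."""
    data = fh.read(1)
    if len(data) != 1:
        raise EOFError("expected 1 byte")
    return _BYTE.unpack(data)[0]


def read_word(fh):
    """Read one 2-byte word in network order."""
    data = fh.read(2)
    if len(data) != 2:
        raise EOFError("expected 2 bytes")
    return _WORD.unpack(data)[0]


# Made-up sample data: a length byte, that many chars, then one word.
fh = io.BytesIO(b"\x03abc\x01\x02")
count = read_byte(fh)
name = fh.read(count)
word = read_word(fh)
```

The same two functions work on any object with a read() method, so they do not
depend on subclassing the file type.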

This works better but is still slow. I thought I might be getting hurt by
the ord() calls, so I created a class that uses an array to mirror the file
contents.

(see SeekableFileArray below)

SeekableFileArray turned out to be 50% slower than the binaryFile class.
Here are some stats:

Thu Dec  5 16:47:24 2002    /tmp/stat

         3738953 function calls (3600138 primitive calls) in 78.590 CPU seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1166802   12.940    0.000   14.510    0.000 TokenIndex.py:209(getbyte)
   527885   19.480    0.000   40.600    0.000 TokenIndex.py:115(fromBin)
   527885    2.800    0.000    2.800    0.000 TokenIndex.py:106(__init__)
   297067    4.440    0.000    7.070    0.000 TokenIndex.py:227(read)
   277632    3.480    0.000    3.480    0.000 TokenIndex.py:190(tell)
   245279    2.630    0.000    2.630    0.000 TokenIndex.py:194(getbytes)
   138817    1.570    0.000    1.570    0.000 TokenIndex.py:176(_loadBytes)
   138817    2.430    0.000    2.430    0.000 TokenIndex.py:236(seekback)
   138817   16.760    0.000   64.960    0.000 TokenIndex.py:244(__loadIndexLevel)
   138817    2.050    0.000    2.050    0.000 TokenIndex.py:185(seek)
 138816/1    9.790    0.000   78.410   78.410 TokenIndex.py:337(__ExtractTokenList)
     2307    0.050    0.000    0.050    0.000 TokenIndex.py:217(getword)
        2    0.000    0.000    0.000    0.000 posixpath.py:191(isfile)
        1    0.000    0.000   78.410   78.410 TokenIndex.py:711(extractAllTokens)
        1    0.000    0.000   78.420   78.420 TokenIndex.py:721(test)
        1    0.010    0.010   78.590   78.590 profile:0(test())
        1    0.160    0.160   78.580   78.580 <string>:1(?)
        1    0.000    0.000    0.000    0.000 stat.py:54(S_ISREG)
        1    0.000    0.000    0.000    0.000 stat.py:29(S_IFMT)
        1    0.000    0.000   78.410   78.410 TokenIndex.py:355(_extractAllTokens)
        1    0.000    0.000    0.000    0.000 TokenIndex.py:170(__init__)
        1    0.000    0.000    0.010    0.010 TokenIndex.py:612(__init__)
        1    0.000    0.000    0.010    0.010 TokenIndex.py:268(_getPayloadCount)
        0    0.000             0.000          profile:0(profiler)

getbyte() is from SeekableFileArray.

fromBin() is a method of a ReadIndexEntry instance.

There are 527885 ReadIndexEntry instances created in __loadIndexLevel.

A non-profiled run takes 31 seconds. If I use the binaryFile class, it takes
22 seconds.

I "unrolled" fromBin() and inlined it in __loadIndexLevel, which got the time
down to 14 seconds, but now I'm losing abstraction.

This is still too long (most folks won't have a 2.2 GHz Xeon machine).

I could write this in C, but before I go that route I thought I would ask
for suggestions.

Note that I want to continue to use Python-level "file" objects. I may try
the mmap module in the future, so I want to leave that option open.
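To illustrate what the mmap option might look like, here is a rough sketch of
a reader that exposes the same getbyte/getword/seek/tell calls over a
memory-mapped file. It assumes Python 3, where indexing an mmap object yields
an int. The class name MappedReader and the sample data are made up for the
example, and nothing here has been timed against the classes in this post.

```python
import mmap
import os
import tempfile


class MappedReader(object):
    """Hypothetical getbyte/getword reader over a read-only memory map."""

    def __init__(self, path):
        self._fh = open(path, "rb")
        # Length 0 maps the whole file; the file must not be empty.
        self._map = mmap.mmap(self._fh.fileno(), 0, access=mmap.ACCESS_READ)
        self.pos = 0

    def getbyte(self):
        value = self._map[self.pos]  # an int in Python 3
        self.pos += 1
        return value

    def getword(self):
        """Return a 2-byte word in network order."""
        data = self._map[self.pos:self.pos + 2]
        if len(data) != 2:
            raise EOFError("expected 2 bytes")
        self.pos += 2
        return (data[0] << 8) | data[1]

    def seek(self, offset):
        self.pos = offset

    def tell(self):
        return self.pos

    def close(self):
        self._map.close()
        self._fh.close()


# Write a small sample file, then read it back through the map.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"\x02hi\x01\x00")

reader = MappedReader(path)
first = reader.getbyte()
reader.seek(3)
word = reader.getword()
reader.close()
os.remove(path)
```

Seeking is just an assignment to an integer here, so skipping over elements
does not touch the file object at all.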

One more thing: I tried creating a generator for getbyte() in the
SeekableFileArray, but that was 2 seconds slower than the non-generator
version.

import array

class SeekableFileArray(object):
    """Combine file object with array handling"""
    def __init__(self,fHandle,size=512):
        self.fHandle = fHandle
        self.size = size
        print "size is ",size
        self.array = array.array('B')

    def _loadBytes(self):
        """Load some more bytes from file"""
        try:
            self.array.fromfile(self.fHandle,self.size)
        except EOFError:
            pass
        return self.array


    def seek(self,*args,**kw):
        """seek"""
        self.array = self.array[0:0]    # truncate array
        return self.fHandle.seek(*args,**kw)

    def tell(self,*args,**kw):
        """tell"""
        return self.fHandle.tell(*args,**kw) - len(self.array)

    def getbytes(self,l=1):
        """return bytes"""

        a = self.array
        if l > len(a):
            self._loadBytes()

        if l > len(a):
            raise EOFError()
        else:
            v = a[:l]
            self.array = a[l:]

        return v

    def getbyte(self):
        """return a byte"""
        try:
            return self.array.pop(0)
        except IndexError:
            self._loadBytes()
            return self.array.pop(0)


    def getword(self):
        a = self.array
        if len(a) < 2:
            self._loadBytes()

        b = a.pop(0)
        c = a.pop(0)
        return b * 256 + c


    def read(self,l=1):
        if l == 1:
            try:
                return chr(self.array.pop(0))
            except IndexError:
                self._loadBytes()
                return chr(self.array.pop(0))
        return self.getbytes(l).tostring()

    def seekback(self):
        """Seek backwards in the input file so that unconsumed array bytes
        become available in the input stream again."""
        v = len(self.array)
        if v > 0:
            self.array = self.array[0:0]    # truncate array
            self.fHandle.seek(-v,1)
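One possible contributor to the cost of the class above: array.pop(0) has to
shift every remaining element, and a[l:] copies the rest of the buffer, so each
read does work proportional to the buffer size. A different technique is to
keep the buffer untouched and advance an integer offset. The sketch below is
not the code from this post. It assumes Python 3 bytes semantics, and the name
ChunkReader is hypothetical.

```python
import io


class ChunkReader(object):
    """Hypothetical buffered reader that advances an offset into a chunk."""

    def __init__(self, fh, size=512):
        self.fh = fh
        self.size = size
        self.buf = b""
        self.pos = 0

    def _fill(self, need):
        """Make sure at least `need` unread bytes are buffered."""
        self.buf = self.buf[self.pos:]  # keep only the unread tail
        self.pos = 0
        while len(self.buf) < need:
            chunk = self.fh.read(max(self.size, need - len(self.buf)))
            if not chunk:
                raise EOFError("unexpected end of file")
            self.buf += chunk

    def getbyte(self):
        if self.pos >= len(self.buf):
            self._fill(1)
        value = self.buf[self.pos]  # an int in Python 3
        self.pos += 1
        return value

    def getword(self):
        """Return a 2-byte word in network order."""
        if len(self.buf) - self.pos < 2:
            self._fill(2)
        value = (self.buf[self.pos] << 8) | self.buf[self.pos + 1]
        self.pos += 2
        return value

    def read(self, n):
        if len(self.buf) - self.pos < n:
            self._fill(n)
        data = self.buf[self.pos:self.pos + n]
        self.pos += n
        return data

    def tell(self):
        # File position minus the bytes buffered but not yet consumed.
        return self.fh.tell() - (len(self.buf) - self.pos)

    def seek(self, offset):
        self.buf = b""
        self.pos = 0
        self.fh.seek(offset)


# Made-up sample data: a length byte, that many chars, a word, one spare byte.
r = ChunkReader(io.BytesIO(b"\x03abc\x01\x02\xff"), size=4)
count = r.getbyte()
name = r.read(count)
word = r.getword()
where = r.tell()
```

Because tell() accounts for unread buffered bytes, a separate seekback() step
is not needed before seeking elsewhere in this variant.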




