How to speed up reading bytes from file?

Sat Dec 7 11:06:35 EST 2002

Brad Clements wrote:
> Hi,
>
> I have a binary file that consists of variable length "elements". I need to
> scan over the file, reading in elements, seeking over some elements, etc.
>
> I do not want to load the entire file into memory.
>
> My "unpack" algorithm consists of reading a "byte", then optionally 1 to 15
> "chars", then some number of bytes of words (2-byte network order)
>
> my first attempt used a local scope function:
>
> def __loadIndexLevel(fHandle):
>
>     def getbyte():
>           return ord(fHandle.read(1))
>
>
>     x = getbyte()
>     v = fHandle.read(x)
>     ... some more getbyte() ..
>
> But, I had to call subroutines, so was passing the getbyte function and the
> fHandle as args, to reduce the arg count I created a new class:
>
>
> class   binaryFile(file):
>     """subclass file object"""
>     def getbyte(self):
>         return ord(self.read(1))
>
>     def getword(self):
>         """get 2 byte word in network order"""
>
>         v = self.read(2)
>         return ord(v[0]) * 256 + ord(v[1])

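getword looks slow: it calls ord() twice and does the arithmetic in Python.
You could use the struct module for that. Try something like
struct.unpack("!H", self.read(2))[0] (not tested); unpack always returns a
tuple, hence the [0]. getbyte could be better as well. Would it help to
write it like this (also untested):

def getbyte(self, n=1):
    # "B" yields unsigned byte values as integers, so no ord() is needed
    return struct.unpack("%dB" % n, self.read(n))

This returns a tuple of n integers. It will be slower for n == 1 but could
help for bigger values of n.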
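
To make the struct suggestion concrete, here is a minimal sketch. The
BinaryReader class and the sample data are made up for illustration and are
not from your code; it only assumes a file-like object opened in binary mode.

```python
import io
import struct


class BinaryReader(object):
    """Hypothetical wrapper around any file-like object in binary mode."""

    def __init__(self, fh):
        self.fh = fh

    def getbyte(self):
        # "B" = one unsigned byte, returned as an integer
        return struct.unpack("B", self.fh.read(1))[0]

    def getbytes(self, n):
        # n unsigned bytes decoded in a single call, as a tuple of integers
        return struct.unpack("%dB" % n, self.fh.read(n))

    def getword(self):
        # "!H" = unsigned 16-bit integer in network (big-endian) order
        return struct.unpack("!H", self.fh.read(2))[0]


# Sample element: a length byte, that many chars, then one 2-byte word
reader = BinaryReader(io.BytesIO(b"\x03abc\x01\x02"))
length = reader.getbyte()
chars = reader.fh.read(length)
word = reader.getword()
```

Whether this beats the ord() version for single bytes is something you would
have to measure; the gain, if any, is more likely when several values are
decoded by one unpack call.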

> This works better but is still slow, so I thought maybe I was being hurt by
> the ord() calls, so I created a class that uses an Array to mirror the file
> contents.
>
> (see SeekableFileArray below)
>
> SeekableFileArray turned out to be 50% slower than the binaryFile class,
> here's some stats.
>
> Thu Dec  5 16:47:24 2002    /tmp/stat
>
>          3738953 function calls (3600138 primitive calls) in 78.590 CPU seconds
>
>    Ordered by: call count
>
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>   1166802   12.940    0.000   14.510    0.000 TokenIndex.py:209(getbyte)
>    527885   19.480    0.000   40.600    0.000 TokenIndex.py:115(fromBin)
>    527885    2.800    0.000    2.800    0.000 TokenIndex.py:106(__init__)
>    297067    4.440    0.000    7.070    0.000 TokenIndex.py:227(read)
>    277632    3.480    0.000    3.480    0.000 TokenIndex.py:190(tell)
>    245279    2.630    0.000    2.630    0.000 TokenIndex.py:194(getbytes)
>    138817    1.570    0.000    1.570    0.000 TokenIndex.py:176(_loadBytes)
>    138817    2.430    0.000    2.430    0.000 TokenIndex.py:236(seekback)
>    138817   16.760    0.000   64.960    0.000 TokenIndex.py:244(__loadIndexLevel)
>    138817    2.050    0.000    2.050    0.000 TokenIndex.py:185(seek)
>  138816/1    9.790    0.000   78.410   78.410 TokenIndex.py:337(__ExtractTokenList)
>      2307    0.050    0.000    0.050    0.000 TokenIndex.py:217(getword)
>         2    0.000    0.000    0.000    0.000 posixpath.py:191(isfile)
>         1    0.000    0.000   78.410   78.410 TokenIndex.py:711(extractAllTokens)
>         1    0.000    0.000   78.420   78.420 TokenIndex.py:721(test)
>         1    0.010    0.010   78.590   78.590 profile:0(test())
>         1    0.160    0.160   78.580   78.580 <string>:1(?)
>         1    0.000    0.000    0.000    0.000 stat.py:54(S_ISREG)
>         1    0.000    0.000    0.000    0.000 stat.py:29(S_IFMT)
>         1    0.000    0.000   78.410   78.410 TokenIndex.py:355(_extractAllTokens)
>         1    0.000    0.000    0.000    0.000 TokenIndex.py:170(__init__)
>         1    0.000    0.000    0.010    0.010 TokenIndex.py:612(__init__)
>         1    0.000    0.000    0.010    0.010 TokenIndex.py:268(_getPayloadCount)
>         0    0.000             0.000          profile:0(profiler)
>
> getbyte() is from  SeekableFileArray,
>
> fromBin() is a method of a ReadIndexEntry instance.
>
> There are 527885 ReadIndexEntry's created in _loadIndexLevel.
>
> Non profiled run takes 31 seconds. If I use the binaryFile class, it takes
> 22 seconds.
>
> I "unrolled" fromBin() and inlined it in _loadIndexLevel, but now I'm losing
> abstraction. However I got the time down to 14 seconds.
>
> This is still too long (most folks won't have a XEON 2.2 GHZ machine)
>
> I could write this in C, but before I go that route I'd thought I would ask
> for suggestions...
>
> Note that I want to continue to use Python level "file" objects, I may try
> using the mmap module in the future, so want to leave that option open.
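
If you do try mmap later, the access pattern might look roughly like the
sketch below. It is an assumption-laden illustration, not a drop-in
replacement: the temporary file and sample bytes are invented, and it relies
on struct.unpack_from, which only exists in newer Pythons (2.5 and later).

```python
import mmap
import os
import struct
import tempfile

# Write a tiny sample file (hypothetical data): length byte, chars, one word
fd, path = tempfile.mkstemp()
os.write(fd, b"\x03abc\x01\x02")
os.close(fd)

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        pos = 0
        # Decode straight out of the mapping at a given offset, no read() call
        length = struct.unpack_from("B", m, pos)[0]
        pos += 1
        chars = m[pos:pos + length]
        pos += length
        word = struct.unpack_from("!H", m, pos)[0]
        pos += 2
    finally:
        m.close()

os.remove(path)
```

Here "seeking" is just arithmetic on pos, and the operating system decides
which pages of the file are actually in memory.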
>
> One more thing, I tried creating a generator for getbyte() in the
> SeekableFileArray, but that was 2 seconds slower than the non generator
> version.
>
> class SeekableFileArray(object):
>     """Combine file object with array handling"""
>     def __init__(self,fHandle,size=512):
>         self.fHandle = fHandle
>         self.size = size
>         print "size is ",size
>         self.array = array.array('B')
>
>     def _loadBytes(self):
>         """Load some more bytes from file"""
>         try:
>             self.array.fromfile(self.fHandle,self.size)
>         except EOFError:
>             pass
>         return self.array
>
>
>     def seek(self,*args,**kw):
>         """seek"""
>         self.array = self.array[0:0]    # truncate array
>         return self.fHandle.seek(*args,**kw)
>
>     def tell(self,*args,**kw):
>         """tell"""
>         return self.fHandle.tell(*args,**kw) - len(self.array)
>
>     def getbytes(self,l=1):
>         """return bytes"""
>
>         a = self.array
>         if  l > len(a):
>             self._loadBytes()
>
>         if l > len(a):
>             raise EOFError()
>         else:
>             v = a[:l]
>             self.array = a[l:]
>
>         return v
>
>     def getbyte(self):
>         """return a byte"""
>         try:
>             return self.array.pop(0)
>         except IndexError:
>             self._loadBytes()
>             return self.array.pop(0)
>
>
>     def getword(self):
>         a = self.array
>         if len(a) < 2:
>             self._loadBytes()
>
>         b = a.pop(0)
>         c = a.pop(0)
>         return  b * 256 + c
>
>
>     def read(self,l=1):
>         if l == 1:
>             try:
>                 return chr(self.array.pop(0))
>             except IndexError:
>                 self._loadBytes()
>                 return chr(self.array.pop(0))
>         return self.getbytes(l).tostring()
>
>     def seekback(self):
>         """seek backwards in inputfile so that unconsumed array bytes become
>         available in input stream"""
>         v = len(self.array)
>         if v > 0:
>             self.array = self.array[0:0]    # truncate array
>             self.fHandle.seek(-v,1)
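
One possible reason SeekableFileArray is slower: array.pop(0) has to shift
every remaining element down by one, and a[l:] copies the rest of the array,
so each read does work proportional to the buffer size. A sketch of an
alternative that keeps the chunk untouched and only advances an offset
follows. The BufferedReader name and its details are my own invention
(untested against your data), and it uses struct.unpack_from, which needs a
newer Python (2.5 and later).

```python
import io
import struct


class BufferedReader(object):
    """Hypothetical reader: one buffered chunk plus a read offset."""

    def __init__(self, fh, size=512):
        self.fh = fh
        self.size = size
        self.buf = b""
        self.pos = 0

    def _fill(self, n):
        # Make sure at least n unread bytes are buffered, else raise EOFError
        while len(self.buf) - self.pos < n:
            chunk = self.fh.read(max(self.size, n))
            if not chunk:
                raise EOFError()
            # Keep only the unread tail, then append the new chunk
            self.buf = self.buf[self.pos:] + chunk
            self.pos = 0

    def read(self, n=1):
        self._fill(n)
        start = self.pos
        self.pos = start + n
        return self.buf[start:self.pos]

    def getbyte(self):
        self._fill(1)
        v = struct.unpack_from("B", self.buf, self.pos)[0]
        self.pos += 1
        return v

    def getword(self):
        # 2-byte unsigned integer in network order
        self._fill(2)
        v = struct.unpack_from("!H", self.buf, self.pos)[0]
        self.pos += 2
        return v

    def tell(self):
        # File position minus the bytes buffered but not yet consumed
        return self.fh.tell() - (len(self.buf) - self.pos)

    def seek(self, offset):
        # Absolute seek only; drop whatever is buffered
        self.buf = b""
        self.pos = 0
        return self.fh.seek(offset)


r = BufferedReader(io.BytesIO(b"\x03abc\x01\x02\xff"), size=4)
first = r.getbyte()
name = r.read(3)
word = r.getword()
where = r.tell()
```

With this layout a separate seekback() should not be needed, since tell()
already accounts for the unconsumed bytes and seek() discards them.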