How to speed up reading bytes from file?
Leazen
leazen at uol.com.ar
Sat Dec 7 11:06:35 EST 2002
Brad Clements wrote:
> Hi,
>
> I have a binary file that consists of variable length "elements". I need to
> scan over the file, reading in elements, seeking over some elements, etc.
>
> I do not want to load the entire file into memory.
>
> My "unpack" algorithm consists of reading a "byte", then optionally 1 to 15
> "chars", then some number of bytes of words (2-byte network order)
>
> my first attempt used a local scope function:
>
> def __loadIndexLevel(fHandle):
>
>     def getbyte():
>         return ord(fHandle.read(1))
>
>
>     x = getbyte()
>     v = fHandle.read(x)
>     ... some more getbyte() ..
>
> But I had to call subroutines, so I was passing the getbyte function and the
> fHandle as args. To reduce the arg count I created a new class:
>
>
> class binaryFile(file):
>     """subclass file object"""
>     def getbyte(self):
>         return ord(self.read(1))
>
>     def getword(self):
>         """get 2 byte word in network order"""
>
>         v = self.read(2)
>         return ord(v[0]) * 256 + ord(v[1])
getword seems to be slow. You could use the struct module for that.
Try something like struct.unpack("!H", self.read(2))[0] (not tested).
unpack always returns a tuple, hence the [0].
getbyte could be better as well. Would it help to make it like this (also
untested):
def getbyte(self, n = 1):
    return struct.unpack("%dB" % n, self.read(n))
This returns a tuple of n integers (the "B" format gives the same values as
ord(), while "c" would give one-character strings). It will be slower for
n == 1 but could help for a bigger value.
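To make the struct idea concrete, here is a minimal sketch. It is written for a current Python rather than the 2002 interpreter, and the helper names get_byte, get_word and get_bytes are made up for illustration. They take any file-like object opened in binary mode, so they are not tied to the binaryFile class.

```python
import io
import struct

# Precompiled format: one unsigned 16-bit integer in network (big-endian) order.
# struct.Struct is an assumption about the Python version (2.5 or later).
_WORD = struct.Struct("!H")


def _read_exact(f, n):
    """Read exactly n bytes from f or raise EOFError."""
    data = f.read(n)
    if len(data) != n:
        raise EOFError("unexpected end of file")
    return data


def get_byte(f):
    """Return the next byte as an integer in the range 0-255."""
    return struct.unpack("B", _read_exact(f, 1))[0]


def get_word(f):
    """Return the next 2-byte word, network order, as an integer."""
    return _WORD.unpack(_read_exact(f, 2))[0]


def get_bytes(f, n):
    """Return the next n bytes as a tuple of integers, using one read call."""
    return struct.unpack("%dB" % n, _read_exact(f, n))


if __name__ == "__main__":
    # Hypothetical element: a length byte, that many chars, then one word.
    f = io.BytesIO(b"\x03abc\x01\x02")
    length = get_byte(f)
    chars = f.read(length)
    word = get_word(f)
    print(length, chars, word)
```

Whether this beats ord() for single bytes would have to be measured. The likely gain is in get_bytes and get_word, where one read and one unpack replace several calls.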
> This works better but is still slow, so I thought maybe I was being hurt by
> the ord() calls, so I created a class that uses an Array to mirror the file
> contents.
>
> (see SeekableFileArray below)
>
> SeekableFileArray turned out to be 50% slower than the binaryFile class,
> here's some stats.
>
> Thu Dec 5 16:47:24 2002 /tmp/stat
>
> 3738953 function calls (3600138 primitive calls) in 78.590 CPU seconds
>
> Ordered by: call count
>
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>   1166802   12.940    0.000   14.510    0.000 TokenIndex.py:209(getbyte)
>    527885   19.480    0.000   40.600    0.000 TokenIndex.py:115(fromBin)
>    527885    2.800    0.000    2.800    0.000 TokenIndex.py:106(__init__)
>    297067    4.440    0.000    7.070    0.000 TokenIndex.py:227(read)
>    277632    3.480    0.000    3.480    0.000 TokenIndex.py:190(tell)
>    245279    2.630    0.000    2.630    0.000 TokenIndex.py:194(getbytes)
>    138817    1.570    0.000    1.570    0.000 TokenIndex.py:176(_loadBytes)
>    138817    2.430    0.000    2.430    0.000 TokenIndex.py:236(seekback)
>    138817   16.760    0.000   64.960    0.000 TokenIndex.py:244(__loadIndexLevel)
>    138817    2.050    0.000    2.050    0.000 TokenIndex.py:185(seek)
>  138816/1    9.790    0.000   78.410   78.410 TokenIndex.py:337(__ExtractTokenList)
>      2307    0.050    0.000    0.050    0.000 TokenIndex.py:217(getword)
>         2    0.000    0.000    0.000    0.000 posixpath.py:191(isfile)
>         1    0.000    0.000   78.410   78.410 TokenIndex.py:711(extractAllTokens)
>         1    0.000    0.000   78.420   78.420 TokenIndex.py:721(test)
>         1    0.010    0.010   78.590   78.590 profile:0(test())
>         1    0.160    0.160   78.580   78.580 <string>:1(?)
>         1    0.000    0.000    0.000    0.000 stat.py:54(S_ISREG)
>         1    0.000    0.000    0.000    0.000 stat.py:29(S_IFMT)
>         1    0.000    0.000   78.410   78.410 TokenIndex.py:355(_extractAllTokens)
>         1    0.000    0.000    0.000    0.000 TokenIndex.py:170(__init__)
>         1    0.000    0.000    0.010    0.010 TokenIndex.py:612(__init__)
>         1    0.000    0.000    0.010    0.010 TokenIndex.py:268(_getPayloadCount)
>         0    0.000             0.000          profile:0(profiler)
>
> getbyte() is from SeekableFileArray,
>
> fromBin() is a method of a ReadIndexEntry instance.
>
> There are 527885 ReadIndexEntry's created in _loadIndexLevel.
>
> Non profiled run takes 31 seconds. If I use the binaryFile class, it takes
> 22 seconds.
>
> I "unrolled" fromBin() and inlined it in _loadIndexLevel, but now I'm losing
> abstraction. However I got the time down to 14 seconds.
>
> This is still too long (most folks won't have a XEON 2.2 GHZ machine)
>
> I could write this in C, but before I go that route I'd thought I would ask
> for suggestions...
>
> Note that I want to continue to use Python level "file" objects, I may try
> using the mmap module in the future, so want to leave that option open.
>
> One more thing, I tried creating a generator for getbyte() in the
> SeekableFileArray, but that was 2 seconds slower than the non generator
> version.
>
> class SeekableFileArray(object):
>     """Combine file object with array handling"""
>     def __init__(self,fHandle,size=512):
>         self.fHandle = fHandle
>         self.size = size
>         print "size is ",size
>         self.array = array.array('B')
>
>     def _loadBytes(self):
>         """Load some more bytes from file"""
>         try:
>             self.array.fromfile(self.fHandle,self.size)
>         except EOFError:
>             pass
>         return self.array
>
>
>     def seek(self,*args,**kw):
>         """seek"""
>         self.array = self.array[0:0] # truncate array
>         return self.fHandle.seek(*args,**kw)
>
>     def tell(self,*args,**kw):
>         """tell"""
>         return self.fHandle.tell(*args,**kw) - len(self.array)
>
>     def getbytes(self,l=1):
>         """return bytes"""
>
>         a = self.array
>         if l > len(a):
>             self._loadBytes()
>
>         if l > len(a):
>             raise EOFError()
>         else:
>             v = a[:l]
>             self.array = a[l:]
>
>         return v
>
>     def getbyte(self):
>         """return a byte"""
>         try:
>             return self.array.pop(0)
>         except IndexError:
>             self._loadBytes()
>             return self.array.pop(0)
>
>
>     def getword(self):
>         a = self.array
>         if len(a) < 2:
>             self._loadBytes()
>
>         b = a.pop(0)
>         c = a.pop(0)
>         return b * 256 + c
>
>
>     def read(self,l=1):
>         if l == 1:
>             try:
>                 return chr(self.array.pop(0))
>             except IndexError:
>                 self._loadBytes()
>                 return chr(self.array.pop(0))
>         return self.getbytes(l).tostring()
>
>     def seekback(self):
>         """seek backwards in inputfile so that unconsumed array bytes
>         become available in input stream"""
>         v = len(self.array)
>         if v > 0:
>             self.array = self.array[0:0] # truncate array
>             self.fHandle.seek(-v,1)
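One possible reason the array version is slow: array.pop(0) has to shift every remaining item down, and getbytes() copies the rest of the array with a[l:], so each call costs time proportional to the buffer size. Below is a minimal sketch of an alternative that keeps a read position in the buffer instead of removing items. It is untested against the real index file, it targets a current Python (bytearray did not exist in 2002), and ChunkReader is a made-up name, not part of the code above.

```python
import io


class ChunkReader(object):
    """Hypothetical buffered reader: a chunk of bytes plus a read position."""

    def __init__(self, fhandle, size=512):
        self.fhandle = fhandle
        self.size = size
        self.buf = bytearray()
        self.pos = 0  # index of the next unread byte in self.buf

    def _fill(self, need):
        """Make sure at least `need` unread bytes are buffered."""
        while len(self.buf) - self.pos < need:
            chunk = self.fhandle.read(max(self.size, need))
            if not chunk:
                raise EOFError("unexpected end of file")
            # Consumed bytes are dropped once per refill, not once per byte.
            self.buf = self.buf[self.pos:] + chunk
            self.pos = 0

    def getbyte(self):
        """Return the next byte as an integer."""
        if self.pos >= len(self.buf):
            self._fill(1)
        value = self.buf[self.pos]
        self.pos += 1
        return value

    def getword(self):
        """Return the next 2-byte word in network order."""
        self._fill(2)
        hi, lo = self.buf[self.pos], self.buf[self.pos + 1]
        self.pos += 2
        return hi * 256 + lo

    def read(self, n=1):
        """Return the next n bytes as a bytes object."""
        self._fill(n)
        data = bytes(self.buf[self.pos:self.pos + n])
        self.pos += n
        return data

    def tell(self):
        """Logical position: file position minus the unread buffered bytes."""
        return self.fhandle.tell() - (len(self.buf) - self.pos)

    def seek(self, offset, whence=0):
        """Seek the underlying file and discard the buffer."""
        if whence == 1:
            # Relative seeks are measured from the logical position.
            offset, whence = self.tell() + offset, 0
        self.buf = bytearray()
        self.pos = 0
        return self.fhandle.seek(offset, whence)


if __name__ == "__main__":
    r = ChunkReader(io.BytesIO(bytes(bytearray(range(10)))), size=4)
    print(r.getbyte(), r.getword(), r.read(3), r.tell())
```

Because seek() with whence=1 is relative to the logical position, a separate seekback() should not be needed in this sketch. Whether it actually runs faster than binaryFile is something only a profile run on the real data can tell.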