How to speed up reading bytes from file?
Brad Clements
bkc at Murkworks.com
Fri Dec 6 10:39:00 EST 2002
Hi,
I have a binary file that consists of variable length "elements". I need to
scan over the file, reading in elements, seeking over some elements, etc.
I do not want to load the entire file into memory.
My "unpack" algorithm consists of reading a "byte", then optionally 1 to 15
"chars", then some number of bytes or words (2-byte, network order).
My first attempt used a local-scope function:

    def __loadIndexLevel(fHandle):
        def getbyte():
            return ord(fHandle.read(1))
        x = getbyte()
        v = fHandle.read(x)
        ... some more getbyte() ..

But I had to call subroutines, so I was passing both the getbyte function and
the fHandle as args. To reduce the arg count I created a new class:

    class binaryFile(file):
        """subclass file object"""
        def getbyte(self):
            return ord(self.read(1))
        def getword(self):
            """get 2 byte word in network order"""
            v = self.read(2)
            return ord(v[0]) * 256 + ord(v[1])

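For comparison, the same two reads can also be written with the struct module, which decodes the bytes without per-byte ord() calls. This is only a sketch in present-day Python 3, using free functions and an in-memory io.BytesIO stand-in for the real file (both are assumptions for illustration); whether it is any faster has not been measured here.

```python
import io
import struct

def getbyte(f):
    """Return the next byte of f as an int."""
    return struct.unpack("B", f.read(1))[0]

def getword(f):
    """Return the next 2-byte word of f, decoded in network order."""
    return struct.unpack("!H", f.read(2))[0]

# In-memory stand-in for the real binary file (hypothetical contents).
f = io.BytesIO(b"\x03abc\x02\x01")
n = getbyte(f)        # length byte
chars = f.read(n)     # that many "chars"
word = getword(f)     # bytes 0x02 0x01 read as one network-order word
```
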
The binaryFile class works better but is still slow. I thought maybe I was
being hurt by the ord() calls, so I created a class that uses an array.array
to mirror the file contents (see SeekableFileArray below).
SeekableFileArray turned out to be 50% slower than the binaryFile class.
Here are some stats:

Thu Dec  5 16:47:24 2002    /tmp/stat

         3738953 function calls (3600138 primitive calls) in 78.590 CPU seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1166802   12.940    0.000   14.510    0.000 TokenIndex.py:209(getbyte)
   527885   19.480    0.000   40.600    0.000 TokenIndex.py:115(fromBin)
   527885    2.800    0.000    2.800    0.000 TokenIndex.py:106(__init__)
   297067    4.440    0.000    7.070    0.000 TokenIndex.py:227(read)
   277632    3.480    0.000    3.480    0.000 TokenIndex.py:190(tell)
   245279    2.630    0.000    2.630    0.000 TokenIndex.py:194(getbytes)
   138817    1.570    0.000    1.570    0.000 TokenIndex.py:176(_loadBytes)
   138817    2.430    0.000    2.430    0.000 TokenIndex.py:236(seekback)
   138817   16.760    0.000   64.960    0.000 TokenIndex.py:244(__loadIndexLevel)
   138817    2.050    0.000    2.050    0.000 TokenIndex.py:185(seek)
 138816/1    9.790    0.000   78.410   78.410 TokenIndex.py:337(__ExtractTokenList)
     2307    0.050    0.000    0.050    0.000 TokenIndex.py:217(getword)
        2    0.000    0.000    0.000    0.000 posixpath.py:191(isfile)
        1    0.000    0.000   78.410   78.410 TokenIndex.py:711(extractAllTokens)
        1    0.000    0.000   78.420   78.420 TokenIndex.py:721(test)
        1    0.010    0.010   78.590   78.590 profile:0(test())
        1    0.160    0.160   78.580   78.580 <string>:1(?)
        1    0.000    0.000    0.000    0.000 stat.py:54(S_ISREG)
        1    0.000    0.000    0.000    0.000 stat.py:29(S_IFMT)
        1    0.000    0.000   78.410   78.410 TokenIndex.py:355(_extractAllTokens)
        1    0.000    0.000    0.000    0.000 TokenIndex.py:170(__init__)
        1    0.000    0.000    0.010    0.010 TokenIndex.py:612(__init__)
        1    0.000    0.000    0.010    0.010 TokenIndex.py:268(_getPayloadCount)
        0    0.000             0.000          profile:0(profiler)

getbyte() is from SeekableFileArray,
fromBin() is a method of a ReadIndexEntry instance.
There are 527885 ReadIndexEntry instances created in __loadIndexLevel.
A non-profiled run takes 31 seconds. If I use the binaryFile class, it takes
22 seconds.
I "unrolled" fromBin() and inlined it in __loadIndexLevel, but now I'm losing
abstraction. However, I got the time down to 14 seconds.
This is still too long (most folks won't have a 2.2 GHz Xeon machine).
I could write this in C, but before I go that route I thought I would ask
for suggestions...
Note that I want to continue to use Python-level "file" objects. I may try
using the mmap module in the future, so I want to leave that option open.
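
As a rough idea of what the mmap route might look like, the sketch below maps a file and decodes a network-order word at a given offset with struct.unpack_from, letting the operating system page data in on demand. It is written for present-day Python 3; the helper name word_at and the throwaway temp file are assumptions made up for illustration, and nothing here has been timed against the real data.

```python
import mmap
import os
import struct
import tempfile

def word_at(buf, offset):
    """Return the 2-byte network-order word stored at offset in buf."""
    return struct.unpack_from("!H", buf, offset)[0]

# Write a small throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(b"\x05\x01\x02\xff")
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            first_byte = m[0]      # indexing a mmap yields an int in Python 3
            word = word_at(m, 1)   # decodes bytes 0x01 0x02
finally:
    os.remove(path)
```

Because the mapped object supports slicing and offsets directly, "seeking over" an element becomes plain index arithmetic rather than a seek() call.
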
One more thing: I tried creating a generator for getbyte() in
SeekableFileArray, but that was 2 seconds slower than the non-generator
version.

    import array

    class SeekableFileArray(object):
        """Combine file object with array handling"""
        def __init__(self,fHandle,size=512):
            self.fHandle = fHandle
            self.size = size
            print "size is ",size
            self.array = array.array('B')

        def _loadBytes(self):
            """Load some more bytes from file"""
            try:
                self.array.fromfile(self.fHandle,self.size)
            except EOFError:
                pass
            return self.array

        def seek(self,*args,**kw):
            """seek"""
            self.array = self.array[0:0]    # truncate array
            return self.fHandle.seek(*args,**kw)

        def tell(self,*args,**kw):
            """tell"""
            return self.fHandle.tell(*args,**kw) - len(self.array)

        def getbytes(self,l=1):
            """return bytes"""
            a = self.array
            if l > len(a):
                self._loadBytes()
            if l > len(a):
                raise EOFError()
            else:
                v = a[:l]
                self.array = a[l:]
                return v

        def getbyte(self):
            """return a byte"""
            try:
                return self.array.pop(0)
            except IndexError:
                self._loadBytes()
                return self.array.pop(0)

        def getword(self):
            a = self.array
            if len(a) < 2:
                self._loadBytes()
            b = a.pop(0)
            c = a.pop(0)
            return b * 256 + c

        def read(self,l=1):
            if l == 1:
                try:
                    return chr(self.array.pop(0))
                except IndexError:
                    self._loadBytes()
                    return chr(self.array.pop(0))
            return self.getbytes(l).tostring()

        def seekback(self):
            """seek backwards in inputfile so that unconsumed array bytes become
            available in input stream"""
            v = len(self.array)
            if v > 0:
                self.array = self.array[0:0]    # truncate array
                self.fHandle.seek(-v,1)
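
One possible contributor to the slowdown in the class above: array.pop(0) has to shift every remaining element, and slices such as a[l:] copy the rest of the buffer. The sketch below is one alternative, not a measured fix: it keeps a read offset into a bytes buffer instead of removing consumed bytes. It is written for present-day Python 3, and the class name OffsetReader and the buffer size are made up for illustration.

```python
import io

class OffsetReader:
    """Buffered byte reader that advances an index instead of popping bytes."""

    def __init__(self, f, size=512):
        self.f = f
        self.size = size
        self.buf = b""
        self.pos = 0              # index of the next unread byte in buf

    def _fill(self, need):
        """Ensure at least `need` unread bytes are buffered, or raise EOFError."""
        while len(self.buf) - self.pos < need:
            chunk = self.f.read(max(self.size, need))
            if not chunk:
                raise EOFError()
            # Drop the consumed prefix only when refilling.
            self.buf = self.buf[self.pos:] + chunk
            self.pos = 0

    def getbyte(self):
        if self.pos >= len(self.buf):
            self._fill(1)
        b = self.buf[self.pos]    # indexing bytes yields an int in Python 3
        self.pos += 1
        return b

    def getword(self):
        """Return a 2-byte word in network (big-endian) order."""
        self._fill(2)
        hi, lo = self.buf[self.pos], self.buf[self.pos + 1]
        self.pos += 2
        return hi * 256 + lo

    def tell(self):
        """File position of the next unread byte."""
        return self.f.tell() - (len(self.buf) - self.pos)

# Tiny in-memory example with a deliberately small buffer to force a refill.
reader = OffsetReader(io.BytesIO(b"\x07\x01\x02\x03"), size=2)
values = (reader.getbyte(), reader.getword(), reader.tell())
```
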