[Tutor] Read-ahead for large fixed-width binary files?

Kent Johnson kent37 at tds.net
Sun Nov 18 05:20:09 CET 2007


I would wrap the record buffering into a generator function and probably 
use plain slicing to return the individual records instead of StringIO. 
I have a writeup on generators here:

http://personalpages.tds.net/~kent37/kk/00004.html
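
Untested, but something along these lines - read_records is just a name 
I made up for illustration:

def read_records(path, recLen, bufRecs=4096):
    # Read the file in large chunks and slice fixed-width records out
    # of each chunk.  Skips the header record; a trailing partial
    # record is silently dropped.
    with open(path, 'rb') as inFile:
        inFile.read(recLen)                  # discard the header record
        while True:
            chunk = inFile.read(recLen * bufRecs)
            nRecs = len(chunk) // recLen     # whole records in this chunk
            if nRecs == 0:
                break
            for i in xrange(nRecs):
                yield chunk[i * recLen:(i + 1) * recLen]

The loading loop then collapses to 'for inRec in read_records(fileName, 
recLen): ...' and the slicing replaces the StringIO object entirely.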

Kent

Marc Tompkins wrote:
> Alan Gauld wrote:
> 
>     "Marc Tompkins" <marc.tompkins at gmail.com
>     <mailto:marc.tompkins at gmail.com>> wrote
>      > realized I can implement this myself, using 'read(bigsize)' -
>      > currently I'm using 'read(recordsize)'; I just need to add an extra
>      > loop around my record reads.  Please disregard...
>     If you just want to navigate to a specific record then it might be
>     easier to use seek(); that will save you having to read all the
>     previous records into memory.
> 
> 
> No, I need to parse the entire file, checking records as I go.  Here's 
> the solution I came up with - I'm sure it could be optimized, but it's 
> already about six times faster than going record-by-record:
> 
> import StringIO    # module-level import, shown here for completeness
> 
> def loadInsurance(self):
>     header = ('Code', 'Name')
>     Global.Ins.append(header)
>     obj = Insurance()
>     recLen = obj.RecordLength
>     for offNum, offPath in Global.offices.iteritems():
>         if offPath.Ref == '':
>             offPath.Ref = offPath.Default
>         with open(offPath.Ref + obj.TLA + '.dat', 'rb') as inFile:
>             inFile.read(recLen)                  # throw away the header record
>             tmpIn = inFile.read(recLen * 4096)   # read-ahead buffer
>             while len(tmpIn) >= recLen:
>                 buf = StringIO.StringIO(tmpIn)
>                 inRec = buf.read(recLen)
>                 while len(inRec) >= recLen:      # full record: process it
>                     obj = Insurance(inRec)
>                     if obj.Valid:
>                         Global.Ins.append((obj.ID, obj.Name))
>                     inRec = buf.read(recLen)
>                 buf.close()
>                 tmpIn = inFile.read(recLen * 4096)
> 
> Obviously this is taken out of context, and I'm afraid I'm too lazy to 
> sanitize it (much) for posting right now, so here's a brief summary 
> instead.
> 
> 1-  I don't want my calling code to need to know many details.  So if I 
> create an object with no parameters, it provides me with the record 
> length (record lengths vary from 80 bytes up to 1024) and the TLA 
> portion of the filename (the data files are named in the format 
> xxTLA.dat, where xx is the 2-digit office number and TLA is the 
> three-letter acronym for what the file contains - e.g. INS for 
> insurance).
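> 
> For illustration, the class side looks roughly like this (the sizes and 
> field offsets here are made up - the real ones come from each file's 
> layout):
> 
> class Insurance(object):
>     RecordLength = 256                  # made-up size; the real files vary
>     TLA = 'INS'                         # middle part of xxINS.dat
>     def __init__(self, rec=None):
>         self.Valid = False
>         if rec is not None:
>             self.ID = rec[0:6].strip()      # made-up field offsets
>             self.Name = rec[6:46].strip()
>             self.Valid = bool(self.ID)      # stand-in validity check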
> 
> 2-  Using the information I just obtained, I then read through the file 
> one record-length chunk at a time, creating an object out of each chunk 
> and reading the attributes of that object.  In the next version of my 
> class library, I'll move the whole list-generation logic inside the 
> classes so I can just pass in a filename and receive a list... but 
> that's one for my copious free time.
> 
> 3-  Each file contains a header record, which is pure garbage.  I read 
> it in and throw it away before I even begin.  (I could seek to just past 
> it instead - would it really be more efficient?)
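> 
> (The seek version would just be
> 
> inFile.seek(recLen)    # position past the header without reading it
> 
> in place of the first read; for a single header record per file the 
> difference should be negligible either way.)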
> 
> 4-  Now here's where the read-ahead buffer comes in - I (attempt to) 
> read 4096 records' worth of data, and store it in a StringIO file-like 
> object.  (4096 is just a number I pulled out of the air, but I've tried 
> increasing and decreasing it, and it seems good.  If I have the time, I 
> may benchmark to find the best number for each record length, and 
> retrieve that number along with the record length and TLA.  Of course, 
> the optimal number probably varies per machine, so maybe I won't bother.)
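> 
> If I do benchmark, a rough loop might look like this (parseFile is a 
> hypothetical version of the loader that takes the buffer factor as a 
> parameter; fileName and recLen as above):
> 
> import time
> for factor in (256, 1024, 4096, 16384):
>     start = time.time()
>     parseFile(fileName, recLen, factor)   # parse with recLen*factor buffer
>     print '%6d records per read: %.2f seconds' % (factor,
>                                                   time.time() - start)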
> 
> 5-  Now I go through the buffer, one record's worth at a time, and do 
> whatever I'm doing with the records - in this case, I'm making a list of 
> insurance company IDs and names to display in a wx.CheckListCtrl.
> 
> 6-  If I try to read past the end of the file, there's no error - so I 
> need to check the size of what's returned.  If it's smaller than recLen, 
> I know I've hit the end.
>  6a- When I hit the end of the buffer, I close it and read in another 
> 4096 records' worth.
>  6b- When a read for 4096 records returns fewer than recLen bytes, I 
> know I've hit the end of the file.
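> 
> A quick demonstration of that behavior (the filename is made up):
> 
> inFile = open('02INS.dat', 'rb')
> inFile.seek(0, 2)                 # jump straight to end-of-file
> print repr(inFile.read(256))      # prints '' - no exception raised
> inFile.close()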
> 
> I've only tested on a few machines/client databases so far, but when I 
> added step 4, processing a 250MB transaction table (256-byte records) 
> went from nearly 30 seconds down to about 3.5 seconds.  Other results 
> have varied, but they've all shown improvement.
> 
> If anybody sees any glaring inefficiencies, let me know; OTOH if anybody 
> else needs to do something similar... here's one way to do it.
> 
> -- 
> www.fsrtechnologies.com
> 


