[Tutor] Read-ahead for large fixed-width binary files?
Kent Johnson
kent37 at tds.net
Sun Nov 18 05:20:09 CET 2007
I would wrap the record buffering into a generator function and probably
use plain slicing to return the individual records instead of StringIO.
I have a writeup on generators here:
http://personalpages.tds.net/~kent37/kk/00004.html
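
For example, a minimal sketch of that idea (the function name, arguments and
default buffer size below are mine, not taken from Marc's code):

def fixed_records(path, recLen, bufRecs=4096):
    """Yield one recLen-byte record at a time, reading bufRecs records per disk read."""
    with open(path, 'rb') as inFile:
        inFile.read(recLen)                      # skip the header record
        while True:
            chunk = inFile.read(recLen * bufRecs)
            # plain slicing instead of a StringIO buffer
            for pos in xrange(0, len(chunk) - recLen + 1, recLen):
                yield chunk[pos:pos + recLen]
            if len(chunk) < recLen * bufRecs:    # short read: end of file
                break

The calling code then just iterates:

    for inRec in fixed_records(fileName, recLen):
        obj = Insurance(inRec)
        ...
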
Kent
Marc Tompkins wrote:
> Alan Gauld wrote:
>
> "Marc Tompkins" <marc.tompkins at gmail.com
> <mailto:marc.tompkins at gmail.com>> wrote
> > realized I can implement this myself, using 'read(bigsize)' -
> > currently I'm using 'read(recordsize)'; I just need to add an extra
> > loop around my record reads. Please disregard...
> If you just want to navigate to a specific record then it might be
> easier to use seek(), that will save you having to read all the
> previous records into memory.
>
>
> No, I need to parse the entire file, checking records as I go. Here's
> the solution I came up with - I'm sure it could be optimized, but it's
> already about six times faster than going record-by-record:
>
> # (needs "import StringIO" at module level)
> def loadInsurance(self):
>     header = ('Code', 'Name')
>     Global.Ins.append(header)
>     obj = Insurance()                        # no-argument instance just supplies metadata
>     recLen = obj.RecordLength
>     for offNum, offPath in Global.offices.iteritems():
>         if (offPath.Ref == ''):
>             offPath.Ref = offPath.Default
>         with open(offPath.Ref + obj.TLA + '.dat', 'rb') as inFile:
>             tmpIn = inFile.read(recLen)          # throw away the header record
>             tmpIn = inFile.read(recLen*4096)     # read-ahead: 4096 records at a time
>             while not (len(tmpIn) < recLen):
>                 buf = StringIO.StringIO(tmpIn)
>                 inRec = buf.read(recLen)
>                 while not (len(inRec) < recLen):
>                     obj = Insurance(inRec)
>                     if (obj.Valid):
>                         Global.Ins.append((obj.ID, obj.Name))
>                     inRec = buf.read(recLen)
>                 buf.close()
>                 tmpIn = inFile.read(recLen*4096)
>
> Obviously this is taken out of context, and I'm afraid I'm too lazy to
> sanitize it (much) for posting right now, so here's a brief summary
> instead.
>
> 1- I don't want my calling code to need to know many details. So if I
> create an object with no parameters, it provides me with the record
> length (record sizes range from 80 bytes up to 1024) and the TLA portion
> of the filename (the data files are named in the format xxTLA.dat, where
> xx is the 2-digit office number and TLA is the three-letter acronym for
> what the file contains - e.g. INS for insurance.)
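>
> In other words, the calling code only needs something along these lines
> (the path prefix is illustrative; offPath.Ref presumably already ends with
> the office's directory and two-digit number, as in the loop above):
>
>     obj = Insurance()                         # no parameters: just gives me metadata
>     recLen = obj.RecordLength                 # anywhere from 80 to 1024 bytes
>     fileName = offPath.Ref + obj.TLA + '.dat' # e.g. ...01INS.dat for office 01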
>
> 2- Using the information I just obtained, I then read through the file
> one record-length chunk at a time, creating an object out of each chunk
> and reading the attributes of that object. In the next version of my
> class library, I'll move the whole list-generation logic inside the
> classes so I can just pass in a filename and receive a list... but
> that's one for my copious free time.
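>
> That future version might look something like this (just a sketch, and the
> classmethod name is made up on the spot):
>
>     class Insurance(object):
>         # ... existing record-parsing code ...
>
>         @classmethod
>         def listFromFile(cls, fileName):
>             """Parse every record in fileName; return (ID, Name) for valid records."""
>             results = []
>             recLen = cls().RecordLength
>             with open(fileName, 'rb') as inFile:
>                 inFile.read(recLen)              # discard the header record
>                 chunk = inFile.read(recLen)
>                 while len(chunk) == recLen:      # record-by-record here, for clarity
>                     rec = cls(chunk)
>                     if rec.Valid:
>                         results.append((rec.ID, rec.Name))
>                     chunk = inFile.read(recLen)
>             return results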
>
> 3- Each file contains a header record, which is pure garbage. I read
> it in and throw it away before I even begin. (I could seek to just past
> it instead - would it really be more efficient?)
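> (That alternative would just be
>
>     inFile.seek(recLen)      # jump past the header instead of reading it
>
> in place of the first throwaway read(recLen) - and it only happens once per
> file either way.)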
>
> 4- Now here's where the read-ahead buffer comes in - I (attempt to)
> read 4096 records' worth of data, and store it in a StringIO file-like
> object. (4096 is just a number I pulled out of the air, but I've tried
> increasing and decreasing it, and it seems good. If I have the time, I
> may benchmark to find the best number for each record length, and
> retrieve that number along with the record length and TLA. Of course,
> the optimal number probably varies per machine, so maybe I won't bother.)
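>
> If I ever do benchmark it, the test could be as crude as this (assuming
> loadInsurance grows a buffer-size parameter in place of the hard-coded 4096,
> and 'loader' stands in for whatever object owns the method):
>
>     import time
>
>     # (would also want to reset Global.Ins between runs so results don't pile up)
>     for bufRecs in (256, 1024, 4096, 16384):
>         start = time.time()
>         loader.loadInsurance(bufRecs)
>         print '%6d records per read: %.2f seconds' % (bufRecs, time.time() - start)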
>
> 5- Now I go through the buffer, one record's worth at a time, and do
> whatever I'm doing with the records - in this case, I'm making a list of
> insurance company IDs and names to display in a wx.CheckListCtrl.
>
> 6- If I try to read past the end of the file, there's no error - so I
> need to check the size of what's returned. If it's smaller than recLen,
> I know I've hit the end.
> 6a- When I hit the end of the buffer, I close it and read in another
> 4096 records.
> 6b- When I try to read another 4096 records' worth and end up with fewer
> than recLen bytes, I know I've hit the end of the file.
>
> I've only tested on a few machines/client databases so far, but when I
> added step 4, processing a 250MB transaction table (256-byte records)
> went from nearly 30 seconds down to about 3.5 seconds. Other results
> have varied, but they've all shown improvement.
>
> If anybody sees any glaring inefficiencies, let me know; OTOH if anybody
> else needs to do something similar... here's one way to do it.
>
> --
> www.fsrtechnologies.com
>
>
>
> _______________________________________________
> Tutor maillist - Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor