Manipulate Large Binary Files

George Sakkis george.sakkis at gmail.com
Wed Apr 2 15:56:07 EDT 2008


On Apr 2, 2:09 pm, "Derek Tracy" <trac... at gmail.com> wrote:
> On Wed, Apr 2, 2008 at 10:59 AM, Derek Tracy <trac... at gmail.com> wrote:
> > I am trying to write a script that reads in a large binary file (over 2 GB), saves the header (169088 bytes) into one file, then dumps the rest of the data into another file.  I generated code that works wonderfully for files under 2 GB in size, but the majority of the files I am dealing with are over the 2 GB limit.
>
> > import array
> >
> > INPUT = open(infile, 'rb')
> > header = INPUT.read(169088)
>
> > ary = array.array('H', INPUT.read())
>
> > INPUT.close()
>
> > OUTF1 = open(outfile1, 'wb')
> > OUTF1.write(header)
> > OUTF1.close()
>
> > OUTF2 = open(outfile2, 'wb')
> > ary.tofile(OUTF2)
> > OUTF2.close()
>
> > When I try to use the above on files over 2 GB I get:
> >      OverflowError: requested number of bytes is more than a Python string can hold
>
> > Does anybody have an idea as to how I can get by this hurdle?
>
> > I am working in an environment that does not allow me to freely download third-party modules.  Python version: 2.5.1
>
> > R/S --
> > ---------------------------------
> > Derek Tracy
> > trac... at gmail.com
> > ---------------------------------
>
> I now have 2 solutions, one using partial and the other using array.
>
> Both are clocking in at the same time (1m 5s for 2.6 GB).  Are there
> any ways I can optimize either solution?  Would turning off the
> read/write buffering increase speed?

You may try increasing the buffer size when you open() the file
and see if that helps:

from functools import partial

def iterchunks(filename, buffering):
    # Open in binary mode with the given buffer size, then yield
    # successive buffering-sized chunks; iter() stops once read()
    # returns the empty string at end of file.
    f = open(filename, 'rb', buffering)
    return iter(partial(f.read, buffering), '')

# Time a few buffer sizes to see which works best:
for chunk in iterchunks(filename, 32*1024): pass
#for chunk in iterchunks(filename, 1024**2): pass
#for chunk in iterchunks(filename, 10*1024**2): pass
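
For your original task of splitting off the header, the same chunked
approach avoids ever holding the whole file in memory.  A minimal
sketch along those lines (split_file, its argument names, and the
10 MB default chunk size are just illustrative placeholders; the
169088-byte header size is the one from your post):

def split_file(infile, outfile1, outfile2,
               header_size=169088, chunk_size=10*1024**2):
    # Copy the fixed-size header to the first output file, then
    # stream the remainder to the second in chunk_size pieces, so
    # at most chunk_size bytes sit in memory at once.  This
    # sidesteps the OverflowError that a single read() raises on
    # files over 2 GB.
    src = open(infile, 'rb')
    out1 = open(outfile1, 'wb')
    out1.write(src.read(header_size))
    out1.close()
    out2 = open(outfile2, 'wb')
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        out2.write(chunk)
    out2.close()
    src.close()

Since the job is I/O-bound, the exact chunk size shouldn't matter
much; anything from 64 KB up to a few MB will likely clock in about
the same.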


George


