Skipping bytes while reading a binary file?

Fri Feb 6 08:15:30 EST 2009

On Feb 5, 11:51 pm, Lionel <lionel.ke... at gmail.com> wrote:
> On Feb 5, 3:35 pm, Lionel <lionel.ke... at gmail.com> wrote:
>
>
>
> > On Feb 5, 2:56 pm, Lionel <lionel.ke... at gmail.com> wrote:
>
> > > On Feb 5, 2:48 pm, MRAB <goo... at mrabarnett.plus.com> wrote:
>
> > > > Lionel wrote:
>
> > > >  > Hello,
> > > >  > I have data stored in binary files. Some of these files are
> > > >  > huge...upwards of 2 gigs or more. They consist of 32-bit float complex
> > > > >  > numbers where the first 32 bits of the file is the real component, the
> > > > >  > second 32 bits is the imaginary, the third 32 bits is the real component
> > > > >  > of the second number, etc.
> > > >  >
> > > >  > I'd like to be able to read in just the real components, load them
> > > > >  > into a numpy.ndarray, then load the imaginary components and load them
> > > > >  > into a numpy.ndarray.  I need the real and imaginary components stored
> > > > >  > in separate arrays, they cannot be in a single array of complex
> > > >  > numbers except for temporarily. I'm trying to avoid temporary storage,
> > > >  > though, because of the size of the files.
> > > >  >
> > > >  > I'm currently reading the file scanline-by-scanline to extract rows of
> > > >  > complex numbers which I then loop over and load into the real/
> > > >  > imaginary arrays as follows:
> > > >  >
> > > >  >
> > > > >  >         self._realData      = numpy.empty((Rows, Columns), dtype=numpy.float32)
> > > > >  >         self._imaginaryData = numpy.empty((Rows, Columns), dtype=numpy.float32)
> > > > >  >
> > > > >  >         for CurrentRow in range(Rows):
> > > > >  >
> > > > >  >             # array.fromfile() appends, so start each row with an empty array
> > > > >  >             floatData = array.array('f')
> > > > >  >             floatData.fromfile(DataFH, (Columns*2))
> > > > >  >
> > > > >  >             position = 0
> > > > >  >             for CurrentColumn in range(Columns):
> > > > >  >
> > > > >  >                 self._realData[CurrentRow, CurrentColumn]      = floatData[position]
> > > > >  >                 self._imaginaryData[CurrentRow, CurrentColumn] = floatData[position+1]
> > > > >  >                 position = position + 2
> > > >  >
> > > >  >
> > > >  > The above code works but is much too slow. If I comment out the body
> > > >  > of the "for CurrentColumn in range(Columns)" loop, the performance is
> > > >  > perfectly adequate i.e. function call overhead associated with the
> > > >  > "fromfile(...)" call is not very bad at all. What seems to be most
> > > >  > time-consuming are the simple assignment statements in the
> > > >  > "CurrentColumn" for-loop.
> > > >  >
> > > > [snip]
> > > > Try array slicing. floatData[0::2] will return the real parts and
> > > > floatData[1::2] will return the imaginary parts. You'll have to read up
> > > > how to assign to a slice of the numpy array (it might be
> > > > "self._realData[CurrentRow] = real_parts" or "self._realData[CurrentRow,
> > > > :] = real_parts").
>
> > > > BTW, it's not the function call overhead of fromfile() which takes the
> > > > time, but actually reading data from the file.
>
> > > Very nice! I like that! I'll post the improvement (if any).
>
> > > L
>
> > Okay, the following:
>
> >             self._realData[CurrentRow]      = floatData[0::2]
> >             self._imaginaryData[CurrentRow] = floatData[1::2]
>
> > gives a 3.5x improvement in execution speed over the original that I
> > posted. That's much better. Thank you for the suggestion.
>
> > L
>
> Correction: improvement is around 7-8x.
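
For anyone finding this thread later, the slice-per-row idea quoted above
can be sketched end to end roughly as follows. This is only a sketch under
assumptions: the function name read_split_complex is made up for
illustration, the file is assumed to hold float32 values in the machine's
native byte order (the same assumption array.array('f') makes), and
numpy.frombuffer is swapped in for array.fromfile so that each row is parsed
from a fresh buffer.

```python
import io

import numpy as np


def read_split_complex(fh, rows, columns):
    """Read interleaved float32 (real, imag) pairs into two 2-D arrays.

    Hypothetical helper: only one row of interleaved data is held in
    memory at a time, in addition to the two output arrays.
    """
    real = np.empty((rows, columns), dtype=np.float32)
    imag = np.empty((rows, columns), dtype=np.float32)
    row_bytes = columns * 2 * 4  # two 4-byte floats per complex sample

    for r in range(rows):
        buf = fh.read(row_bytes)
        if len(buf) != row_bytes:
            raise EOFError("file ended before row %d was complete" % r)
        # View the raw bytes as float32 (native byte order assumed).
        row = np.frombuffer(buf, dtype=np.float32)
        real[r] = row[0::2]  # even positions: real parts
        imag[r] = row[1::2]  # odd positions: imaginary parts
    return real, imag


# Small in-memory stand-in for the data file: re0, im0, re1, im1, ...
samples = np.arange(12, dtype=np.float32)
real, imag = read_split_complex(io.BytesIO(samples.tobytes()), rows=2, columns=3)
```

The strided slices do the de-interleaving inside numpy, so the per-element
Python loop disappears entirely.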

I had similar issues while slicing network packets (TCP/UDP) in real
time.
I was using 're' and found it consumed far more time and resources
than the 'normal' slicing suggested by MRAB.
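
To illustrate the difference, here is a small sketch that pulls the two
port fields out of a UDP header both ways. The header bytes are made up for
the example, and the regular expression is only one possible way to do the
same job with 're'; it is not the code I actually used.

```python
import re
import struct

# Made-up UDP header (source port, destination port, length, checksum,
# each a 16-bit big-endian field) followed by a 4-byte payload.
packet = struct.pack("!HHHH", 5353, 53, 12, 0) + b"data"

# Plain slicing: fixed offsets, no pattern matching involved.
src_slice = packet[0:2]
dst_slice = packet[2:4]

# The same two fields via 're'; (?s) lets '.' match any byte, including b"\n".
match = re.match(rb"(?s)(..)(..)", packet)
src_re, dst_re = match.groups()

# Either way, struct turns the raw bytes into integers.
src_port, dst_port = struct.unpack("!HH", packet[0:4])
```

Both approaches return the same bytes; slicing simply skips the regular
expression engine when the field offsets are already known.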

K