[Numpy-discussion] Fastest way to parse a specific binary file

Thu Sep 3 01:22:18 EDT 2009

On Wed, Sep 2, 2009 at 23:59, Gökhan Sever <gokhansever at gmail.com> wrote:

> Robert,
>
> You must have thrown a couple of RTFMs while replying to my emails :)

Not really. There's no manual for this. Greg Wilson's _Data Crunching_
may be a good general introduction to how to think about these
problems.

http://www.pragprog.com/titles/gwd/data-crunching

> I usually
> take a trial-and-error approach initially, and don't give up unless I hit a
> hurdle quickly, which in this case was the unsuccessful regex
> approach. On the bright side, I have learnt the basics of regular
> expressions and realized how powerful they can be in a text-parsing
> task.
>
> Enough prattle, below is what I am working on:
>
> So far I have been able to extract the file names and the data
> associated with those names (except in the cases with multiple buffers
> per file).
>
> However, I am not reading the time increments correctly. I should be seeing
> 1-second time ticks from the time segment, but all it does is
> return the same first time value.
>
> Furthermore, I still haven't figured out how to write the main loop
> (range(500) is just a dummy number; I want to process the whole binary
> file). I don't yet know how to make the range input generic so that it
> works for a similar binary file of any size.

while True:
   ...

   if no_more_data():
       break

> import numpy as np
> import struct
>
> f = open('test.sea', 'rb')
>
> dt = np.dtype([('tagNumber', np.uint16), ('dataOffset', np.uint16),
> ('numberBytes', np.uint16), ('samples', np.uint16), ('bytesPerSample',
> np.uint16), ('type', np.uint8), ('param1', np.uint8), ('param2',
> np.uint8), ('param3', np.uint8), ('address', np.uint16)])
>
>
> start = 0
> ct = 0
>
> for i in range(500):
>
>     header = np.fromstring(f.read(dt.itemsize), dt)[0]
>
>     if header['tagNumber'] == 65530:
>         loc = f.tell()
>         f.seek(start + header['dataOffset'])
>         f.read(header['numberBytes'])

Presumably you are doing something with this data, not just discarding it.

>         f.seek(loc)

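This line is fine as written. f.seek(loc) is the same as f.seek(loc, 0):
the second argument (whence) defaults to 0, which means an absolute
offset from the beginning of the file. Only whence=1 seeks relative to
the current position. In the code quoted here, though, start is never
updated from 0, so every f.seek(start + header['dataOffset']) is
measured from the beginning of the file. If dataOffset is relative to
the start of each buffer, that might be why you keep reading the same
time value.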
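
For reference, a quick check of how the second argument to seek()
(whence) behaves. This is a small sketch on an in-memory io.BytesIO
object rather than your .sea file; the 100-byte buffer and the variable
names are arbitrary.

```python
import io

# 100 bytes whose values equal their own offsets, so reads are easy to check.
f = io.BytesIO(bytes(range(100)))

f.seek(10)
loc = f.tell()            # remember where we are (offset 10)

f.seek(40)                # jump elsewhere and consume a few bytes
f.read(5)

f.seek(loc)               # whence defaults to 0: absolute offset
pos_default = f.tell()

f.seek(loc, 0)            # explicit whence=0: same result
pos_absolute = f.tell()

f.seek(5, 1)              # whence=1: relative to the current position
pos_relative = f.tell()

f.seek(-4, 2)             # whence=2: relative to the end of the file
pos_from_end = f.tell()
```

The same constants are available by name as os.SEEK_SET (0),
os.SEEK_CUR (1) and os.SEEK_END (2).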

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco