[Numpy-discussion] Fastest way to parse a specific binary file

Gökhan Sever gokhansever at gmail.com
Thu Sep 3 00:59:54 EDT 2009


On Wed, Sep 2, 2009 at 1:58 PM, Robert Kern <robert.kern at gmail.com> wrote:

> On Wed, Sep 2, 2009 at 13:28, Gökhan Sever<gokhansever at gmail.com> wrote:
> > Put the reference manual in:
> >
> > http://drop.io/1plh5rt
> >
> > First few pages describe the data format they use.
>
> Ah. The fields are *not* delimited by a fixed value. Regexes are no
> help to you for pulling out the information you need, except perhaps
> later to parse the text fields. I think you are also getting spurious
> results because your regex matches things inside data fields.
>
> Instead, you have a header containing the length of the data field
> followed by the data field. Create a structured dtype that corresponds
> to the DataDir struct on page 15. Note that "unsigned int" there is
> actually a numpy.uint16, not a uint32.
>
>  dt = np.dtype([('tagNumber', np.uint16), ('dataOffset', np.uint16),
> ('numberBytes', np.uint16), ('samples', np.uint16), ('bytesPerSample',
> np.uint16), ('type', np.uint8), ('param1', np.uint8), ('param2',
> np.uint8), ('param3', np.uint8), ('address', np.uint16)])
>
> Now read dt.itemsize bytes from the file and use
>
>  header = np.fromstring(f.read(dt.itemsize), dt)[0]
>
> to get a record object that corresponds to the header. Use the
> dataOffset and numberBytes fields to extract the actual data bytes
> from the file.
>
> For example, if we go to the second header field:
>
> In [28]: f.seek(dt.itemsize,0)
>
> In [29]: header = np.fromstring(f.read(dt.itemsize), dt)[0]
>
> In [30]: header
> Out[30]: (65530, 100, 8, 1, 8, 255, 0, 0, 0, 43605)
>
> In [31]: f.seek(header['dataOffset'], 0)
>
> In [32]: f.read(header['numberBytes'])
> Out[32]: 'prj.300\x00'
>
>
> There are still some semantic issues you need to work out.
> There are multiple "buffers" per file, and the dataOffsets are
> relative to the start of the buffer, not the file.
>
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>  -- Umberto Eco

Robert,

You must have thrown a couple of RTFMs my way while replying to my emails :)
I usually start with trial and error and don't give up until I hit a wall,
which in this case was the unsuccessful regex approach. On the bright side,
I have learnt the basics of regular expressions and realized how powerful
they can be for text-parsing tasks.

Enough prattle, below is what I am working on:

So far I have been able to extract the file names and the data associated
with those names (except in the multiple-buffers-per-file cases).
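
For the multiple-buffer case, the buffer-relative lookup Robert describes
could be sketched roughly as below. This is only an illustration under
assumptions: read_field is a hypothetical helper name, the bytes are
synthetic (a 24-byte stand-in for an earlier buffer, then one header and its
data), and np.frombuffer is used where the thread uses np.fromstring.

```python
import io
import numpy as np

# 16-byte directory entry, same layout as the DataDir dtype in this thread.
dt = np.dtype([('tagNumber', np.uint16), ('dataOffset', np.uint16),
               ('numberBytes', np.uint16), ('samples', np.uint16),
               ('bytesPerSample', np.uint16), ('type', np.uint8),
               ('param1', np.uint8), ('param2', np.uint8),
               ('param3', np.uint8), ('address', np.uint16)])

def read_field(f, buffer_start, header):
    """Return the bytes a header points at; the file position is restored."""
    here = f.tell()
    # dataOffset is relative to the start of the buffer, not the file.
    # int() keeps the sum a plain Python int instead of a 16-bit value.
    f.seek(buffer_start + int(header['dataOffset']))
    data = f.read(int(header['numberBytes']))
    f.seek(here)
    return data

# Synthetic demo (assumed values, not taken from a real file).
hdr = np.zeros(1, dtype=dt)
hdr['tagNumber'] = 65530
hdr['dataOffset'] = 16      # data sits right after the single header
hdr['numberBytes'] = 8
buffer_start = 24           # pretend an earlier buffer occupied 24 bytes
stream = io.BytesIO(b'\xff' * buffer_start + hdr.tobytes() + b'prj.300\x00')

stream.seek(buffer_start)
header = np.frombuffer(stream.read(dt.itemsize), dtype=dt)[0]
name = read_field(stream, buffer_start, header)
position_after = stream.tell()
```

Because the helper restores the file position, the caller can keep reading
directory entries in order and only needs to track where each buffer begins.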

However, I am not reading the time increments correctly. I should be seeing
one-second time ticks from the time segment, but all it does is return the
same first time value.

Furthermore, I still haven't figured out how to write the main loop.
range(500) is just a dummy number that lets me process the whole binary
file; I don't yet know how to make the loop bound generic so that it works
for a similar binary file of any size.


import numpy as np
import struct

f = open('test.sea', 'rb')

dt = np.dtype([('tagNumber', np.uint16), ('dataOffset', np.uint16),
('numberBytes', np.uint16), ('samples', np.uint16), ('bytesPerSample',
np.uint16), ('type', np.uint8), ('param1', np.uint8), ('param2',
np.uint8), ('param3', np.uint8), ('address', np.uint16)])


start = 0
ct = 0

for i in range(500):

    header = np.fromstring(f.read(dt.itemsize), dt)[0]

    if header['tagNumber'] == 65530:
        loc = f.tell()
        f.seek(start + header['dataOffset'])
        f.read(header['numberBytes'])
        f.seek(loc)
    elif header['tagNumber'] == 65531:
        loc = f.tell()
        f.seek(start + header['dataOffset'])
        f.read(header['numberBytes'])
        start = f.tell()
    elif header['tagNumber'] == 0:
        loc = f.tell()
        f.seek(start + header['dataOffset'])
        print(f.tell())
        k = f.read(header['numberBytes'])
        print(struct.unpack('9h', k[:18]))
        f.seek(loc)
        ct += 1
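
One possible way to drop the hard-coded range(500) is to keep reading
headers until a read comes back short. A minimal sketch, assuming the file
simply ends after the last buffer: iter_headers is a hypothetical helper
name, np.frombuffer stands in for np.fromstring, and the demo runs on
synthetic in-memory bytes rather than a real .sea file.

```python
import io
import numpy as np

# 16-byte directory entry, same layout as the dtype used above.
dt = np.dtype([('tagNumber', np.uint16), ('dataOffset', np.uint16),
               ('numberBytes', np.uint16), ('samples', np.uint16),
               ('bytesPerSample', np.uint16), ('type', np.uint8),
               ('param1', np.uint8), ('param2', np.uint8),
               ('param3', np.uint8), ('address', np.uint16)])

def iter_headers(f):
    """Yield (position, header) pairs until fewer than 16 bytes remain."""
    while True:
        pos = f.tell()
        raw = f.read(dt.itemsize)
        if len(raw) < dt.itemsize:
            # Empty or truncated read: treat it as end of file.
            return
        yield pos, np.frombuffer(raw, dtype=dt)[0]

# Synthetic demo: two headers followed by 3 stray bytes (too few for a header).
demo = np.zeros(2, dtype=dt)
demo['tagNumber'] = [65530, 0]
stream = io.BytesIO(demo.tobytes() + b'\x01\x02\x03')

tags = []
positions = []
for pos, header in iter_headers(stream):
    positions.append(pos)
    tags.append(int(header['tagNumber']))
```

The body of the loop can still seek to a data field and back, as the script
above does, because the generator picks up from wherever the file position
is when the next header is requested.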



-- 
Gökhan