[Numpy-discussion] Fastest way to parse a specific binary file

Gökhan Sever gokhansever at gmail.com
Wed Sep 2 14:42:06 EDT 2009


On Wed, Sep 2, 2009 at 12:46 PM, Robert Kern <robert.kern at gmail.com> wrote:

> On Wed, Sep 2, 2009 at 12:33, Gökhan Sever<gokhansever at gmail.com> wrote:
> > How does your find suggestion work? It just returns the location of the
> > first occurrence.
>
> http://docs.python.org/library/stdtypes.html#str.find
>
> str.find(sub[, start[, end]])
>    Return the lowest index in the string where substring sub is
> found, such that sub is contained in the range [start, end]. Optional
> arguments start and end are interpreted as in slice notation. Return
> -1 if sub is not found.
>
> But perhaps you should profile your code to see where it is actually
> taking up the time. Regexes on 1.3 MB of data should be quite fast.
>
> In [21]: marker = '\x00\x00\@\x00$\x00\x02'
>
> In [22]: block = marker + '\xde\xca\xfb\xad' * ((1024-8) // 4)
>
> In [23]: data = int(round(1.3 * 1024)) * block
>
> In [24]: import re
>
> In [25]: r = re.compile(re.escape(marker))
>
> In [26]: %time r.findall(data)
> CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
> Wall time: 0.01 s
>
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>  -- Umberto Eco
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
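
For reference, a minimal sketch of how find can be called repeatedly with its
start argument to collect every occurrence instead of only the first one. The
helper name find_all and the sample data are made up for illustration, and
bytes literals are used so that it also runs on Python 3:

```python
def find_all(data, marker):
    """Return the start offsets of every non-overlapping occurrence of marker."""
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        # Resume the search just past the occurrence that was found.
        pos = data.find(marker, pos + len(marker))
    return offsets


# Illustrative data only: two copies of a 5-byte marker inside some filler.
marker = b"\x00\x00@\x00$"
data = b"abc" + marker + b"defgh" + marker + b"ij"
offsets = find_all(data, marker)
print(offsets)
```

Each call scans forward from the previous hit, so the whole string is walked
only once.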


This is what I have been using. It returns something very close to what I
want, though not exactly, and it is also slow:

I[52]: mypattern = re.compile('\0\0\1\0.+?\0\0@\0\$', re.DOTALL)

I[53]: res = mypattern.findall(ss)

I[54]: len res
-----> len(res)
O[54]: 95

I[55]: %time mypattern.findall(ss);
CPU times: user 9.14 s, sys: 0.00 s, total: 9.14 s
Wall time: 9.16 s

I[57]: res[0]
O[57]:
'\x00\x00\x01\x00\x00\x00\xd9\x07\x04\x00\x02\x00\r\x00\x06\x00\x03\x00\x00\x00\x01\x00\x00\x00
*prj.300*\x00; Version = 1\nProjectName = PME1 2009 King Air
N825ST\nFlightId = \nAircraftType = WMI King Air 200\nAircraftId =
N825ST\nOperatorName = Weather Modification Inc.\nComments = \n\x00\x00@
\x00$'

I need the part that starts at the bold section (prj.300) and runs to the end
of the match. I need the bold part because I can construct a file name from it
and then write the content that follows into that file.
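
One possible way to get the name and the content separately is to add capture
groups to the pattern. The sketch below is only that: it assumes the layout
read off the single res[0] sample above (a 4-byte start marker, 20 header
bytes, a NUL-terminated name, the text, then the end marker). HEADER_LEN,
SECTION, extract_sections and the test data are hypothetical, and the real
file may not follow this layout everywhere:

```python
import re

START = b"\x00\x00\x01\x00"   # start marker used in the pattern above
END = b"\x00\x00@\x00$"       # end marker used in the pattern above
HEADER_LEN = 20               # assumption: header bytes before the name in res[0]

# Group 1 is the NUL-terminated name, group 2 is everything up to END.
SECTION = re.compile(
    re.escape(START)
    + b".{%d}" % HEADER_LEN
    + rb"([^\x00]+)\x00(.*?)"
    + re.escape(END),
    re.DOTALL,
)


def extract_sections(data):
    """Return a list of (name, body) pairs, one per matched section."""
    return [(m.group(1).decode("ascii"), m.group(2))
            for m in SECTION.finditer(data)]


# Synthetic data that follows the assumed layout, not the real file.
header = bytes(HEADER_LEN)
sample = (
    START + header + b"prj.300\x00" + b"; Version = 1\nFlightId = \n" + END
    + START + header + b"hdr.txt\x00" + b"second body" + END
)
sections = extract_sections(sample)
for name, body in sections:
    print(name, len(body))
```

Each (name, body) pair could then be written out with something like
open(name, 'wb').write(body).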

Oh, and when it works, the search should return 86 occurrences.


-- 
Gökhan