problems with re in binary files

Thu Sep 20 04:07:03 EDT 2001

Hi,

I'm trying to parse a binary file using the re module.  At one point I
use '<<(.*)>>' with the re.DOTALL option.  I expect this to find the
first '<<', then the last '>>' in the string (which contains the
whole document; it's small), and put everything between them in the
group.  However, in practice, re seems to stop well before the end of
the string, and well before the last instance of '>>'.  In other
words, the group doesn't contain everything it should.
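
For reference, here is a minimal sketch of the behavior I expect.  It
assumes a current Python 3, where binary data is a bytes object and the
pattern must be bytes too; the sample data is made up for illustration
and is not my real file.

```python
import re

# Hypothetical sample data (not the real file): it contains a NUL byte,
# high-bit bytes, a newline, and more than one '>>'.
data = b"head<<\x00\xff first>>\n middle >> \x80 last>>tail"

# re.DOTALL lets '.' match newlines as well.  '.*' is greedy, so the
# group should run from the first '<<' up to the last '>>'.
pattern = re.compile(rb"<<(.*)>>", re.DOTALL)
match = pattern.search(data)
group = match.group(1)

print(repr(group))
```

With this data the group ends just before the final '>>', which is what
I expected to see on the real file as well.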

I'm manipulating a binary file (and therefore a string with 8-bit
binary characters), so I thought perhaps the `.' was not matching the
NUL character.  So I changed the expression above to
'<<((.|\000)*)>>'.  My understanding is that this should match either
whatever the normal dot matches or a literal zero (NUL) character,
and that this alternation would then be matched zero or more times.
However, the behavior is basically the same.
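
A small check along these lines, again assuming Python 3 bytes patterns
and made-up sample data, suggests that `.' already matches NUL, and that
what it skips without re.DOTALL is the newline.  If that holds, the
'\000' alternative adds nothing, and whether the flag actually reaches
the compiled pattern is what matters.

```python
import re

# '.' matches a NUL byte even without any flags.
dot_matches_nul = re.match(rb".", b"\x00") is not None

# Without re.DOTALL, '.' does not match a newline.
dot_matches_newline = re.match(rb".", b"\n") is not None

# Hypothetical sample data with a NUL, a newline, and two '>>'.
sample = b"<<a\x00b>>\nc>>"

# The alternation pattern without the flag stops at the newline, so the
# match ends at the first '>>'.
without_flag = re.search(rb"<<((.|\000)*)>>", sample).group(1)

# The same pattern with re.DOTALL reaches the last '>>'.
with_flag = re.search(rb"<<((.|\000)*)>>", sample, re.DOTALL).group(1)

print(dot_matches_nul, dot_matches_newline)
print(repr(without_flag), repr(with_flag))
```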

Anyone have any idea what I should do?  I'm thinking of translating
the whole string to hexadecimal, then converting any numbers less than
or equal to 0x7F back to 7-bit ASCII, but that sounds like it'd be
really slow.  What am I missing?  Is there a canonical way to handle
this?
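
One reading of the hexadecimal idea, sketched under the same Python 3
assumption: keep every byte up to 0x7F as its ASCII character and write
anything higher as a textual \xNN escape.  The function name
escape_high_bytes is hypothetical, made up for this sketch.

```python
def escape_high_bytes(data: bytes) -> str:
    """Keep bytes <= 0x7F as characters; spell higher bytes as \\xNN text."""
    parts = []
    for byte in data:
        if byte <= 0x7F:
            parts.append(chr(byte))
        else:
            # Four characters of text: backslash, 'x', two hex digits.
            parts.append("\\x%02x" % byte)
    return "".join(parts)

# Hypothetical sample: a NUL byte, an ASCII letter, and one high byte.
escaped = escape_high_bytes(b"<<\x00A\xff>>")
print(repr(escaped))
```

Note that a NUL byte is below 0x7F, so it passes through this
translation unchanged; if NUL were the problem, this would not remove
it.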

-----------------------------------------------------
Steven D. Arnold        stevena at nospam.neosynapse.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~