problems with re in binary files
Steven D. Arnold
stevena at neosynapse.net
Thu Sep 20 04:07:03 EDT 2001
Hi,
I'm trying to parse a binary file using the re module. At one point I
use '<<(.*)>>' with the re.DOTALL option. I expect this to find a '<<'
string, match greedily up to the last '>>' in the string (which
contains the whole document; it's small), and put everything in
between into the group. In practice, however, re seems to stop well
before the end of the string, and well before the last instance of
'>>'. In other words, the group doesn't contain everything it should.
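
For reference, here is a minimal sketch of the behavior described above as
expected. It assumes Python 3 bytes patterns, and the sample data and the
names `data` and `body` are made up for illustration; they are not taken
from the real file.

```python
import re

# Hypothetical sample: a small binary payload containing a NUL byte,
# a newline, high (8-bit) bytes, and two '>>' terminators.
data = b"head<<\x00\x01 first >> middle\n\xff\xfe >>tail"

# With re.DOTALL, '.' matches every byte, including newline.
# '.*' is greedy, so the group should run up to the LAST '>>'.
match = re.search(rb"<<(.*)>>", data, re.DOTALL)
body = match.group(1)

print(repr(body))
```

On in-memory data like this, the group spans everything between the first
'<<' and the last '>>', embedded NUL and newline included.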
I'm manipulating a binary file (and therefore a string of 8-bit
binary characters), so I thought perhaps the '.' was not matching the
NUL character. So I changed the expression above to
'<<((.|\000)*)>>'. My understanding is that this should match either
whatever the normal dot matches or a literal zero (NUL) character,
and that this alternation would then be matched zero or more times.
However, the behavior is basically the same.
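
A small sketch comparing the two patterns, again assuming Python 3 bytes
patterns and made-up sample data. It suggests why the alternation changes
nothing: '.' already matches a NUL byte, so adding `\000` as an
alternative gives the pattern nothing new to match.

```python
import re

# Hypothetical sample data with embedded NUL bytes and a newline.
data = b"<<\x00abc\x00\n\x80\x90>>junk>>end"

# '.' matches a NUL byte even without DOTALL; only newline is excluded.
dot_matches_nul = re.match(rb".", b"\x00") is not None

# Plain greedy pattern versus the explicit-NUL alternation.
plain = re.search(rb"<<(.*)>>", data, re.DOTALL).group(1)
with_nul = re.search(rb"<<((.|\000)*)>>", data, re.DOTALL).group(1)

print(dot_matches_nul, plain == with_nul)
```

If both patterns give the same result on in-memory data like this, the
truncation is probably not caused by how '.' treats NUL bytes.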
Anyone have any idea what I should do? I'm thinking of translating
the whole string to hexadecimal, then converting any numbers less than
or equal to 0x7F back to 7-bit ASCII, but that sounds like it'd be
really slow. What am I missing? Is there a canonical way to handle
this?
-----------------------------------------------------
Steven D. Arnold stevena at nospam.neosynapse.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~