problems with re in binary files
tim.one at home.com
Thu Sep 20 10:32:33 CEST 2001
[Steven D. Arnold]
> I'm trying to parse a binary file using the re module. At one point I
> use '<<(.*)>>' with the re.DOTALL option. I expect this to find a '<<'
> string, find the last '>>' string in the string (which contains the
> whole document; it's small), and everything between would go in the
> group. However, in practice, re seems to stop well before the end of
> the string, and well before the last instance of '>>'. In other
> words, the group doesn't seem to contain everything it should.
It's always better to give a small, self-contained program than to try to
explain. For example, here's a small program:
import re
test = "<<" + "\x00\n>>" * 1000 + ">>"
pat = re.compile(r"<<(.*)>>", re.DOTALL)
print "Test string has", len(test), "chars."
m = pat.search(test)
if m:
    print "Group 1 spans slice %d:%d" % m.span(1)
else:
    print "Didn't match!"
What does that print when you run it? When I run it, it prints
Test string has 4004 chars.
Group 1 spans slice 2:4002
This shows that embedded null bytes, and embedded newlines, and 1000 "early"
hits on ">>", don't fool re. Therefore you have a bug in your Python, or
you haven't told us something *relevant* about why it isn't working for you.
If you show us actual code, it's much easier than guessing.
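One way to turn "stops well before the last '>>'" into something checkable is to compare where group 1 ends with where the last '>>' really sits, using plain rfind. This is only a sketch: the helper name check_span and the sample string are made up for illustration, and you'd substitute the string you actually read from your file.

```python
import re

pat = re.compile(r"<<(.*)>>", re.DOTALL)

def check_span(data):
    # Hypothetical helper: return (end of group 1, index of the last ">>"),
    # or None if the pattern doesn't match at all.
    # With a greedy .* under DOTALL the two numbers should be equal.
    m = pat.search(data)
    if m is None:
        return None
    return m.end(1), data.rfind(">>")

# Made-up stand-in for the file contents: NUL, newline, a high byte,
# and an "early" >> before the real last one.
sample = "junk<<a\x00b\n>>c\xff>>tail"
result = check_span(sample)
print("group 1 ends at %d, last '>>' is at %d" % result)
```

If the two numbers differ on your data, that's worth reporting as an re bug. If they agree, re found the last '>>' in the string it was given, and the problem lies elsewhere.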
> I'm manipulating a binary file (and therefore a string with 8-bit
> binary characters), so I thought perhaps the `.' was not matching the
> NULL character.
As above, shouldn't matter.
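A quick check of that, as a sketch with a made-up 8-character string: even without DOTALL, "." matches a NUL like any other character. The only thing it refuses is a newline, and DOTALL removes that exception.

```python
import re

s = "ab\x00cd\nef"    # made-up sample: a NUL at index 2, a newline at index 5

# No flags: "." accepts the NUL but stops in front of the newline.
plain = re.match(r".*", s)
print(plain.span())

# DOTALL: "." accepts the newline too, so .* runs to the end of the string.
dotall = re.match(r".*", s, re.DOTALL)
print(dotall.span())
```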
> So I changed the expression above to '<<((.|\000)*)>>'. My
> understanding is that this should match either the normal dot regular
> expression, or a literal zero (NULL) character,
DOTALL does the same but quicker.
> and this pattern would then be matched zero or more times. However,
> the behavior is basically the same.
More evidence that you're not in the right ballpark yet.
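To make the DOTALL-versus-alternation point concrete, here's a sketch on a short cousin of the test string above. The variable names are just for illustration. With DOTALL in effect, the '(.|\000)' alternation adds nothing, since "." already matches NUL. Assuming the flag were dropped instead, the alternation would still choke on the newlines.

```python
import re

data = "<<" + "\x00\n>>" * 3 + ">>"    # 16 chars: NULs, newlines, early ">>" hits

dot = re.compile(r"<<(.*)>>", re.DOTALL)
alt_flag = re.compile(r"<<((.|\000)*)>>", re.DOTALL)
alt_bare = re.compile(r"<<((.|\000)*)>>")    # no DOTALL: "." rejects "\n"

dot_span = dot.search(data).span(1)
alt_span = alt_flag.search(data).span(1)
bare = alt_bare.search(data)    # can't get past the first newline

print("DOTALL alone:         %d:%d" % dot_span)
print("alternation + DOTALL: %d:%d" % alt_span)
print("alternation, no flag: %s" % (bare,))
```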
> Anyone have any idea what I should do?
Post a failing test case, and the cause will be obvious to someone.
it's-always-the-last-place-you-look-ly y'rs - tim