Newbie needs regex help
Mel
mwilson at the-wire.com
Mon Dec 6 10:29:41 EST 2010
Dan M wrote:
> I'm getting bogged down with backslash escaping.
>
> I have some text files containing characters with the 8th bit set. These
> characters are encoded one of two ways: either "=hh" or "\xhh", where "h"
> represents a hex digit, and "\x" is a literal backslash followed by a
> lower-case x.
>
> Catching the first case with a regex is simple. But when I try to write a
> regex to catch the second case, I mess up the escaping.
>
> I took a look at http://docs.python.org/howto/regex.html, especially the
> section titled "The Backslash Plague". I started out trying :
>
> dan at dan:~/personal/usenet$ python
> Python 2.7 (r27:82500, Nov 15 2010, 12:10:23)
> [GCC 4.3.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import re
>>>> r = re.compile('\\\\x([0-9a-fA-F]{2})')
>>>> a = "This \xef file \xef has \x20 a bunch \xa0 of \xb0 crap \xc0
> characters \xefn \xeft."
>>>> m = r.search(a)
>>>> m
>
> No match.
>
> I then followed the advice of the above-mentioned document, and expressed
> the regex as a raw string:
>
>>>> r = re.compile(r'\\x([0-9a-fA-F]{2})')
>>>> r.search(a)
>
> Still no match.
>
> I'm obviously missing something. I spent a fair bit of time playing with
> this over the weekend, and I got nowhere. Now it's time to ask for help.
> What am I doing wrong here?
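What you're missing is that string `a` doesn't actually contain four-character
sequences like '\', 'x', 'a', 'a'. Python decodes each '\xhh' escape when it
reads the string literal, so `a` contains single characters that you merely
write as '\xaa' and so on. Your regex looks for a literal backslash, and there
isn't one in `a`. You might do better with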
    p1 = r'([\x80-\xff])'
    r1 = re.compile(p1)
    m = r1.search(a)
I get at least an <_sre.SRE_Match object at 0xb749a6e0> when I try this.
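
To make the difference concrete, here is a small sketch that puts the two kinds of string side by side. The names `decoded`, `literal`, `escaped_form` and `high_char` are made up for illustration, and I'm assuming that text read from your files looks like the raw literal, that is, it really contains a backslash followed by 'x' and two hex digits.

```python
import re

# One character per escape: Python decodes "\xef" when it parses the literal.
decoded = "This \xef file"

# Four characters per escape: a raw literal keeps the backslash, 'x', 'e', 'f'
# exactly as typed (assumed here to resemble text read from the files).
literal = r"This \xef file"

# Matches a literal backslash, 'x', and two hex digits.
escaped_form = re.compile(r'\\x([0-9a-fA-F]{2})')

# Matches a single character in the range 0x80-0xff.
high_char = re.compile(r'[\x80-\xff]')

print(len("\xef"))    # length of the decoded escape
print(len(r"\xef"))   # length of the raw, undecoded text

# Each pattern finds something only in the kind of string it was written for.
print(escaped_form.search(decoded))
print(escaped_form.search(literal).group(1))
print(high_char.search(decoded) is not None)
print(high_char.search(literal))
```

If the data really does come from a file with literal backslashes in it, your original raw-string pattern should work on it unchanged. It only failed here because the test string was typed as an ordinary Python literal.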
Mel.