Newbie needs regex help

Mon Dec 6 10:29:41 EST 2010

Dan M wrote:

> I'm getting bogged down with backslash escaping.
> 
> I have some text files containing characters with the 8th bit set. These
> characters are encoded one of two ways: either "=hh" or "\xhh", where "h"
> represents a hex digit, and "\x" is a literal backslash followed by a
> lower-case x.
> 
> Catching the first case with a regex is simple. But when I try to write a
> regex to catch the second case, I mess up the escaping.
> 
> I took a look at http://docs.python.org/howto/regex.html, especially the
> section titled "The Backslash Plague". I started out trying:
> 
> dan@dan:~/personal/usenet$ python
> Python 2.7 (r27:82500, Nov 15 2010, 12:10:23)
> [GCC 4.3.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> r = re.compile('\\\\x([0-9a-fA-F]{2})')
> >>> a = "This \xef file \xef has \x20 a bunch \xa0 of \xb0 crap \xc0 characters \xefn \xeft."
> >>> m = r.search(a)
> >>> m
> 
> No match.
> 
> I then followed the advice of the above-mentioned document, and expressed
> the regex as a raw string:
> 
> >>> r = re.compile(r'\\x([0-9a-fA-F]{2})')
> >>> r.search(a)
> 
> Still no match.
> 
> I'm obviously missing something. I spent a fair bit of time playing with
> this over the weekend, and I got nowhere. Now it's time to ask for help.
> What am I doing wrong here?

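What you're missing is that the string `a` doesn't actually contain
four-character sequences like '\', 'x', 'a', 'a'.  Python processes the
escapes when it reads the string literal, so `a` contains single characters
that you merely write in string literals as '\xaa' and so on.  Your regex
looks for a literal backslash, and there isn't one in `a`.  You might do
better with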

p1 = r'([\x80-\xff])'
r1 = re.compile(p1)
m = r1.search(a)

With this I at least get a match object,
<_sre.SRE_Match object at 0xb749a6e0>, when I try it.
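
To make the difference concrete, here is a minimal sketch (not taken from
the original session) that compares a string whose escapes Python has
already processed with one that really holds a backslash, an "x" and two
hex digits, as text read from your files presumably would.  The names
`cooked`, `raw`, `escaped_form`, `high_bit` and `both_forms` are made up for
illustration, and the combined pattern for the "=hh" and "\xhh" forms is
only one possible way to cover both cases.

```python
import re

# Escapes processed by Python: each "\xef" becomes ONE character (code 0xEF).
cooked = "This \xef file \xa0 has high-bit characters"
# Raw string: backslash, 'x' and two hex digits stay as FOUR characters,
# which is what a literal "\xef" read from a text file looks like.
raw = r"This \xef file \xa0 has escaped characters"

print(len("\xef"))     # 1
print(len(r"\xef"))    # 4

# Pattern from the original post: literal backslash, 'x', two hex digits.
escaped_form = re.compile(r'\\x([0-9a-fA-F]{2})')
# Pattern from the reply: any single character in the range 0x80-0xff.
high_bit = re.compile(r'[\x80-\xff]')

print(escaped_form.search(cooked))      # None: no backslash in the string
print(escaped_form.findall(raw))        # ['ef', 'a0']
print(len(high_bit.findall(cooked)))    # 2
print(high_bit.search(raw))             # None: only ASCII characters here

# Hypothetical combined pattern covering both the "=hh" and "\xhh" forms.
both_forms = re.compile(r'(?:=|\\x)([0-9a-fA-F]{2})')
print(both_forms.findall(r"caf=E9 and na\xefve"))    # ['E9', 'ef']
```

So the regex from the original post is fine for text that really contains
the backslash sequences; it only fails on `a` because the literal has
already turned them into single characters.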

	Mel.