Newbie needs regex help

Dan M dan at catfolks.net
Mon Dec 6 10:03:48 EST 2010


I'm getting bogged down with backslash escaping.

I have some text files containing characters with the 8th bit set. These 
characters are encoded one of two ways: either "=hh" or "\xhh", where "h" 
represents a hex digit, and "\x" is a literal backslash followed by a 
lower-case x.

Catching the first case with a regex is simple. But when I try to write a 
regex to catch the second case, I mess up the escaping.
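
For reference, a minimal sketch of the first case. The sample line is invented for illustration, and the name equals_hex is hypothetical; the real files may look different.

```python
import re

# Made-up sample text; the real files contain "=hh" sequences like these.
sample = "caf=E9 and na=EFve"

# "=" followed by exactly two hex digits; the digits are captured.
equals_hex = re.compile(r'=([0-9a-fA-F]{2})')

hits = equals_hex.findall(sample)
print(hits)
```

No backslash is involved here, so there is nothing to escape.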

I took a look at http://docs.python.org/howto/regex.html, especially the 
section titled "The Backslash Plague". I started out trying:

dan at dan:~/personal/usenet$ python
Python 2.7 (r27:82500, Nov 15 2010, 12:10:23) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> r = re.compile('\\\\x([0-9a-fA-F]{2})')
>>> a = "This \xef file \xef has \x20 a bunch \xa0 of \xb0 crap \xc0 characters \xefn \xeft."
>>> m = r.search(a)
>>> m

No match.

I then followed the advice of the above-mentioned document, and expressed 
the regex as a raw string:

>>> r = re.compile(r'\\x([0-9a-fA-F]{2})')
>>> r.search(a)

Still no match.

I'm obviously missing something. I spent a fair bit of time playing with 
this over the weekend, and I got nowhere. Now it's time to ask for help. 
What am I doing wrong here?
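
One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: the two patterns above are the same string, so the difference may lie in the test data. In a non-raw literal, Python converts "\xef" into a single character before the regex ever sees it, so the test string may contain no backslash at all. The names cooked and literal below are illustrative only, and the raw literal assumes the files contain a backslash followed by "x" and two hex digits, as described above.

```python
import re

# The two ways of writing the pattern produce the identical string.
same_pattern = ('\\\\x([0-9a-fA-F]{2})' == r'\\x([0-9a-fA-F]{2})')

pattern = re.compile(r'\\x([0-9a-fA-F]{2})')

# Non-raw literal: \xef becomes one character, with no backslash left.
cooked = "This \xef file"
# Raw literal: backslash, 'x', 'e', 'f' stay as four separate characters,
# which is what reading the text \xef from a file would give.
literal = r"This \xef file"

cooked_match = pattern.search(cooked)
literal_match = pattern.search(literal)

print(same_pattern)
print(len(cooked), len(literal))
print(cooked_match)
print(literal_match.group(1))
```

If the lengths differ as expected, testing against a line actually read from one of the files (or a raw string literal) should show whether the regex itself is fine.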



