Weird problem matching with REs

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun May 29 10:18:16 EDT 2011


On Sun, 29 May 2011 08:41:16 -0500, Andrew Berg wrote:

> On 2011.05.29 08:09 AM, Steven D'Aprano wrote:
[...]
> Kodos is written in Python and uses Python's regex engine. In fact, it
> is specifically intended to debug Python regexes.

Fair enough.

>> Secondly, you probably should use a proper HTML parser, rather than a
>> regex. Resist the temptation to use regexes to rip out bits of text
>> from HTML, it almost always goes wrong eventually.
>
> I find this a much simpler approach, especially since I'm dealing with
> broken HTML. I guess I don't see how the effort put into learning a
> parser and adding the extra code to use it pays off in this particular
> endeavor.

The temptation to take short-cuts leads to the Dark Side :)

Perhaps you're right, in this instance. But if you need to deal with
broken HTML, try BeautifulSoup, which is designed to cope with malformed
markup.
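
To give a feel for how little code a parser-based approach can need, here
is a minimal sketch. Note the assumptions: it uses the standard library's
html.parser module (Python 3) rather than BeautifulSoup, so that it runs
without installing anything, and the class name LinkCollector and the
sample markup are made up for illustration. It is not a drop-in solution
for your particular pages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


# Deliberately broken sample: unclosed <a> tags, one unquoted attribute.
broken_html = '<p><a href="a.html">one<a href=b.html>two</p>'

collector = LinkCollector()
collector.feed(broken_html)
collector.close()
links = collector.links
print(links)
```

The parser never complains about the missing </a> tags; it just reports
each start tag as it meets it, which is often all you need.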


>> What makes you think it shouldn't match?
> 
> AFAIK, dots aren't supposed to match carriage returns or any other
> whitespace characters.

They won't match *newlines* (\n) unless you pass the DOTALL flag, but they
do match other whitespace, including spaces and carriage returns:

>>> import re
>>> re.search('abc.efg', '----abc efg----').group()
'abc efg'
>>> re.search('abc.efg', '----abc\refg----').group()
'abc\refg'
>>> re.search('abc.efg', '----abc\nefg----') is None
True
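
For completeness, here is a small sketch of the DOTALL case mentioned
above. The sample string and variable names are only for illustration;
re.DOTALL and the inline (?s) form are the standard ways to switch the
behaviour on.

```python
import re

text = '----abc\nefg----'

# Default: '.' matches any character except a newline, so no match here.
default_match = re.search('abc.efg', text)

# With re.DOTALL, '.' matches the newline as well.
dotall_match = re.search('abc.efg', text, re.DOTALL)

# The inline flag (?s) at the start of the pattern does the same thing.
inline_match = re.search('(?s)abc.efg', text)

print(default_match is None)
print(repr(dotall_match.group()))
print(repr(inline_match.group()))
```

So if your dots seem to be matching "too much", check both the input for
stray \r characters and the flags you are passing.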


-- 
Steven


