Weird problem matching with REs
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun May 29 10:18:16 EDT 2011
On Sun, 29 May 2011 08:41:16 -0500, Andrew Berg wrote:
> On 2011.05.29 08:09 AM, Steven D'Aprano wrote:
[...]
> Kodos is written in Python and uses Python's regex engine. In fact, it
> is specifically intended to debug Python regexes.
Fair enough.
>> Secondly, you probably should use a proper HTML parser, rather than a
>> regex. Resist the temptation to use regexes to rip out bits of text
>> from HTML, it almost always goes wrong eventually.
>
> I find this a much simpler approach, especially since I'm dealing with
> broken HTML. I guess I don't see how the effort put into learning a
> parser and adding the extra code to use it pays off in this particular
> endeavor.
The temptation to take short-cuts leads to the Dark Side :)
Perhaps you're right, in this instance. But if you need to deal with
broken HTML, try BeautifulSoup.
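
For comparison, here is a minimal sketch of the parser approach using only the standard library's html.parser (Python 3 naming) as a stand-in, since BeautifulSoup is a separate install. The LinkCollector class and the sample markup are hypothetical, made up purely for illustration; BeautifulSoup's own API looks different and is more forgiving with badly broken input.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


# Deliberately sloppy markup: unclosed <p> and <b>, and a missing </a>.
broken_html = ('<p>See <a href="http://example.com/one">one'
               '<p><b>and <a href="/two">two</a>')

collector = LinkCollector()
collector.feed(broken_html)
collector.close()
print(collector.links)  # ['http://example.com/one', '/two']
```

The parser reports each start tag as it meets it, so the unclosed elements do not stop the links from being found.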
>> What makes you think it shouldn't match?
>
> AFAIK, dots aren't supposed to match carriage returns or any other
> whitespace characters.
They won't match *newlines* (\n) unless you pass the DOTALL flag, but they
do match other whitespace, including spaces and carriage returns:
>>> import re
>>> re.search('abc.efg', '----abc efg----').group()
'abc efg'
>>> re.search('abc.efg', '----abc\refg----').group()
'abc\refg'
>>> re.search('abc.efg', '----abc\nefg----') is None
True
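
As a small follow-up sketch (the sample strings are just the ones used above, and the variable names are mine), this shows the same pattern with the DOTALL flag, and one possible way to insist on a non-whitespace character by using \S in place of the dot:

```python
import re

text = '----abc\nefg----'

# Without DOTALL, '.' refuses to match the newline.
assert re.search('abc.efg', text) is None

# With DOTALL, '.' matches any character, newline included.
dotall_match = re.search('abc.efg', text, re.DOTALL).group()
print(repr(dotall_match))  # 'abc\nefg'

# If the intent is "any single non-whitespace character", \S says so
# explicitly; it rejects spaces, carriage returns and newlines alike.
no_ws = re.search(r'abc\Sefg', '----abc efg----')
print(no_ws)  # None
```

Whether \S is the right replacement depends on what the original pattern was meant to skip over; it is only one option.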
--
Steven