Weird problem matching with REs

Sun May 29 09:41:16 EDT 2011

On 2011.05.29 08:09 AM, Steven D'Aprano wrote:
> On Sun, 29 May 2011 06:45:30 -0500, Andrew Berg wrote:
>
> > I have an RE that should work (it even works in Kodos [1], but not in my
> > code), but it keeps failing to match characters after a newline.
>
> Not all regexes are the same. Different regex engines accept different 
> symbols, and sometimes behave differently, or have different default 
> behavior. That your regex works in Kodos but not Python might mean you're 
> writing a Kodus regex instead of a Python regex.
Kodos is written in Python and uses Python's regex engine. In fact, it
is specifically intended to debug Python regexes.
> Firstly, most of the code you show is irrelevant to the problem. Please 
> simplify it to the shortest, most simple example you can give. That would 
> be a simplified piece of text (not the entire web page!), the regex, and 
> the failed attempt to use it. The rest of your code is just noise for the 
> purposes of solving this problem.
I wasn't sure how much would be relevant since it could've been a
problem with other code. I do apologize for not putting more effort into
trimming it down, though.
> Secondly, you probably should use a proper HTML parser, rather than a 
> regex. Resist the temptation to use regexes to rip out bits of text from 
> HTML, it almost always goes wrong eventually.
I find this a much simpler approach, especially since I'm dealing with
broken HTML. I guess I don't see how the effort put into learning a
parser and adding the extra code to use it pays off in this particular
endeavor.
> > I was able to make a regex that matches in my code, but it shouldn't:
> > http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/
> x264.\n{1,3}.\n{1,3}.exe
>
> What makes you think it shouldn't match?
AFAIK, dots aren't supposed to match carriage returns or any other
whitespace characters.
> By the way, you probably should escape the dots, otherwise it will match 
> strings containing any arbitrary character, rather than *just* dots:
You're right; I overlooked the dots in the URL.