[Spambayes] Re: [Spambayes-checkins] spambayes/spambayes message.py, 1.12, 1.13

Sun Apr 20 15:34:55 EDT 2003

4/20/2003 7:57:27 AM, Skip Montanaro <skip at pobox.com> wrote:

>
>    Tim> +         #XXX Tim doesn't like this regex.  He will study it to
>    Tim> +         #XXX see if he can learn to like it, or come up with
>    Tim> +         #XXX something he likes better.  - Tim
>    Tim>           return re.sub(r'(?:\r\n|\n|\r(?!\n))', "\r\n", data)
>
>I don't think the negative lookahead assertion is necessary.  By default,
>regular expressions are greedy (match the largest possible string), so
>
>    return re.sub(r"\r\n|\n|\r", "\r\n", data)
>
>should work.  Any time \r\n appears, it will be preferred over either \r or
>\n, so if the \r branch matches, that implies it wasn't followed by \n.  If
>I've been laboring under a misconception for all these years, I'd be happy
>to be corrected.

I took a hard look at this one too.  My intuition told me that something was
wrong with it, but I couldn't come up with a version that used only two
alternatives (three is worse than two...) and still worked in all cases.
The negative lookahead was especially distressing ;)  I worked for a while on
a two-alternative version built on negated character classes, but didn't have
much luck: each attempt had to capture the negated character so the
replacement could put it back, which is even more expensive than the negative
lookahead.  Yours should work better than what's there right now, and I'm
also considering this one: '\r\n?|\n'

This is possibly a major performance consideration, so it really behooves us 
to take a good hard look at this regex (and the issue of why it's there in the 
first place).
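
One way to put numbers on the performance question is a small timeit
harness like the sketch below.  The PATTERNS labels, the time_patterns()
helper and the sample text are assumptions made up for this example.  I have
deliberately not included any timings, since they will vary by machine,
Python version and the mix of line endings in real messages.

```python
import re
import timeit

# Precompiled versions of the three candidates; the labels are illustrative.
PATTERNS = {
    "lookahead": re.compile(r'(?:\r\n|\n|\r(?!\n))'),
    "three-way": re.compile(r'\r\n|\n|\r'),
    "two-way": re.compile(r'\r\n?|\n'),
}

def time_patterns(data, number=100):
    """Return the seconds each pattern takes for `number` substitutions."""
    return {
        name: timeit.timeit(lambda p=pat: p.sub("\r\n", data), number=number)
        for name, pat in PATTERNS.items()
    }

if __name__ == "__main__":
    # Hypothetical message body mixing all three line-ending styles.
    sample = "Subject: test\r\nline one\nline two\rline three\r\n" * 1000
    for name, seconds in time_patterns(sample).items():
        print(f"{name:10s} {seconds:.4f}s")
```

Running it on a few real messages from a training corpus would give a better
picture than the synthetic sample used here.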
>
>Skip
>
>

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.