Re: [Spambayes-checkins] spambayes/spambayes message.py, 1.12, 1.13
Tim Stone - Four Stones Expressions
tim at fourstonesExpressions.com
Sun Apr 20 15:34:55 EDT 2003
4/20/2003 7:57:27 AM, Skip Montanaro <skip at pobox.com> wrote:
> Tim> + #XXX Tim doesn't like this regex. He will study it to
> Tim> + #XXX see if he can learn to like it, or come up with
> Tim> + #XXX something he likes better. - Tim
> Tim> return re.sub(r'(?:\r\n|\n|\r(?!\n))', "\r\n", data)
>I don't think the negative lookahead assertion is necessary. By default,
>regular expressions are greedy (match the largest possible string), so
> return re.sub(r"\r\n|\n|\r", "\r\n", data)
>should work. Any time \r\n appears, it will be preferred over either \r or
>\n, so if the \r branch matches, that implies it wasn't followed by \n. If
>I've been laboring under a misconception for all these years, I'd be happy
>to be corrected.
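
A minimal sketch of the point above, assuming Python's standard re module. Alternation there is ordered: the alternatives are tried left to right and the first one that matches at a position wins, regardless of length. So the three-way pattern works because \r\n is listed before \r, not because the longest alternative is preferred. The helper name normalize and the sample string are made up for illustration.

```python
import re

def normalize(pattern, data):
    """Replace every match of `pattern` in `data` with CRLF."""
    return re.sub(pattern, "\r\n", data)

# One line ending of each kind: CRLF, bare LF, bare CR.
sample = "a\r\nb\nc\rd"

# \r\n is tried first, so a CRLF pair is consumed as one unit.
ordered = normalize(r"\r\n|\n|\r", sample)

# Same alternatives with \r first: the CR of a CRLF pair matches on its
# own, then the LF matches separately, so the line ending is doubled.
reordered = normalize(r"\r|\n|\r\n", sample)

print(repr(ordered))
print(repr(reordered))
```

With \r\n listed first the lookahead on the \r branch is redundant, since a \r followed by \n has already been consumed by the first alternative.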
I took a hard look at this one too. My intuition told me that there was
something wrong with it, but I couldn't come up with an alternative that only
used two alternations (three is worse than two...) that worked in all cases.
The negative lookahead was especially distressing ;) I worked for a while on
one that would only require two alternations using negated character classes,
but didn't have much luck because they required a capture of the negated
character for the replacement, which is even more expensive than the negative
lookahead. Yours should work better than what's there right now, and I'm
thinking about this one as well: '\r\n?|\n'
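
A rough way to check that the three candidates discussed here behave the same, assuming Python's standard re module. The dictionary keys, the helper names (normalize_all, patterns_agree) and the sample inputs are invented for this sketch; agreement on a handful of samples is evidence, not proof, and it says nothing about relative speed.

```python
import re

# The three candidate patterns from this thread.
PATTERNS = {
    "lookahead": r"(?:\r\n|\n|\r(?!\n))",  # currently in message.py
    "three-way": r"\r\n|\n|\r",            # Skip's suggestion
    "two-way": r"\r\n?|\n",                # the two-alternation variant
}

# Illustrative inputs, including adjacent CR/LF edge cases.
SAMPLES = [
    "",
    "plain",
    "a\r\nb",
    "a\nb",
    "a\rb",
    "a\r\r\nb",
    "a\n\rb",
    "mixed\r\nline\nends\r",
]

def normalize_all(data):
    """Return the CRLF-normalized text produced by each pattern."""
    return {name: re.sub(pat, "\r\n", data) for name, pat in PATTERNS.items()}

def patterns_agree(samples=SAMPLES):
    """True if every pattern yields the same output on every sample."""
    return all(len(set(normalize_all(s).values())) == 1 for s in samples)

print(patterns_agree())
```

In the two-way pattern the greedy \n? takes the LF when one follows the CR, which is what lets a single alternative cover both \r\n and a bare \r.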
This is possibly a major performance consideration, so it really behooves us
to take a good hard look at this regex (and the issue of why it's there in the
first place).
c'est moi - TimS
There are 10 kinds of people in the world:
those who understand binary,
and those who don't.