[Spambayes] More HTML strippage.

Thu, 26 Sep 2002 20:02:26 -0400

[Anthony Baxter]
> I'm seeing email with stuff like:
>
>     <TD
>     style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt;
>     PADDING-BOTTOM: 0.75pt; WIDTH: 15.75pt; PADDING-TOP: 0.75pt"
>     width=21 rowSpan=4>
>
> in it - the current HTML stripping RE doesn't pick this up,
> as it's limited to 128 characters,

That's roughly true.

> and doesn't span lines.

That's not true.  The character class

    [^>]{0,128}

matches any character except a right angle bracket, newlines included.  I've
also gotten burned by long HTML comments, stuffed with style-sheet-like
directives specific to the HTML editor the poster was using.
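
A quick way to check both halves of that, as a minimal sketch: the names
tag_re, multi_line and long_tag are made up for illustration, and the sample
strings are invented rather than taken from real mail.

```python
import re

# Same shape as the pattern under discussion: '<', then at most 128
# characters that are not '>', then '>'.
tag_re = re.compile(r"<[^>]{0,128}>")

# A tag split over three lines: the negated class crosses the newlines.
multi_line = 'before <TD\nstyle="WIDTH: 15pt"\nwidth=21> after'
match = tag_re.search(multi_line)
assert match is not None
assert "\n" in match.group()

# A tag whose body exceeds the 128-character bound is not matched at all.
long_tag = "<TD " + "x" * 200 + ">"
assert tag_re.search(long_tag) is None
```

So the tag quoted above is missed because of its length, not because it spans
lines.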

> I propose we change it from
> ...
> to
>
>     html_re = re.compile(r"""
>         <
>         [^\s<>]      # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
>         [^>]{0,256}? # search for the next '>', but with non-greedy RE
>         >
>     """, re.VERBOSE|re.DOTALL)

The DOTALL doesn't accomplish anything there (simple proof:  there isn't a
dot in the regexp <wink>).
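
For anyone who wants more than the simple proof, here is a small sketch that
compiles the proposed pattern with and without DOTALL and compares the results.
The names PATTERN, with_dotall, without_dotall and samples are hypothetical,
and the sample strings are invented.

```python
import re

# The proposed pattern, compiled twice.  DOTALL only changes what '.'
# matches, and the pattern contains no '.'.
PATTERN = r"""
    <
    [^\s<>]       # e.g., don't match 'a < b' or '<<<' or 'a<>b'
    [^>]{0,256}?  # up to 256 characters that are not '>'
    >
"""
with_dotall = re.compile(PATTERN, re.VERBOSE | re.DOTALL)
without_dotall = re.compile(PATTERN, re.VERBOSE)

samples = [
    "<b>bold</b>",
    "x <TD\nwidth=21> y",   # tag spanning a newline
    "a < b",
    "a<>b",
]
for s in samples:
    # Identical results either way, newlines or not.
    assert with_dotall.findall(s) == without_dotall.findall(s)
```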

> Will this cause people pain? Is there a better way?

I've got nothing against 256.  There are three things to watch out for:

1. The re engine is prone to blowing the stack if you do minimal
   matches instead, such as you did in the thing you actually
   checked in.

2. Since this gimmick is applied to *all* text, there's a danger of
   consuming legitimate msg text by accident.  That's the primary
   reason I always put bounds on how much text one of these will
   chew up.

3. Multiple passes are expensive; it's better to fold all these
   HTML-ish thingies into a single regexp with alternatives.
   I'll do that.
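
To make points 2 and 3 concrete, here is a rough sketch of what a single
bounded regexp with alternatives could look like.  It is not the regexp that
was checked in: the name htmlish_re, the choice of alternatives (comments and
ordinary tags), and the bounds 512 and 256 are all assumptions made for
illustration.  It also sticks to greedy, bounded repeats, in keeping with
point 1.

```python
import re

# Hypothetical single-pass pattern; alternatives and bounds are
# illustrative assumptions only.
htmlish_re = re.compile(r"""
    <
    (?:
        !--                      # HTML comment opener
        (?:[^-]|-(?!->)){0,512}  # bounded body; may contain '>' but not '-->'
        --
    |
        [^\s<>]                  # don't match 'a < b' or '<<<' or 'a<>b'
        [^>]{0,256}              # bounded run up to the closing '>'
    )
    >
""", re.VERBOSE)

text = "Buy <b>now</b> <!-- a > b --> if 1 < 2 and x<>y"
stripped = htmlish_re.sub(" ", text)

# The bound keeps a stray '<' from eating a long stretch of real text.
runaway = "if x <y then " + "blah " * 100 + "> done"
untouched = htmlish_re.sub(" ", runaway)
assert untouched == runaway
```

One pass handles both kinds of markup, and plain text such as '1 < 2' or
'x<>y' is left alone.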

An alternative is to use a real HTML parser; I'm not going to make time for
that, though.