[spambayes-dev] fp: innocuous text in hidden html <input>

Fri Oct 3 23:01:00 EDT 2003

[Skip]
> I believe the tokenizer strips out all HTML tags, at least it makes a
> good effort to do so.

That's right, but note that we have no idea whether we're parsing HTML or
XML or plain text or whatever.  Spam is too-often too ill-formed to rely on
what standards say, and user email clients are extremely forgiving about
violations.

> It uses a fancy-schmancy regular expression Tim Peters wrote to make
> it fast, but I believe it's also limited in what it believes the
> maximum length of an HTML tag can be:

Also right.

>     # Cheap-ass gimmick to probabilistically find HTML/XML tags.
>     # Note that <style and HTML comments are handled by
>     # crack_html_style() and crack_html_comment() instead -- they can
>     # be very long, and long minimal matches have a nasty habit of
>     # blowing the C stack.
> html_re = re.compile(r"""
>      <
>      (?![\s<>])  # ...don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
>      # guessing that other tags are usually "short"
>      [^>]{0,256} # search for the end '>', but don't run wild
>      >
>     """, re.VERBOSE | re.DOTALL)
>
> It's not completely obvious, but it appears the <input> tag in your
> message contains over 300 characters, so it would be missed by the
> above regular expression.

Right.

> I don't know if it's time to try something different, boost the above
> 256 to something larger, or do nothing and rely on more training to
> squash that bug.

Despite what the comment says, that regexp no longer uses a minimal matching
operator, and [^>]{0,256} won't blow the C stack no matter how large the
upper bound is made.  So we could boost the 256, but at the risk of throwing
away ordinary message text that just happens to contain stuff matching that
regexp:

    Suppose a<b.   It follows that ...
    ... 50 lines elided ...
    Now suppose a>b. ...

That example doesn't match today, but only because of the 256-character
limit.
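
To make the tradeoff concrete, here is a rough sketch that rebuilds the quoted regexp with an adjustable upper bound and runs both versions over a message shaped like the example above.  The helper name make_tag_re, the filler lines standing in for the elided text, and the 8192 bound are assumptions made up for illustration, not anything in the tokenizer:

```python
import re

def make_tag_re(max_len):
    # Same pattern as the quoted html_re, with the upper bound as a parameter.
    return re.compile(r"""
        <
        (?![\s<>])   # don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
        [^>]{0,%d}   # search for the end '>', up to max_len characters away
        >
    """ % max_len, re.VERBOSE | re.DOTALL)

# Ordinary prose with a '<' and a much later '>' (filler text is made up).
text = (
    "Suppose a<b.   It follows that ...\n"
    + "".join("ordinary prose line %d of the elided part\n" % i
              for i in range(50))
    + "Now suppose a>b. ...\n"
)

short_re = make_tag_re(256)    # the bound used in the quoted code
long_re = make_tag_re(8192)    # hypothetical boosted bound

print(short_re.search(text) is None)       # '>' is too far away to match
print(long_re.search(text) is not None)    # now the whole span looks like a tag
stripped = long_re.sub(" ", text)
print("ordinary prose" in stripped)        # the filler lines were thrown away
```

With the larger bound, everything between the '<' and the '>' gets treated as one tag and is stripped, which is exactly the text we'd rather keep.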

It would be easy to add an input-tag stripper similar to the HTML comment
stripper, though.  If we add enough of those it would be faster to do a real
parse.
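
A minimal sketch of what such an input-tag stripper might look like, assuming it only has to find "<input" (in any case) and drop everything through the next '>'.  The names input_tag_re and strip_input_tags are hypothetical, and this is not how the existing comment stripper is written:

```python
import re

# Anchoring on the literal "<input" lets the tag be arbitrarily long without
# the risk the general tag regexp runs: plain prose rarely contains "<input".
input_tag_re = re.compile(r"<input\b[^>]*>", re.IGNORECASE)

def strip_input_tags(text):
    """Replace each <input ...> tag, however long, with a single space."""
    return input_tag_re.sub(" ", text)

# A hidden input whose value is well past the 256-character limit.
html = ('Buy now <INPUT type="hidden" value="' + "x" * 400 + '"> today')
print(strip_input_tags(html).split())
```

A '>' inside a quoted attribute value would end the match early, so this is only as robust as the rest of the cheap tag matching.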

More training probably won't help except to nail duplicates of the original
spam.  Despite what Doug thinks <wink>, the hidden value isn't "hammy" --
it's just crap, and shouldn't be able to do worse than knock a spam down
into the Unsure range for some people some of the time, and into the Ham
range for few people rarely.