[spambayes-dev] fp: innocuous text in hidden html <input>

Fri Oct 3 10:16:43 EDT 2003

    Doug> I have to admit this is a clever spam technique -- I've taken a
    Doug> quick look in the archives and read through tokenizer.py and seen
    Doug> nothing about it. The trick is that the message has a number of
    Doug> <input> elements with type=hidden and value=something very hammy.

I believe the tokenizer strips out all HTML tags, or at least makes a good
effort to do so.  It uses a fancy-schmancy regular expression Tim Peters
wrote to make it fast, but I believe that expression also caps how long it
assumes an HTML tag can be:

    # Cheap-ass gimmick to probabilistically find HTML/XML tags.
    # Note that <style and HTML comments are handled by crack_html_style()
    # and crack_html_comment() instead -- they can be very long, and long
    # minimal matches have a nasty habit of blowing the C stack.
    html_re = re.compile(r"""
        <
        (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
        # guessing that other tags are usually "short"
        [^>]{0,256} # search for the end '>', but don't run wild
        >
    """, re.VERBOSE | re.DOTALL)

It's not completely obvious, but it appears the <input> tag in your message
contains over 300 characters.  That is more than the 256 the above regular
expression allows between '<' and '>', so the tag would be missed and its
contents left in the text.  I don't know if it's time to try something
different, boost the above 256 to something larger, or do nothing and rely
on more training to squash that bug.
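
A quick way to see the effect is to run the quoted pattern against a short
tag and a long one.  This is only a sketch: the make_hidden_input() helper
and the filler value are made up for illustration and are not taken from
the actual message or from tokenizer.py.

```python
import re

# Same pattern as quoted above.
html_re = re.compile(r"""
    <
    (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
    # guessing that other tags are usually "short"
    [^>]{0,256} # search for the end '>', but don't run wild
    >
""", re.VERBOSE | re.DOTALL)


def make_hidden_input(value_len):
    # Hypothetical helper: build a hidden <input> whose value is
    # value_len characters of filler standing in for hammy text.
    return '<input type="hidden" value="%s">' % ("x" * value_len)


short_tag = make_hidden_input(50)    # well under the 256 limit
long_tag = make_hidden_input(300)    # over the limit, like the tag described

short_stripped = html_re.sub(" ", "before " + short_tag + " after")
long_stripped = html_re.sub(" ", "before " + long_tag + " after")

print(short_tag in short_stripped)   # False: the short tag was removed
print(long_tag in long_stripped)     # True: the long tag was left alone

# The cutoff is on the characters between '<' and '>'.
at_limit = html_re.search("<" + "a" * 256 + ">")
over_limit = html_re.search("<" + "a" * 257 + ">")
```

Here at_limit is a match object and over_limit is None, so any tag with
more than 256 characters between the angle brackets passes through
untouched.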

Do you still have that message so you can post it as an attachment in its
entirety?

Thx,

Skip