Doug Wyatt doug at sonosphere.com
Fri Oct 3 10:40:09 EDT 2003

On Oct 3, 2003, at 10:16, Skip Montanaro wrote:
>     Doug> I have to admit this is a clever spam technique -- I've taken a
>     Doug> quick look in the archives and read through tokenizer.py and seen
>     Doug> nothing about it. The trick is that the message has a number of
>     Doug> <input> elements with type=hidden and value=something very hammy.
>
> I believe the tokenizer strips out all HTML tags, at least it makes a good
> effort to do so.  It uses a fancy-schmancy regular expression Tim Peters
> wrote to make it fast, but I believe it's also limited in what it believes
> the maximum length of an HTML tag can be:
>
>     # Cheap-ass gimmick to probabilistically find HTML/XML tags.
>     # Note that <style and HTML comments are handled by crack_html_style()
>     # and crack_html_comment() instead -- they can be very long, and long
>     # minimal matches have a nasty habit of blowing the C stack.
>     html_re = re.compile(r"""
>         <
>         (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
>         # guessing that other tags are usually "short"
>         [^>]{0,256} # search for the end '>', but don't run wild
>         >
>     """, re.VERBOSE | re.DOTALL)
>
> It's not completely obvious, but it appears the <input> tag in your message
> contains over 300 characters, so it would be missed by the above regular
> expression.  I don't know if it's time to try something different, boost the
> above 256 to something larger, or do nothing and rely on more training to
> squash that bug.

Thanks, Skip, I see now ...

Maybe the thing to do would be to continue using the cheap gimmick 
(it's efficient), then make another pass looking for longer elements?
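
A minimal sketch of that two-pass idea, not anything that exists in tokenizer.py. The names strip_long_tags and strip_tags, the 4096-character cutoff, and replacing each tag with a single space are all assumptions made for illustration. The second pass scans with str.find rather than a regular expression, so a long element never becomes a long regex match.

```python
import re

# First pass: the quoted "cheap gimmick", limited to 256 characters per tag.
html_re = re.compile(r"""
    <
    (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
    [^>]{0,256} # search for the end '>', but don't run wild
    >
""", re.VERBOSE | re.DOTALL)


def strip_long_tags(text, max_len=4096):
    """Hypothetical second pass: drop <...> spans the short regex skipped.

    max_len is an assumed upper bound on tag length, not a tuned value.
    """
    pieces = []
    pos = 0
    while True:
        start = text.find("<", pos)
        if start == -1:
            break
        end = text.find(">", start + 1)
        if end == -1:
            break  # no closing '>' anywhere after this point
        first = text[start + 1]
        if first.isspace() or first in "<>" or end - start - 1 > max_len:
            # Not tag-like (or longer than the cutoff): keep the '<', move on.
            pieces.append(text[pos:start + 1])
            pos = start + 1
        else:
            # Looks like a tag: keep the text before it, replace it by a space.
            pieces.append(text[pos:start])
            pieces.append(" ")
            pos = end + 1
    pieces.append(text[pos:])
    return "".join(pieces)


def strip_tags(text):
    """Cheap regex first, then the slower sweep for anything longer."""
    text = html_re.sub(" ", text)
    return strip_long_tags(text)
```

With a message containing a hidden <input> whose value runs past 256 characters, html_re.sub alone leaves the hammy value in the text, while strip_tags removes the whole element and still leaves something like 'a < b' untouched.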

> Do you still have that message so you can post it as an attachment in its
> entirety?

Sure ...


