[Spambayes] Missing HTML payload

Mon Mar 3 20:50:42 EST 2003

[Mark Hammond]
> The following mail got past SpamBayes.  Looking at the clues, it appears
> that spambayes was missing the HTML body of the message (which
> *does* render almost correctly in Outlook).
>
> I instrumented the "show clues" feature to show *all* message tokens
> found in the body.  As you can see at the very end, the entire body was
> stripped.
>
> I am guessing that we barf on:
>             <td><!--#rotato>
> a comment which is never closed.

That would do it!  tokenizer.py's Stripper class eliminates (via subclasses)
various kinds of bracketed structures, and HTML comments are among them.  I
see that the analyze() method silently drops an open-bracket match that has
no matching end-bracket construct, along with all the text after it.  This
was neither intentional nor unintentional <wink>.  It seems like it would be
better to replace:

            m = self.find_end(text, end)
            if not m:
                break

with:

            m = self.find_end(text, end)
            if not m:
                pushretained(text[start:])  # add this line
                break

Then the unmatched open-bracket construct, and everything following it, will
be retained.  This will apply to unclosed HTML comments, unclosed style
sheets, unclosed uuencoded sections, and unclosed embedded URLs.  I think
I'm fine with retaining all of those.
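
To see the difference on a line like Mark's, here is a minimal,
self-contained sketch.  It is not the actual tokenizer.py code:
strip_comments(), keep_unclosed, and the two regexps are names made up for
illustration, it handles only HTML comments, and it skips the token
generation the real Stripper subclasses do.

```python
import re

# Hypothetical stand-ins for the find_start/find_end pair a Stripper uses.
COMMENT_START = re.compile(r"<!--")
COMMENT_END = re.compile(r"-->")


def strip_comments(text, keep_unclosed=True):
    """Remove <!-- ... --> sections and return the retained text.

    keep_unclosed=False mimics the old behavior (an unclosed comment
    swallows the rest of the text); True mimics the proposed fix.
    """
    i = 0
    retained = []
    pushretained = retained.append
    while True:
        m = COMMENT_START.search(text, i)
        if not m:
            # No more comments: keep whatever is left.
            pushretained(text[i:])
            break
        start, end = m.span()
        pushretained(text[i:start])
        m = COMMENT_END.search(text, end)
        if not m:
            if keep_unclosed:
                # Act as if the open bracket were ordinary text.
                pushretained(text[start:])
            break
        i = m.end()
    return " ".join(retained)


html = "<td><!--#rotato>Buy now!</td>"
print(strip_comments(html, keep_unclosed=False))  # <td>
print(strip_comments(html, keep_unclosed=True))   # <td> <!--#rotato>Buy now!</td>
```

With keep_unclosed=False everything after "<td>" vanishes, which matches
what Mark saw in the clues.  Properly closed comments are stripped either
way.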

> Outlook actually shows this entire tag (i.e., literally "<!--#rotato>"),
> then displays the rest of the HTML correctly.  I guess that we treat
> the comment as unclosed, while Outlook ignores it.

Sounds right.

> Any thoughts?

Nope, not a one <wink>.