[Spambayes] Suggestion for HTML analysis

Matthew Dixon Cowles matt at mondoinfo.com
Sun Sep 14 18:02:27 EDT 2003

Dear Tom,

> I'm new to the list.

Hello and welcome.

> recently I've gotten HTML-formatted spam that attempts
> to circumvent recognition by inserting copious amounts of HTML
> garbage tags between letters

> I think Spambayes is fooled by this technique, because I don't see
> any of the operative words in the analysis

Tim Peters addressed that trick in May. From the CVS checkin comment:

    I dug into a small collection of Unsures that looked like blatant
    spam, and discovered they were all using this kind of trick:

      Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion

    That is, disguising words by inserting HTML nonsense tags.  We
    replaced each tag with a blank, yielding the pretty useless
    tokens "Wr", "inkle", "Reduc" and "tion".  We previously fixed a
    similar problem using embedded HTML comments.  I should have
    fixed this other one then.
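
To make the effect in that comment concrete, here is a minimal Python
sketch. It is not the actual Spambayes tokenizer: the regular expression
and both function names are made up for illustration, and deleting the
tag outright is only one assumed way of keeping the disguised word in
one piece.

```python
import re

# Hypothetical pattern: anything from "<" up to the next ">" counts as a tag.
TAG_RE = re.compile(r"<[^>]*>")

def tokens_with_blank(text):
    """Replace each tag with a space, then split on whitespace."""
    return TAG_RE.sub(" ", text).split()

def tokens_with_nothing(text):
    """Delete each tag outright, then split on whitespace."""
    return TAG_RE.sub("", text).split()

sample = "Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion"
print(tokens_with_blank(sample))    # ['Wr', 'inkle', 'Reduc', 'tion']
print(tokens_with_nothing(sample))  # ['Wrinkle', 'Reduction']
```

Deleting every tag would also glue together words that a real tag such
as <br> is meant to separate, so a real tokenizer would probably have to
be more selective than this.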


