[Spambayes] Suggestion for HTML analysis

Tim Peters tim.one at comcast.net
Sun Sep 14 18:23:23 EDT 2003


[Tom Bates]
> I'm new to the list. I hope this topic hasn't already been beat to
> death, but recently I've gotten HTML-formatted spam that attempts to
> circumvent recognition by inserting copious amounts of HTML garbage
> tags between letters, like so (an actual sample):
>
> Co<!an zsayjpoa dlweabk  sni o
> hgmysios i gdkqfwvin da  byn
> wkt pt g    py wd
> k!>nsoli<!wuis me l rj mrdc
> ebsi vhviyrz
> auu xxq tp
>  ffpmsck wklzmuyvtb tg u lhk cqny rm
> r
> yb!>dat<!w j t i
> b qdsg
> bm
>
> jhj
> qyjq gbbbej eu
> pf
>  chlhqj  sedz g stb p mbjo ned ybssswbv yg!>ion
>
> All this just to spell the word "Consolidation" without detection. I
> think Spambayes is fooled by this technique, because I don't see any
> of the operative words in the analysis.

It's not fooled, but it's doing something you're not expecting:  testing
showed that tokens longer than 12 characters ("Consolidation" has 13) boost
the database size more than they help classification.  Instead a "summary
token" is generated for each such thing, of the form

    skip:<first_character> <rounded_indication_of_total_length>

Specifically, the token Consolidation generates a

    skip:c 10

token.  I mailed this portion of your email to myself and verified that this
is indeed what it does with it (throws out the junk tags, finds the token
Consolidation, and replaces it with a synthesized 'skip:c 10' token).
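
A rough sketch of that summary-token rule, for illustration only (this is not
the actual spambayes tokenizer). The 12-character limit comes from the
description above; the function name, the lowercasing, and the "round down to
a multiple of 10" rule are assumptions chosen to reproduce the 'skip:c 10'
example:

```python
def summarize_token(word, max_len=12):
    """Return the word itself, or a 'skip:' summary token if it is too long.

    Assumptions: words are lowercased first, and the length is rounded
    down to a multiple of 10 (13 -> 10), matching the 'skip:c 10' example.
    """
    word = word.lower()
    n = len(word)
    if n <= max_len:
        # Short enough to keep as an ordinary token.
        return word
    # Too long: keep only the first character and a rounded length.
    return "skip:%s %d" % (word[0], n // 10 * 10)


print(summarize_token("Consolidation"))  # skip:c 10
```

Under these assumptions every 13-character word starting with "c" collapses to
the same token, which is how the database stays small.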

> Could Spambayes look for an opening <body> tag and go into HTML
> detection mode?

Spambayes is *always* in "HTML detection mode".  Too much spam relies on
extremely forgiving HTML parsers for spambayes to trigger on niceties like
the existence of a <body> tag -- lots and lots and lots of HTML spam isn't
legal HTML.  So spambayes assumes that everything it gets is HTML, and
applies its own extremely forgiving parsing to it.
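
To give a feel for what "extremely forgiving" can mean, here is a minimal
sketch that throws out anything that merely looks like a tag, without checking
whether the document is legal HTML. The regular expression and the function
name are assumptions made for this example, not the pattern spambayes itself
uses:

```python
import re

# Assumption: a "tag" is '<', then one non-space character that is not
# '<' or '>', then everything up to the next '>' (newlines included,
# since a negated character class matches them).
TAG_LIKE = re.compile(r"<[^\s<>][^>]*>")


def strip_tag_like(text):
    """Remove tag-like junk so the surrounding letters join back up."""
    return TAG_LIKE.sub("", text)


# A shortened version of the sample quoted above.
sample = "Co<!an zsay\nk!>nsoli<!wuis me\nyb!>dat<!w j t\nyg!>ion"
print(strip_tag_like(sample))  # Consolidation
```

No <body> tag or well-formed markup is needed for this to work, which is the
point: the junk is removed whether or not the message claims to be HTML.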

> Spambayes is working very well for me. About 40% of spam goes into the
> "maybe" bucket at this point, and that percentage seems to be slowly
> improving.

That's actually pretty bad, but we need more details to suggest why it might
be so bad.  Like which version of the code you're using, which client (e.g.,
Outlook addin, pop3proxy, ...), what you set your ham and spam cutoffs to,
how many total ham you've trained on, ditto spam, ..., and a thousand other
things we'll torture you with one at a time <wink>.
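
For reference, the two cutoffs split the score range into three buckets, which
is where the "maybe" bucket comes from. A small sketch, assuming scores run
from 0.0 to 1.0; the function name, the cutoff values, and the handling of
scores exactly on a boundary are illustrative assumptions, not necessarily
what your client is configured with:

```python
def bucket(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0.0, 1.0] to 'ham', 'unsure', or 'spam'.

    The cutoff values are illustrative assumptions; use whatever your
    own configuration says.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    # Everything between the two cutoffs lands in the "maybe" bucket.
    return "unsure"


for s in (0.05, 0.50, 0.97):
    print(s, bucket(s))
```

Moving the cutoffs closer together shrinks the middle bucket, at the price of
more mistakes on either side.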



