[Spambayes] Re: Spambayes Digest, Vol 52, Issue 26

Robert Woodhead trebor at animeigo.com
Mon Dec 16 13:37:55 EST 2002


>The technically cleverest spam I've gotten to date remains an HTML spam that
>interspersed legitimate news stories & tech newsgroup postings with the
>spam, but specified a tiny font and white-on-white for the legit parts.
>Invisible when rendered.  I've only seen that once, and part of the downside
>of stripping HTML tags is that the classifier will never learn on its own
>which HTML tricks are used to get this effect.  OTOH, you can't guess
>someone's "ham words" without knowing something about them, and personal
>information is very expensive for spammers to obtain or exploit.

I was a bit surprised that you guys haven't run across the embedding 
tricks before.  In my spam parsing, I have the parser spit out not 
only the words, but also the tokens internal to a tag (< and > are 
considered whitespace), and concatenate the pieces of any word that 
is broken up by tags.

So

foo<!-- derf -->bar foo<bork>baz <font color=#FFFFFF>

results in output:

derf foobar bork foobaz font color ffffff

Seems to work well.  The state machine for doing this is trivial. 
And the extra stuff you glean from the interior of tags is likely to 
be significant.
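
For anyone who wants to try it, here is a minimal sketch of that kind 
of state machine in Python.  It is not my actual parser: the name 
tokenize is made up for illustration, and lowercasing everything and 
splitting tag interiors on non-alphanumeric characters are 
assumptions inferred from the sample output above.

```python
import re

# Assumption: tokens inside a tag are runs of letters and digits, lowercased.
_TAG_TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Return tag-internal tokens plus visible words rejoined across tags."""
    tokens = []
    word = []       # characters of the visible word being assembled
    tag = []        # characters seen inside the current tag
    in_tag = False

    for ch in text:
        if in_tag:
            if ch == ">":
                # Tag closed: emit its internal tokens right away.
                tokens.extend(_TAG_TOKEN.findall("".join(tag).lower()))
                tag = []
                in_tag = False
            else:
                tag.append(ch)
        elif ch == "<":
            # Enter a tag without flushing the word, so "foo<x>bar"
            # is later emitted as the single word "foobar".
            in_tag = True
        elif ch.isspace():
            # Only real whitespace ends a visible word.
            if word:
                tokens.append("".join(word).lower())
                word = []
        else:
            word.append(ch)

    # Flush whatever is left at end of input (including an unclosed tag).
    if tag:
        tokens.extend(_TAG_TOKEN.findall("".join(tag).lower()))
    if word:
        tokens.append("".join(word).lower())
    return tokens


sample = "foo<!-- derf -->bar foo<bork>baz <font color=#FFFFFF>"
print(" ".join(tokenize(sample)))
# prints: derf foobar bork foobaz font color ffffff
```

Note that the sketch treats the first > as the end of a tag, so a 
comment containing > will be cut short.  For classification purposes 
that probably doesn't matter much.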

R



