[Spambayes] Cute spam trick
Tim Peters
tim.one at comcast.net
Mon Dec 16 00:46:38 EST 2002
[Derek Simkowiak, on embedding other kinds of tags in words]
> ...
> I haven't followed the discussions on HTML handling, but given
> this latest cute trick this other stuff can't be far away.
I don't know, but Tim Stone was right that we strip out all HTML tags, so it
wouldn't help them against this system. They could still work around that,
by including extremely long tags -- our cheap-ass regexp gimmicks are
bounded in how far they'll look ahead when deciding what is and isn't a tag
(we don't even know whether we're looking at HTML, and don't want to chew up
non-HTML text that just happens to contain "<").
Someday I expect we'll need "a real" HTML parser -- but not today <wink>.
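To make the bounded look-ahead concrete, here is a rough sketch of that kind of regexp. It is an illustration only, not a copy of the project's tokenizer: the 256-character bound, the names `MAX_TAG_BODY` and `strip_tags`, and the choice to replace each tag with a single space are all assumptions made for the example.

```python
import re

# Hypothetical bound on how far we look for the closing '>'.
MAX_TAG_BODY = 256

# Match '<', refuse to continue if the next character is whitespace,
# '<' or '>' (so "a < b" and "a<>b" are left alone), then allow at most
# MAX_TAG_BODY non-'>' characters before the closing '>'.
TAG_RE = re.compile(r"<(?![\s<>])[^>]{0,%d}>" % MAX_TAG_BODY)


def strip_tags(text):
    """Replace anything that looks like a short HTML tag with a space."""
    return TAG_RE.sub(" ", text)


if __name__ == "__main__":
    print(repr(strip_tags("<b>free</b> money")))
    print(repr(strip_tags("if a < b then stop")))
    # A tag padded past the bound is not recognized and stays in the text.
    padded = "<" + "x" * 300 + ">"
    print(strip_tags(padded) == padded)
```

A tag padded past the bound simply survives as ordinary text, which is the workaround described above; raising the bound trades that risk against chewing up more non-HTML text that happens to contain "<".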
The technically cleverest spam I've gotten to date remains an HTML spam that
interspersed legitimate news stories & tech newsgroup postings with the
spam, but specified a tiny font and white-on-white for the legit parts.
Invisible when rendered. I've only seen that once, and part of the downside
of stripping HTML tags is that the classifier will never learn on its own
which HTML tricks are used to get this effect. OTOH, you can't guess
someone's "ham words" without knowing something about them, and personal
information is very expensive for spammers to obtain or exploit.
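On the downside of stripping tags: one hypothetical way to let the classifier see such tricks would be to emit synthetic tokens for font attributes before the tags are thrown away. The sketch below is an assumption-laden illustration, not something the project is known to do; the token names (`font-color:`, `font-size:`), the function `style_tokens`, and the restriction to `<font>` tags are all invented for the example.

```python
import re

# Only look at <font ...> tags, with the same kind of bounded search
# for the closing '>' (the 256 limit is an assumption).
FONT_RE = re.compile(r"<font\b([^>]{0,256})>", re.IGNORECASE)
# \bcolor does not match inside "bgcolor", since there is no word
# boundary between 'g' and 'c'.
COLOR_RE = re.compile(r"""\bcolor\s*=\s*["']?(#?\w+)""", re.IGNORECASE)
SIZE_RE = re.compile(r"""\bsize\s*=\s*["']?([+-]?\d+)""", re.IGNORECASE)


def style_tokens(html):
    """Return made-up tokens describing font color and size attributes."""
    tokens = []
    for match in FONT_RE.finditer(html):
        attrs = match.group(1)
        color = COLOR_RE.search(attrs)
        if color:
            tokens.append("font-color:" + color.group(1).lower())
        size = SIZE_RE.search(attrs)
        if size:
            tokens.append("font-size:" + size.group(1))
    return tokens


if __name__ == "__main__":
    sample = '<FONT color="#FFFFFF" size=1>legit news story</FONT>'
    print(style_tokens(sample))
```

Tokens like these would only help if they were fed to the classifier alongside the ordinary word tokens, so that it could learn for itself whether tiny white text correlates with spam.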