[Spambayes] To think like a spammer...

Tim Peters tim.one@comcast.net
Sun, 29 Sep 2002 00:15:16 -0400


[Anthony Baxter]
> ...
> A more ugly one I'm seeing (which is a persistent source of a few fn)
> is HTML email which is a huge slab of javascript, and the message
> text encoded inside the message.

Me too, especially when whitespace has been squashed out of the
Javascript.  The split-on-whitespace strategy then generates only 'skip:x'
tokens, and only one per line of code.
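
To make the failure concrete, here's a rough sketch of what
split-on-whitespace does to squashed script.  It is not the real tokenizer:
the 12-character cutoff, the exact shape of the skip token, and the names
tokenize_line and MAX_WORD_LEN are all assumptions made up for illustration.

```python
# Sketch only: MAX_WORD_LEN, the skip-token format, and tokenize_line are
# assumptions for illustration, not the project's actual tokenizer.
MAX_WORD_LEN = 12


def tokenize_line(line):
    """Split on whitespace; replace overlong 'words' with a summary token."""
    for word in line.split():
        if len(word) <= MAX_WORD_LEN:
            yield word.lower()
        else:
            # Keep only the first character and the length rounded down
            # to a multiple of 10; everything else in the word is lost.
            yield "skip:%c %d" % (word[0], len(word) // 10 * 10)


# One line of Javascript with all whitespace squashed out.
squashed = 'document.write(unescape("%3Cb%3EBuy%20now%3C/b%3E"));window.status="";'
tokens = list(tokenize_line(squashed))

# The same text written as ordinary prose, for contrast.
plain_tokens = list(tokenize_line("Buy now and save big"))
```

Here `tokens` holds a single skip token for the whole line of script, while
`plain_tokens` gets one token per word.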

We could certainly do better on this.  Part of the *problem* is the HTML
tag-stripping, which removes all direct knowledge about the presence of
script before real tokenization even begins.  I expect there are lots of
clues we're missing by stripping HTML, btw; the only ones we catch are
embedded http/https/ftp thingies (which are mined before HTML stripping
takes place).
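
One way to keep some of those clues, sketched under loud assumptions: mine
the tag names before stripping, the same way the URLs are mined first.  The
"html-tag:" token prefix, both regexps, and the function names below are
hypothetical, not anything in the current code.

```python
import re

# Sketch only: the regexps, the "html-tag:" prefix, and both function names
# are hypothetical.  Matches opening tags and captures the tag name.
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")
# Crude stand-in for tag stripping: anything between < and > goes away.
STRIP_RE = re.compile(r"<[^>]*>")


def tag_clue_tokens(html):
    """Return one synthetic token per distinct opening tag name."""
    seen = set()
    for match in TAG_RE.finditer(html):
        seen.add(match.group(1).lower())
    return sorted("html-tag:" + name for name in seen)


def strip_tags(html):
    """Replace every tag with a space, leaving only the text between tags."""
    return STRIP_RE.sub(" ", html)


msg = ('<HTML><script language="JavaScript">document.write("hi")'
       '</script><b>Buy</b></html>')
clues = tag_clue_tokens(msg)   # mined first, while the tags still exist
text = strip_tags(msg)         # then strip as usual
```

After this, `clues` still records that a script tag was present even though
`text` no longer contains any tags.  Whether such tokens actually help is
something only a test run can say.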

A cute one:  I had one false negative with about 50 spam phrases hiding in
a META KEYWORDS thingie.  Of course we threw all of them away without
looking at them.
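
For what it's worth, pulling those phrases out before stripping wouldn't
take much.  A minimal sketch using the standard library's html.parser; the
class name, the function name, and the "meta-keyword:" token prefix are all
invented here, and splitting the content on commas is an assumption about
how such lists are usually written.

```python
from html.parser import HTMLParser


class MetaKeywordMiner(HTMLParser):
    """Collect phrases from <meta name="keywords" content="..."> tags.

    Sketch only: the class name and the "meta-keyword:" prefix are
    hypothetical.
    """

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() != "keywords":
            return
        content = attr_map.get("content") or ""
        # Assumption: phrases are comma-separated.
        for phrase in content.split(","):
            phrase = phrase.strip().lower()
            if phrase:
                self.keywords.append("meta-keyword:" + phrase)


def meta_keyword_tokens(html):
    """Return synthetic tokens for every META KEYWORDS phrase in html."""
    miner = MetaKeywordMiner()
    miner.feed(html)
    miner.close()
    return miner.keywords


sample = ('<html><head><META NAME="KEYWORDS" '
          'CONTENT="free money, lose weight , ,Viagra"></head></html>')
found = meta_keyword_tokens(sample)
```

Each phrase in `found` could then be fed to the classifier as a token of its
own instead of vanishing with the rest of the markup.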