[Spambayes] To think like a spammer...
Tim Peters
tim.one@comcast.net
Sun, 29 Sep 2002 00:15:16 -0400
[Anthony Baxter]
> ...
> A more ugly one I'm seeing (which is a persistent source of a few fn)
> is HTML email which is a huge slab of javascript, and the message
> text encoded inside the message.
Me too, especially when whitespace has been squashed out of the
JavaScript. The split-on-whitespace strategy then generates only 'skip:x'
tokens, and only one per line of code.
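To make the effect concrete, here is a minimal sketch of a split-on-whitespace tokenizer. It is a simplification for illustration, not the actual Spambayes tokenizer: the function name `tokenize_words`, the length limits, and the exact 'skip' token format are all assumptions.

```python
def tokenize_words(text, min_len=3, max_len=12):
    """Hypothetical simplification: split on whitespace, summarize long chunks."""
    for word in text.split():
        n = len(word)
        if min_len <= n <= max_len:
            yield word.lower()
        elif n > max_len:
            # Assumed format: only the first character and a rough
            # length bucket survive; the content itself is discarded.
            yield "skip:%c %d" % (word[0], n // 10 * 10)

# One line of script with all whitespace squashed out.
squashed = "document.write(unescape('%3Cb%3EBuy%20now%3C/b%3E'));"
print(list(tokenize_words(squashed)))
```

The whole line collapses into a single 'skip:d ...' token, so nothing inside the script (including the encoded message text) ever reaches the classifier.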
We could certainly do better on this. Part of the *problem* is the HTML
tag-stripping, which removes all direct knowledge about the presence of
script before real tokenization even begins. I expect there are lots of
clues we're missing by stripping HTML, btw; the only ones we catch are
embedded http/https/ftp thingies (which are mined before HTML stripping
takes place).
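One way to keep some of that knowledge would be to mine script blocks for clues before the tags are stripped, the same way the URLs are mined. The sketch below only illustrates the idea under stated assumptions: the regexes are deliberately crude, and the names `mine_script_clues` and the 'script:present' token are hypothetical, not anything in the current code.

```python
import re

# Hypothetical patterns, for illustration only; real-world HTML is messier.
script_re = re.compile(r"<\s*script\b[^>]*>(.*?)<\s*/\s*script\s*>",
                       re.IGNORECASE | re.DOTALL)
tag_re = re.compile(r"<[^>]*>")

def mine_script_clues(html):
    """Return (tokens, text): script clues are collected, then tags stripped."""
    tokens = []

    def note_script(match):
        # Record that a script block was present before it disappears.
        tokens.append("script:present")
        return " "

    text = script_re.sub(note_script, html)  # mine and remove script blocks
    text = tag_re.sub(" ", text)             # then strip the remaining tags
    return tokens, text

html = ("<html><script language=JavaScript>document.write('hi');</script>"
        "<b>Buy now</b></html>")
tokens, text = mine_script_clues(html)
print(tokens, text.split())
```

Whether a token like this actually helps is an empirical question; it would need to be tested against the corpora like any other tokenizer change.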
A cute one: I had one f-n (false negative) that had about 50 spam phrases
hiding in a META KEYWORDS thingie. Of course we threw all of them away
without looking at them.
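As a rough sketch of how such keywords could be looked at instead of thrown away, the following uses the standard library's `html.parser` to pull words out of a META KEYWORDS tag before stripping. The class name `MetaKeywordMiner`, the helper `mine_meta_keywords`, and the 'meta-kw:' token prefix are hypothetical choices for illustration, not part of the existing tokenizer.

```python
from html.parser import HTMLParser

class MetaKeywordMiner(HTMLParser):
    """Collect words from <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, not values.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() != "keywords":
            return
        for phrase in (attrs.get("content") or "").split(","):
            for word in phrase.split():
                # Hypothetical prefix so these stay distinct from body words.
                self.tokens.append("meta-kw:" + word.lower())

def mine_meta_keywords(html):
    miner = MetaKeywordMiner()
    miner.feed(html)
    miner.close()
    return miner.tokens

sample = ('<html><head><META NAME="KEYWORDS" CONTENT="free money, Viagra">'
          '</head><body>hi</body></html>')
print(mine_meta_keywords(sample))
```

Tagging the tokens with a prefix keeps them separate from words in the visible body, so the classifier can learn that a word hidden in META KEYWORDS is a different clue from the same word in plain text.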