[spambayes-dev] Missed spam - Spam Clues: bechtel
Tim Peters
tim.one at comcast.net
Sat Aug 2 12:47:45 EDT 2003
[Skip]
> Yeah, this is what I reported the other day. I've gotten a few of
> them and classified them all as spam. The latest one sneaked into my
> "low spam" range (0.81 I think). The url components are being
> classified for me.
>
> The pieces of $RANDOMIZE are turning into significant spam clues, as
> is url:pharm1.
Skip, you speculated before about tokens we could generate to get a better
handle on spam like this. I replied, but you didn't realize it <wink>: it
was in a reply to Sean True about why a new "statistical summary" token of
any flavor isn't likely to help much on its own (just one token of many, and
all tokens have equal weight).
I don't see enough of this stuff to worry about it, but it's clear that the
"white on white" (black on black, etc.) trick can be pretty good at dragging
spam scores down to the Unsure range. Anyone care enough to do something
about it <wink>? Simplest thing would be to tokenize color= attributes, but
I don't think that would help much (the words it's hiding would still get
scored, and they're the real problem). Pseudo-parsing HTML is something
I've never liked, but God knows we do plenty of it already ...
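
For anyone who wants to experiment, here is a rough sketch of the "simplest thing": turning color= and bgcolor= attributes into tokens. It is not SpamBayes code. The names COLOR_ATTR_RE and color_tokens, the "color:" token prefix, and the extra "color:fg-matches-bg" clue are all made up for illustration, and the regex is pseudo-parsing, not a real HTML parser. Note that it only adds clues; the hidden words themselves would still be tokenized and scored as usual.

```python
import re

# Hypothetical helper, not part of SpamBayes. Matches color= and bgcolor=
# attributes with optional quotes; style="color: ..." is deliberately ignored.
COLOR_ATTR_RE = re.compile(
    r'\b(color|bgcolor)\s*=\s*["\']?\s*(#?[0-9a-zA-Z]+)',
    re.IGNORECASE,
)

def color_tokens(html):
    """Yield one 'color:<attr>=<value>' token per color attribute found.

    Also yields a synthetic 'color:fg-matches-bg' token whenever a
    foreground color equals a background color seen earlier in the text.
    """
    backgrounds = set()
    for match in COLOR_ATTR_RE.finditer(html):
        attr = match.group(1).lower()
        value = match.group(2).lower()
        yield "color:%s=%s" % (attr, value)
        if attr == "bgcolor":
            backgrounds.add(value)
        elif value in backgrounds:
            # Same color for text and background: likely invisible text.
            yield "color:fg-matches-bg"

sample = ('<body bgcolor="#FFFFFF"><font color="#ffffff">bechtel</font>'
          '<font color=red>Buy now</font></body>')
tokens = list(color_tokens(sample))
```

The comparison is purely textual, so "white" and "#ffffff" would not be treated as the same color, and near-matches such as "#fffffe" slip through. Whether any of these tokens earns a useful spamprob is something only testing on real mail would show.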