[Spambayes] Hello + Problems
tim.peters at gmail.com
Sun Jul 18 03:16:32 CEST 2004
[Steven J. Hodgen <steven at twitch.net>]
> I'm new to this list, but have been using Spambayes for a couple of years now.
That's a good trick, since SB didn't exist 2 years ago <wink>.
> I'm writing this because Spambayes is no longer working nearly as effectively
> as before. I get tons and tons of spam. I suppose that I was stupid for not
> mangling my address on Usenet newsgroup postings in the past, but the
> damage is done.
Name-mangling probably doesn't help much. Using "steven" as your
email name probably hurt, as there's no string <= 6 characters that
won't be hit by randomized name generation routinely. Using a real
name guarantees spammers will find you easily. They have no trouble
finding me either, of course, and even on accounts from which I've
*never* posted a message.
> In any case, I recently decided to carefully retrain Spambayes in the hope that
> better training would solve the problem, but it hasn't. In fact, I'm somewhat
> shocked, since I've had such excellent results in the past.
> I'm writing this post hoping either for good advice on how to make Spambayes
> work well again, or to get validation that there are some problems.
The only way to know why SB scores your email the way it does is to
stare at clue listings. That may or may not reveal the source of your
current problems, but it's the only source of actual information there
is to be had. You can often out-think why a rule-based system is
screwing up (like "ah, that regular expression is matching more than
it intended to match"), but there are no rules in SB, just simple
> My suspicion is that spammers are getting much cleverer in their use of
> things like:
> V iag r.a
> And permuting this sort of thing to such an extent that Spambayes can't latch
> on to it. I'm a programmer, and I can see that this would be a difficult problem to
You should enjoy this:
There are 600,426,974,379,824,381,952 ways to spell Viagra
SB doesn't *try* to out-think any of them, though. Doesn't matter
much in my own email mix, although the inability to detect
moron-spelling tricks like this clearly boosts my Unsure rate. The
saving grace is that these msgs rarely have anything that "looks
hammy" to me either, so the unrecognized (ignored) obfuscations don't
often save them from being classed as spam.
> One question: it is my understanding that Spambayes uses "words" as the
> basic unit for scoring. If so, is a space the only character used as a break?
The message body is split purely on whitespace (spaces, tabs, carriage
returns, newlines), except that embedded URLs and email addresses
generate a pile of punctuation-sensitive tokenizations, and most HTML
tags are thrown away unlooked-at. In the email headers, a number of
tokenization gimmicks are used, often specific to the header line
being examined. All of these decisions were driven by test results,
and there's no particular *logic* to them beyond "because tests said
that worked best".
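To make the whitespace-only part concrete, here is a minimal sketch of what such a split does to an obfuscated word. It is not SB's actual tokenizer: the function name tokenize_body is made up for illustration, and the URL, email-address, HTML and header handling described above is left out entirely.

```python
def tokenize_body(text):
    """Split text purely on whitespace (spaces, tabs, CRs, newlines).

    Hypothetical helper for illustration only; the real tokenizer does
    much more (URLs, email addresses, HTML tags, header-specific tricks).
    """
    # str.split() with no argument splits on any run of whitespace
    # and discards empty strings.
    return text.split()


if __name__ == "__main__":
    # The obfuscated spelling never produces a "Viagra" token; it just
    # turns into a few short fragments.
    print(tokenize_body("Buy V iag r.a\tnow\r\nViagra"))
```

The point of the sketch is only that "V iag r.a" comes out as the pieces "V", "iag" and "r.a", so nothing here tries to reassemble or out-think the obfuscation.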
> Any thoughts on this? I'm certain that this business has been rehashed many
> times, since Spambayes has been around for a while now and has an active
> user base, but, well, the spam is driving me nuts…
You have to stare at clue listings for your misclassified messages.
Whenever I've had a problem, that quickly revealed the cause (which
usually turned out to be a message trained into the wrong category).
I wrote the bulk of the tokenization and classification code, so I may
have a slight advantage <wink> in interpreting these listings, but
many others have demonstrated a real flair for this too.
Post some clue listings here, and I'm sure someone will take a whack
at that. It helps to attach the full, raw source of the misclassified
message too, and reveal how many spam and how many ham you've trained
on (and if those numbers aren't roughly equal, the first thing you'll
hear back is that unbalanced training is a known cause of poor