[Spambayes] RE: Spam Clues: Schulz asked me to send you this.
tameyer at ihug.co.nz
Sun Jan 2 23:55:47 CET 2005
> In the meantime, (if you have time), don't you think the
> strategy of pasting a portion of a legitimate message in with
> the Spam is going to be troublesome? Mathematically it seems
> like a problem since half or more of the message wouldn't
> look like spam.
If the 'legitimate message' part is actually made up of words (tokens)
that are also in messages you have trained as ham, then yes, that could be a
problem (if the ratio of those words is high enough). However (ignoring
personally tailored messages for the moment), the chance of hitting on words
that happen to be in your database as ham is pretty low, and there's the
additional chance that a word will be used that's actually in your spam
database (this is where an individual filter shines, since a word that's ham
for you could be spam for me).
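To put rough numbers on that, here is a minimal sketch of chi-squared combining of per-token spam probabilities, which is the general approach SpamBayes takes. It is a simplification, not the actual classifier code: the token probabilities (0.95, 0.05, 0.5), the `min_strength` cutoff of 0.1, and the function names are all assumptions chosen for illustration, and it skips the underflow handling a real implementation needs for long messages.

```python
import math

def chi2q(x2, dof):
    """Chance that a chi-squared variable with an even number of
    degrees of freedom `dof` is at least x2."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combined_score(probs, min_strength=0.1):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    # Tokens close to 0.5 (e.g. words never seen in training) carry
    # no evidence either way, so they are dropped before combining.
    strong = [p for p in probs if abs(p - 0.5) >= min_strength]
    if not strong:
        return 0.5
    n = len(strong)
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in strong), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in strong), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0

# Hypothetical messages, described only by their token probabilities.
spam_only = [0.95] * 10                      # ten strongly spammy tokens
with_unknown_padding = spam_only + [0.5] * 10  # padding not in the database
lightly_padded = spam_only + [0.05] * 3      # a few tokens that hit as ham
with_ham_padding = spam_only + [0.05] * 10   # as many ham hits as spam hits

for name, probs in [("spam only", spam_only),
                    ("unknown padding", with_unknown_padding),
                    ("lightly padded", lightly_padded),
                    ("ham padding", with_ham_padding)]:
    print(f"{name}: {combined_score(probs):.3f}")
```

Under these assumptions, padding made of words that aren't in the database changes nothing, a few ham hits pull the score down a little, and it takes roughly as many strong ham hits as spam clues to drag the message into the unsure range.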
If the message is tailored to you (say it's a copy of a ham message that you
received), then the chance is much higher that those tokens will be in your
database as ham. However, this greatly raises the cost of sending that spam
message to you. That sort of spam is extremely rare, since it's much more
cost effective to just send bulk mail out to everyone and rely on those
without (effective) filters to generate your revenue.
It seems like there are two main methods of combating this spamming
technique at the moment: using effective training (particularly training
that keeps the database size small, which greatly reduces the chance of a
random hit), and analysis techniques like DSPAM's "Bayesian Dobly".
> Here's the scoring and a good sample message. The scoring is
> higher than when it arrived because I used it to train as spam.
In the future, it would really help if you could send us clues prior to
training - training changes the clue list drastically, especially with
messages like this.
> # ham trained on: 19365
> # spam trained on: 1719
You have trained on a lot more ham than spam (11.3:1), which is probably
the biggest problem here. SpamBayes works best with approximately even
numbers of ham and spam - with this imbalance everything will look a lot
more like ham.
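For reference, the ratio comes straight from the two training counts quoted above. A quick check (the variable names are just for illustration):

```python
# Training counts taken from the quoted clue report.
ham_trained = 19365
spam_trained = 1719

ratio = ham_trained / spam_trained
print(f"ham:spam = {ratio:.1f}:1")  # prints "ham:spam = 11.3:1"
```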
That's also a fairly large database. It seems that the best results
generally come from fairly small databases (a few hundred messages). It
would definitely be worth retraining from scratch, and seeing if that
resolves the problem. With Outlook, the best method would probably be
'train on mistakes' (i.e. train unsures, false positives, and false
negatives). See <http://entrian.com/sbwiki/TrainingIdeas> for (a lot) more
on training styles.
Since you're retraining, you might also like to try the "use_bigrams"
option, which generally gives good results (and should be good with randomly
appended words) and reduces the required training time. If you'd like to do
this, open the file "default_bayes_customize.ini" in your data directory
(create one if there isn't one already) in a text editor (like notepad or
wordpad). Add these lines (excluding the """) to the end of the file:
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.