[Spambayes] Spam Clues: test

Thu Jan 27 06:44:16 CET 2005

Thanks for the great reply.  

-----Original Message-----
From: Tony Meyer [mailto:tameyer at ihug.co.nz] 
Sent: Wednesday, January 26, 2005 10:17 PM
To: Dave; spambayes at python.org
Subject: RE: [Spambayes] Spam Clues: test

> mail from this sender always goes to my junk suspects,
> though I ALWAYS mark it to "recover from spam"-  any suggestions as to
> how to make it not be spam?
> 
> Combined Score: 38% (0.384155)
> Internal ham score (*H*): 0.891505
> Internal spam score (*S*): 0.659815
> 
> # ham trained on: 23653
> # spam trained on: 1734
> 
> 6 Significant Tokens
> token                               spamprob         #ham  #spam
> 'to:addr:dave'                      0.00140231        160      0
> 'message'                           0.235257         7317    165
> 'from:none'                         0.616347          280     33
> 'message-id:invalid'                0.727912          280     55
> 'subject:test'                      0.799655           13      4
> 'to:no real name:2**0'              0.923187         1314   1159

Very short messages are very difficult, because there is not much for
SpamBayes to work with.  Messages that never travel through the Internet
(Exchange only ones like this) are especially difficult, because there
are no headers to generate tokens from.  It doesn't help right now, but
the 1.1 SpamBayes release tries harder to generate tokens from the
Exchange data available, which should help a bit.

Right now, however, the thing that would help the most is to retrain
from scratch.  We recommend that people try to keep their database
roughly balanced between ham and spam (yours is about 14:1).  If the
database is significantly imbalanced you get oddities like the 'to:no
real name:2**0' token, which has been seen in more ham than spam, but is
a significant spam clue (because comparatively little spam has been
seen).  With a balanced database all the significant tokens in that
message would have been ham clues (it probably would have been a solid
0%).  25000-odd messages is also a fairly large database; we find that
people generally get good results with smaller databases (say under 1000
messages).
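
To make the imbalance effect concrete, here is a minimal sketch of how a
per-token spam probability of this kind can be computed.  The function
name and the two constants (a prior of 0.5 with strength 0.45) are
assumptions for illustration, not a copy of the SpamBayes code, although
with those values the sketch reproduces the figures in the clue table
quoted above.

```python
# Assumed constants (not taken from the message above):
UNKNOWN_WORD_PROB = 0.5       # prior probability for a token with no history
UNKNOWN_WORD_STRENGTH = 0.45  # how much weight that prior carries


def spamprob(hamcount, spamcount, nham, nspam):
    """Estimate the spam probability of one token.

    hamcount/spamcount: messages of each kind that contained the token.
    nham/nspam: total messages of each kind in the training database.
    """
    # Compare *fractions* of each corpus, not raw counts.  This is why a
    # token seen in more ham than spam can still be a spam clue.
    ham_ratio = hamcount / nham
    spam_ratio = spamcount / nspam
    raw = spam_ratio / (ham_ratio + spam_ratio)

    # Pull the estimate toward the prior when there is little evidence.
    n = hamcount + spamcount
    return (UNKNOWN_WORD_STRENGTH * UNKNOWN_WORD_PROB + n * raw) / (
        UNKNOWN_WORD_STRENGTH + n
    )


# 'to:no real name:2**0' with the database sizes from the quoted message.
imbalanced = spamprob(1314, 1159, nham=23653, nspam=1734)

# Same token counts, but with a hypothetical balanced database.
balanced = spamprob(1314, 1159, nham=10000, nspam=10000)

print(f"imbalanced: {imbalanced:.6f}")
print(f"balanced:   {balanced:.6f}")
```

With the 14:1 database the token scores about 0.92, a strong spam clue;
with equal ham and spam totals the same counts give a value just under
0.5, which is neutral to slightly hammy.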

There's a lot of information about training techniques at
<http://entrian.com/sbwiki/TrainingIdeas>, but the simplest one to use
with Outlook is:

 * Remove your existing database (if you like, simply rename it and then
you can always revert to it if you want).

 * Only train on messages that end up in your unsure folder, good
messages that end up in the spam folder, and spam messages that stay in
a (watched) good folder.  If you end up getting only spam messages in
your unsure folder (after a while), consider lowering the spam threshold
(say to 80%, or maybe 70%).
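
The training rule in the second step can be written down as a tiny
decision function.  This is only a sketch: the function name and the
string labels are hypothetical and are not part of SpamBayes or its
Outlook plugin.

```python
def should_train(scored_as, actually):
    """Decide whether a message should be added to the training data.

    scored_as: how the filter classified it ('ham', 'unsure' or 'spam').
    actually:  what the message really is ('ham' or 'spam').
    """
    # Anything the filter was unsure about is worth training on.
    if scored_as == "unsure":
        return True
    # Otherwise train only on mistakes: ham filed as spam, or spam that
    # stayed in a good folder.
    return scored_as != actually


print(should_train("unsure", "ham"))   # unsure message
print(should_train("spam", "ham"))     # good message in the spam folder
print(should_train("ham", "ham"))      # correctly filed, so skip it
```

Correctly classified messages are deliberately left out, which keeps the
database small.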

I hope this is of use!

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.