[Spambayes] notes on sb_server

Anthony Baxter anthony at interlink.com.au
Mon Jan 12 01:58:44 EST 2004


>>> "Tony Meyer" wrote
> >   An optional setting to say "discard everything < 0.02 and > 
> > 0.98" would be good - I believe someone offered patches for 
> > this on the weekend.
> 
> Brendon Whateley described patches, but didn't offer them.  His was a bit
> different - it removed messages from the page, rather than defaulting to
> discard, and it presented the option to do so on all review pages.  I like
> yours a lot more than his (although I haven't seen his code).  Yours is only
> accessible via the Advanced options page, which seems better to me, and
> doesn't actually stop people missing any false positives/negatives that
> score 1.0/0.0.  I'm +1 on your patch being checked in, but others might like
> his more (Brendon - can you send in a patch?).  Richie - what do you think?

I've found that this produces _vastly_ improved scoring for me, so I've
checked in my version (after changing the option to REAL from INTEGER).

I dumped my previous database (train on everything, TOE, for a while,
then train on mistakes/unsure) this morning, in favour of a non-edge
training regime. I haven't got the time to run a full set of statistics:
all available CPU time is currently being spent reprocessing about 80G
of billing data from the last 5 years (don't ask). Under the old
training regime, though, a small but non-zero number of the random
gibberish spams were leaking through with scores of 0.0 or 0.01. Since
I switched to this, none have leaked.
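
For anyone unsure what the "discard everything < 0.02 and > 0.98"
default amounts to, here is a minimal sketch of the idea. It is not the
code that was checked in: the function name, the constant names and the
"discard"/"train" strings are made up for illustration, and the
thresholds are just the two figures quoted above. Messages are still
shown on the review page either way; only the default action changes.

```python
# Hypothetical names; thresholds are the 0.02 / 0.98 figures from the thread.
HAM_EDGE = 0.02
SPAM_EDGE = 0.98


def default_review_action(score, ham_edge=HAM_EDGE, spam_edge=SPAM_EDGE):
    """Pick the default action for one message on the review page.

    Messages the classifier is already very sure about (the "edges")
    default to being discarded, so only non-edge messages are trained on
    unless the user overrides the default by hand.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0.0, 1.0]")
    if score < ham_edge or score > spam_edge:
        return "discard"
    return "train"
```

With these assumed thresholds, a message scoring 0.01 or 0.99 defaults
to discard, while one scoring exactly 0.02 or 0.98 still defaults to
train.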

The stats from a few hours of processing today: 

 Total emails trained: Spam: 23  Ham: 42

 SpamBayes has processed 328 messages - 176 (54%) good, 99 (30%) spam 
 and 53 (16%) unsure.

 41 messages were manually classified as good (2 were false positives).
 22 messages were manually classified as spam (0 were false negatives).
 22 unsure messages were manually identified as good, and 15 as spam.

The false positives were because I stupidly trained on a single spam
first. I'd almost suggest putting a 'you have to train on 5 spam and
5 ham before it starts filtering' rule into the proxy.
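
A rule like that could be as small as the sketch below. This is only an
illustration of the suggestion, not anything that exists in the proxy:
the function name and the two constants are hypothetical, and the
minimum of 5 for each class is simply the number floated above.

```python
# Hypothetical minimums, taken from the "5 spam and 5 ham" suggestion.
MIN_SPAM_TRAINED = 5
MIN_HAM_TRAINED = 5


def ready_to_filter(nspam, nham,
                    min_spam=MIN_SPAM_TRAINED, min_ham=MIN_HAM_TRAINED):
    """Return True once enough of both classes have been trained.

    Until this returns True, the caller would pass mail through
    unclassified instead of trusting a lopsided database.
    """
    return nspam >= min_spam and nham >= min_ham
```

With one spam and no ham trained (the situation that caused the false
positives) this returns False; with the 23 spam and 42 ham counts
reported above it returns True.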

The gibberish spams are being nailed again. The number of messages
being trained on (scoring outside [0.02 - 0.98]) has dropped to almost
zero, as have the unsures.

I have to wonder if making non-edge the default option in the next
release of the code (with advice to toss the training database) isn't
a bad plan.

Anthony
-- 
Anthony Baxter     <anthony at interlink.com.au>   
It's never too late to have a happy childhood.



