[Spambayes] Leaving for another tool. [BUG + FIX]

Thomas Hruska thruska at cubiclesoft.com
Wed Dec 12 15:23:13 CET 2007


Robert Woodhead wrote:
>> On the plus side, I am noticing a significant difference this time 
>> around.  Trained on just 20 messages so far and it is definitely working 
>> a lot better than my previous approach of training on everything 
>> (60,000+ messages - and took almost 300 messages to reach the same point 
>> I'm at now).  Still have a ways to go before I know for certain. 
>> Training one message at a time is going to take a while.
>>
>>   
> I've been lurking on the list for ages, and have finally gotten a chance 
> to try out spambayes (moved to Thunderbird after getting fed up with 
> Apple Mail).  I have to echo Thomas' comments; Spambayes should train 
> properly when confronted with common user behavior in the mailreader 
> (i.e., she tells spambayes when unsures are spam, and when spam is ham, 
> but usually not when unsures are ham).
> 
> I am probably recapitulating some old suggestions (or even, this is the 
> way that SB works already), but it occurs to me that you can deal with 
> the problem of database growth by simply cutting back the word counts 
> regularly (i.e., when the spam or ham word count of any word exceeds some 
> number, divide all the word counts of everything in the database by 2) 
> and then zapping all of the middle-of-the-road noise words to get the 
> total word count down to some reasonable number.  Wouldn't this also 
> deal with evolving spam signatures in a natural manner?
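
The decay-and-prune idea in the quoted paragraph above can be sketched
roughly as follows.  This is only an illustration under stated
assumptions, not how Spambayes stores or scores tokens: the function
name, the `cap` and `max_words` parameters, and the plain
spam/(spam+ham) ratio used to decide which words are "middle of the
road" are all hypothetical.

```python
def decay_and_prune(counts, cap=1000, max_words=50000):
    """Return a decayed, pruned copy of a word-count table.

    counts maps word -> (spam_count, ham_count).
    cap and max_words are illustrative values, not Spambayes settings.
    """
    # Step 1: if any single count exceeds the cap, halve every count.
    if any(max(s, h) > cap for s, h in counts.values()):
        counts = {w: (s // 2, h // 2) for w, (s, h) in counts.items()}
    else:
        counts = dict(counts)

    # Words whose counts decayed to zero carry no evidence; drop them.
    counts = {w: (s, h) for w, (s, h) in counts.items() if s + h > 0}

    # Step 2: if the table is still too big, keep the words whose
    # spam ratio is farthest from 0.5 (the least neutral ones).
    if len(counts) > max_words:
        def strength(item):
            s, h = item[1]
            return abs(s / (s + h) - 0.5)

        counts = dict(sorted(counts.items(), key=strength, reverse=True)[:max_words])
    return counts


table = {"viagra": (1200, 2), "meeting": (3, 900), "the": (500, 510), "rare": (1, 0)}
small = decay_and_prune(table, cap=1000, max_words=2)
```

A real database would presumably need its per-class message totals
scaled along with the word counts so the ratios stay consistent; the
sketch leaves that out.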

That won't work nearly as effectively as what I suggested, and probably 
won't work at all.  Now that I understand roughly how Spambayes works, 
the only CORRECT way (at the moment) to train it is to train on exactly 
one message at a time: every other message that was classified between 
the previous training and the current one MUST be discarded.  You could 
have thousands of messages in the queue, but because Spambayes doesn't 
recalculate their classifications after each training (a bug that is 
fixable using my proposed fix), you can't train on the rest of the 
messages unless you want a diluted database.
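
One way to read the point about recalculating classifications is the
loop sketched below: each queued message is re-scored against the
current database, and it is trained on only if it is still
misclassified or unsure.  This is a guess at the shape of such a fix,
since the proposed fix itself is not spelled out in this message.
`ToyClassifier`, `train_with_rescoring`, and the cutoff values are
hypothetical stand-ins, not Spambayes APIs.

```python
class ToyClassifier:
    """Hypothetical stand-in for a real classifier (per-word counts)."""

    def __init__(self):
        self.spam = {}
        self.ham = {}

    def train(self, words, is_spam):
        table = self.spam if is_spam else self.ham
        for w in set(words):
            table[w] = table.get(w, 0) + 1

    def score(self, words):
        # Average spam ratio over known words; 0.5 if nothing is known.
        ratios = []
        for w in set(words):
            s, h = self.spam.get(w, 0), self.ham.get(w, 0)
            if s + h:
                ratios.append(s / (s + h))
        return sum(ratios) / len(ratios) if ratios else 0.5


def train_with_rescoring(clf, queue, spam_cutoff=0.9, ham_cutoff=0.2):
    """queue holds (words, is_spam) pairs labelled by the user.

    Returns how many messages were actually trained on.
    """
    trained = 0
    for words, is_spam in queue:
        # Re-score now, so earlier training in this pass is reflected.
        p = clf.score(words)
        already_right = p >= spam_cutoff if is_spam else p <= ham_cutoff
        if not already_right:
            clf.train(words, is_spam)
            trained += 1
    return trained


clf = ToyClassifier()
queue = [
    (["buy", "pills"], True),
    (["buy", "pills", "now"], True),
    (["lunch", "today"], False),
]
n_trained = train_with_rescoring(clf, queue)
```

In this toy run the second message is skipped, because training on the
first one is already enough to classify it as spam.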


> PS: It isn't immediately obvious from the web-based interface how to zap 
> your database, or exactly what the save+quit button really does.

An option I'd like to see, but not nearly as much as my suggested fix.

-- 
Thomas Hruska
CubicleSoft President
Ph: 517-803-4197

*NEW* MyTaskFocus 1.1
Get on task.  Stay on task.

http://www.CubicleSoft.com/MyTaskFocus/


