[Spambayes] Re: Training Spambayes

Mathew Hendry TJLWBECGSGWU at spammotel.com
Mon Dec 20 19:33:55 CET 2004


On Mon, 20 Dec 2004 15:19:46 +1300, "Tony Meyer" <tameyer at ihug.co.nz> wrote:

>[Kenny]
>>> The "train-on-mistakes-and-unsures" strategy implemented in 
>>> the Outlook addin is believed to be the most effective strategy
>>> for most general users.
>
>[Mathew]
>> Is that how the automated training is implemented in the 
>> latest CVS versions?
>
>What automated training do you mean?  We don't have any automated training
>(other than via command-line scripts), do we?

I mean the "Start Training" button on the Training tab. I had always
assumed that it trains on everything in the selected folders.

>> I was thinking that the "train on mistakes" approach could be 
>> taken a step further, down to the individual token level: all 
>> encountered tokens are stored in the database, but only 
>> "activated" for filtering when found to be required to filter 
>> correctly; that is, when a mistake is found, tokens are 
>> activated in order of decreasing significance until 
>> classification is correct. Has anyone tried anything like this?
>
>This sounds reasonably similar to "train to exhaustion", which is one of the
>best training methods.  SpamBayes has pretty limited support for this at the
>moment, but that is changing.  However, it's still on a message-by-message
>basis (i.e. train one message, see if that helps, train one more, see if
>that helps, etc).  Doing it per token would take a *long* time - it would
>have to be of great benefit.

I wasn't necessarily thinking of train to exhaustion - it could still be
single-pass, the same as "train on mistakes". It would just be more
selective about which tokens it uses when correcting a mistake.
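
To make the idea concrete, here is a rough single-pass sketch. It is a toy,
not SpamBayes code: the class name, the log-odds scoring (SpamBayes itself
uses chi-squared combining), the add-one smoothing and the "score > 0 means
spam" threshold are all assumptions made up for illustration.

```python
import math
from collections import Counter


class TokenActivatingFilter:
    """Toy filter (hypothetical): all tokens are counted, only active ones vote."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spam messages containing it
        self.ham = Counter()   # token -> number of ham messages containing it
        self.nspam = 0
        self.nham = 0
        self.active = set()    # tokens allowed to influence classification

    def _log_odds(self, token):
        # Smoothed log-odds of spam vs ham for one token (assumed scoring).
        p_spam = (self.spam[token] + 1) / (self.nspam + 2)
        p_ham = (self.ham[token] + 1) / (self.nham + 2)
        return math.log(p_spam / p_ham)

    def is_spam(self, tokens):
        # Only active tokens contribute; an empty sum (0) counts as ham.
        score = sum(self._log_odds(t) for t in set(tokens) if t in self.active)
        return score > 0

    def learn(self, tokens, is_spam):
        """Record the whole message, then activate tokens only on a mistake.

        Returns the number of tokens newly activated.
        """
        tokens = set(tokens)
        (self.spam if is_spam else self.ham).update(tokens)
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        if self.is_spam(tokens) == is_spam:
            return 0  # already classified correctly, nothing to activate
        # Most significant first; ties broken alphabetically for determinism.
        candidates = sorted(tokens - self.active,
                            key=lambda t: (-abs(self._log_odds(t)), t))
        activated = 0
        for token in candidates:
            self.active.add(token)
            activated += 1
            if self.is_spam(tokens) == is_spam:
                break
        return activated


f = TokenActivatingFilter()
first = f.learn(["viagra", "cheap", "hello"], is_spam=True)
second = f.learn(["hello", "meeting"], is_spam=False)
```

Counts are still stored for every token of every trained message, so the
database stays message-based; only the set of tokens consulted at
classification time is restricted.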

>This is also difficult with SpamBayes specifically, because there is an
>assumption that tokens come in a message 'bag'.  This means it's easy to
>remove messages from the database, as long as it's the whole message.
>Changing token counts outside of message 'bags' would cause problems
>(negative counts, etc) if the two schemes were mixed.

Ah, ok.
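
If I follow the 'bag' point correctly, a toy illustration would be something
like the following. This is not the SpamBayes database API: `BagDatabase`
and its methods are hypothetical names, and it only shows how removing a
whole message goes wrong once some counts were added token by token.

```python
from collections import Counter


class BagDatabase:
    """Toy token store (hypothetical) that trains and untrains whole messages."""

    def __init__(self):
        self.counts = Counter()  # token -> number of trained messages with it
        self.nmessages = 0

    def learn(self, tokens):
        # Each distinct token in the message 'bag' is counted once.
        for token in set(tokens):
            self.counts[token] += 1
        self.nmessages += 1

    def unlearn(self, tokens):
        # Assumes the same bag was learned earlier.
        for token in set(tokens):
            self.counts[token] -= 1
        self.nmessages -= 1

    def negative_tokens(self):
        return sorted(t for t, c in self.counts.items() if c < 0)


msg = ["cheap", "pills"]

# Whole-message scheme only: learn then unlearn returns every count to zero.
clean = BagDatabase()
clean.learn(msg)
clean.unlearn(msg)

# Mixed schemes: only "cheap" was counted for this message (per-token
# update), but the message is later removed as a whole bag.
mixed = BagDatabase()
mixed.counts["cheap"] += 1
mixed.nmessages += 1
mixed.unlearn(msg)
```

In the mixed case "pills" ends up with a count of -1, which is the kind of
inconsistency described above.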

-- Mat.



