[Spambayes] Re: Training Spambayes
Mathew Hendry
TJLWBECGSGWU at spammotel.com
Mon Dec 20 19:33:55 CET 2004
On Mon, 20 Dec 2004 15:19:46 +1300, "Tony Meyer" <tameyer at ihug.co.nz> wrote:
>[Kenny]
>>> The "train-on-mistakes-and-unsures" strategy implemented in
>>> the Outlook addin is believed to be the most effective strategy
>>> for most general users.
>
>[Mathew]
>> Is that how the automated training is implemented in the
>> latest CVS versions?
>
>What automated training do you mean? We don't have any automated training
>(other than via command-line scripts), do we?
I mean the "Start Training" button on the training tab. I always assumed
that it trained on every message in the selected folders.
>> I was thinking that the "train on mistakes" approach could be
>> taken a step further, down to the individual token level: all
>> encountered tokens are stored in the database, but only
>> "activated" for filtering when found to be required to filter
>> correctly; that is, when a mistake is found, tokens are
>> activated in order of decreasing significance until
>> classification is correct. Has anyone tried anything like this?
>
>This sounds reasonably similar to "train to exhaustion", which is one of the
>best training methods. SpamBayes has pretty limited support for this at the
>moment, but that is changing. However, it's still on a message-by-message
>basis (i.e. train one message, see if that helps, train one more, see if
>that helps, etc). Doing it per token would take a *long* time - it would
>have to be of great benefit.
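For reference, here is how I understand the message-by-message scheme, as a
rough sketch. ToyClassifier, its averaging-free count ratio score, and
train_to_exhaustion are names and simplifications I made up for illustration;
none of this is the real SpamBayes API or its scoring maths.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a token-count classifier (not SpamBayes)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        # Each message contributes at most one count per distinct token.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Crude score: share of spam counts among the message's known tokens.
        s = sum(self.spam[t] for t in set(tokens))
        h = sum(self.ham[t] for t in set(tokens))
        return 0.5 if s + h == 0 else s / (s + h)


def train_to_exhaustion(clf, corpus, max_rounds=10, cutoff=0.5):
    """Repeat train-on-mistakes passes until a pass needs no corrections.

    Returns the number of passes made over the corpus.
    """
    for round_no in range(1, max_rounds + 1):
        mistakes = 0
        for tokens, is_spam in corpus:
            score = clf.score(tokens)
            wrong = (score <= cutoff) if is_spam else (score >= cutoff)
            if wrong:
                clf.learn(tokens, is_spam)
                mistakes += 1
        if mistakes == 0:
            break
    return round_no
```

Every pass rescores the whole corpus, which is why doing the same thing one
token at a time would be so much slower.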
I wasn't necessarily thinking of train to exhaustion - it could still be
single-pass, the same as "train on mistakes". It would just be more
selective about which tokens it uses when correcting a mistake.
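To make the idea concrete, here is a minimal single-pass sketch of what I
have in mind. Everything in it is hypothetical: SelectiveClassifier, the
"active" set, and the simple averaging of per-token probabilities are my own
stand-ins, not how SpamBayes combines evidence.

```python
from collections import Counter


class SelectiveClassifier:
    """Counts every token seen, but scores using only 'activated' tokens."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.active = set()

    def prob(self, tok):
        s, h = self.spam[tok], self.ham[tok]
        return 0.5 if s + h == 0 else s / (s + h)

    def score(self, tokens):
        # Average the spam probability of the active tokens only.
        probs = [self.prob(t) for t in set(tokens) if t in self.active]
        return 0.5 if not probs else sum(probs) / len(probs)

    def is_correct(self, tokens, is_spam, cutoff=0.5):
        s = self.score(tokens)
        return s > cutoff if is_spam else s < cutoff

    def process(self, tokens, is_spam):
        """Handle one message; return the tokens newly activated for it."""
        was_correct = self.is_correct(tokens, is_spam)
        # All encountered tokens are stored, whether or not they are active.
        (self.spam if is_spam else self.ham).update(set(tokens))
        activated = []
        if was_correct:
            return activated
        # On a mistake, activate tokens in order of decreasing significance
        # (distance of the token probability from the neutral 0.5).
        candidates = sorted(set(tokens) - self.active,
                            key=lambda t: (-abs(self.prob(t) - 0.5), t))
        for tok in candidates:
            if self.is_correct(tokens, is_spam):
                break
            self.active.add(tok)
            activated.append(tok)
        return activated
```

Each message is seen once, and a mistake only switches on as many tokens as
are needed to fix it.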
>This is also difficult with SpamBayes specifically, because there is an
>assumption that tokens come in a message 'bag'. This means it's easy to
>remove messages from the database, as long as it's the whole message.
>Changing token counts outside of message 'bags' would cause problems
>(negative counts, etc) if the two schemes were mixed.
Ah, ok.
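For anyone following along, the problem can be shown with a toy count table.
This is only an illustration of the 'bag' assumption; spam_counts, train and
untrain are made-up names, not the real SpamBayes storage code.

```python
from collections import Counter

spam_counts = Counter()


def train(tokens):
    # A message is trained as a whole bag: one count per distinct token.
    spam_counts.update(set(tokens))


def untrain(tokens):
    # Removing a message subtracts exactly the same bag again.
    spam_counts.subtract(set(tokens))


msg = ["cheap", "pills", "now"]

# Whole-message training and untraining cancel out cleanly.
train(msg)
untrain(msg)
all_zero = all(count == 0 for count in spam_counts.values())

# Mixing in a per-token change breaks that: the later untrain
# subtracts a count that has already been taken away.
train(msg)
spam_counts["cheap"] -= 1  # hypothetical token-level adjustment
untrain(msg)
negative = {tok: c for tok, c in spam_counts.items() if c < 0}
# negative is now {"cheap": -1}
```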
-- Mat.