[Spambayes] Corpus modules

Wed Nov 13 17:46:24 2002

In message:  <9891913C5BFE87429D71E37F08210CB9183A0D@zeus.sfhq.friskit.com>
             "Piers Haken" <piersh@friskit.com> writes:
>
>> -----Original Message-----
>> From: T. Alexander Popiel [mailto:popiel@wolfskeep.com]
>>
>> Also, when a training (or untraining) event occurs, I
>> completely trash the second database.  This is warranted in
>> most cases, since the number of spam and/or ham has changed,
>> and thus (almost) all the spamprobs are invalidated. This
>> saves us from needing a dirty flag.
>
>Ouch, isn't this overly expensive for retraining a single message?

No, not really.  That's the whole point; throwing away the entire
database is a lot cheaper than touching every record individually,
which is what update_probabilities does.  I then compute the
spamprobs on demand, instead of computing all of them regardless
of whether they're used.

If you don't throw away the old spamprobs in some form when you
(re)train a message, then you're getting invalid results from
the scoring mechanism.  The mechanism I outlined achieves
correctness in the face of dynamically changing training data
with less than a 5% speed penalty, worst case.
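
A minimal sketch of the scheme described above, for illustration only.
The class name, method names, and the in-memory dicts standing in for
the two databases are all hypothetical, and the probability formula is
a simple placeholder ratio, not the actual Spambayes calculation.

```python
class LazyProbStore:
    """Hypothetical two-store layout: raw counts plus a spamprob cache."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.counts = {}  # first "database": word -> [spam count, ham count]
        self.probs = {}   # second "database": word -> cached spamprob

    def train(self, words, is_spam):
        self._adjust(words, is_spam, +1)

    def untrain(self, words, is_spam):
        self._adjust(words, is_spam, -1)

    def _adjust(self, words, is_spam, delta):
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta
        column = 0 if is_spam else 1
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[column] += delta
        # nspam or nham changed, so (almost) every cached spamprob is
        # stale.  Dropping the whole cache is one cheap operation, and
        # it means no per-record dirty flag is needed.
        self.probs.clear()

    def spamprob(self, word):
        # Compute on demand; only words that are actually scored pay.
        try:
            return self.probs[word]
        except KeyError:
            prob = self.probs[word] = self._compute(word)
            return prob

    def _compute(self, word):
        # Placeholder formula (an assumption, not the real one).
        spam, ham = self.counts.get(word, (0, 0))
        spamratio = spam / self.nspam if self.nspam else 0.0
        hamratio = ham / self.nham if self.nham else 0.0
        if spamratio + hamratio == 0:
            return 0.5  # unknown word: neutral score
        return spamratio / (spamratio + hamratio)
```

Retraining a single message then costs one count update per word plus
one cache clear; the recomputation cost is spread over later lookups
and is only paid for words that are actually looked up.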

- Alex