[Spambayes] Corpus modules
T. Alexander Popiel
Wed Nov 13 17:46:24 2002
In message: <9891913C5BFE87429D71E37F08210CB9183A0D@zeus.sfhq.friskit.com>
"Piers Haken" <email@example.com> writes:
>> -----Original Message-----
>> From: T. Alexander Popiel [mailto:firstname.lastname@example.org]
>> Also, when a training (or untraining) event occurs, I
>> completely trash the second database. This is warranted in
>> most cases, since the number of spam and/or ham has changed,
>> and thus (almost) all the spamprobs are invalidated. This=20
>> saves us from needing a dirty flag.
>Ouch, isn't this overly expensive for retraining a single message?
No, not really. That's the whole point; throwing away the entire
database is a lot cheaper than touching every record individually,
which is what update_probabilities does. I then compute the
spamprobs on demand, instead of doing all of them regardless of
whether they are ever needed.
If you don't throw away the old spamprobs in some form when you
(re)train a message, then you're getting invalid results from
the scoring mechanism. The mechanism I outlined achieves
correctness in the face of dynamically changing training data
with less than a 5% speed penalty, worst case.
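A minimal sketch of the scheme described above, for illustration only.
It is not the actual Spambayes code: the class name LazyProbCache, the
two in-memory dicts standing in for the two databases, and the
parameter names are all hypothetical. The probability formula is a
Robinson-style adjusted ratio, and the constants 0.5 and 0.45 are
assumed defaults.

```python
class LazyProbCache:
    """Hypothetical sketch: raw counts in one store, derived spamprobs in another."""

    def __init__(self, unknown_prob=0.5, strength=0.45):
        self.counts = {}   # word -> [spam_count, ham_count]; the "first database"
        self.probs = {}    # word -> cached spamprob; the "second database"
        self.nspam = 0     # number of spam messages trained
        self.nham = 0      # number of ham messages trained
        self.unknown_prob = unknown_prob  # assumed prior for unseen words
        self.strength = strength          # assumed weight given to that prior

    def train(self, words, is_spam, sign=1):
        """Train (sign=1) or untrain (sign=-1) on one message."""
        if is_spam:
            self.nspam += sign
        else:
            self.nham += sign
        idx = 0 if is_spam else 1
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[idx] += sign
        # nspam or nham changed, so (almost) every cached spamprob is stale.
        # Dropping the whole cache is cheaper than rewriting every record,
        # and it removes the need for a dirty flag.
        self.probs.clear()

    def untrain(self, words, is_spam):
        self.train(words, is_spam, sign=-1)

    def spamprob(self, word):
        """Return the spamprob for a word, computing and caching it on demand."""
        if word in self.probs:
            return self.probs[word]
        spam_count, ham_count = self.counts.get(word, (0, 0))
        spam_ratio = spam_count / self.nspam if self.nspam else 0.0
        ham_ratio = ham_count / self.nham if self.nham else 0.0
        if spam_ratio + ham_ratio == 0:
            prob = self.unknown_prob
        else:
            raw = spam_ratio / (spam_ratio + ham_ratio)
            n = spam_count + ham_count
            # Pull the raw estimate toward the prior when evidence is thin.
            prob = (self.strength * self.unknown_prob + n * raw) / (self.strength + n)
        self.probs[word] = prob
        return prob
```

Note that training on a message changes the spamprob even of words
that do not appear in it, because the ratios depend on the message
totals. That is why the whole cache is dropped rather than only the
entries for the words just trained.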