[Spambayes] Some more experiences with the Outlook plugin

Tim Peters tim.one@comcast.net
Wed Nov 13 05:45:30 2002


[Paul Moore]
> * Following on from this, I also see Tim's behaviour of surprising
>   unsure cases (or worse, false negatives!). Worst case recently was a
>   message which scored as solid ham. I trained on it as "Spam", and
>   rescored it. It still scored 5 - solid ham.

[Mark Hammond]
> This too was my experience.  For a while, I did training over a huge
> ham corpus, and spam is still less than 1000 messages.  I had around
> 15:1 ham:spam.  I too trained new ham and spam, and was disappointed
> to see the score remain almost identical.

Almost identical or exactly identical?  I wasn't looking over your
shoulders, so it's hard to guess <wink>.  I've been noticing that, in my
still heavily hapax-driven teensy classifier, the auto-rescore feature of
the Outlook client never seemed to change my scores either, and for a
hapax-driven classifier that's bizarre.  It turns out that was because it
actually didn't change scores:  the probabilities didn't get updated after
training on the reclassified msg, so "the new score" was in fact exactly
equal to "the old score".  I just checked in a fix for that (unique to the
Outlook client).
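
To make that failure mode concrete, here is a minimal sketch in Python.  It
is not the Spambayes code:  ToyClassifier, its plain-average score, and the
sample words are all invented for illustration, and scores run 0.0-1.0
rather than 0-100.  Only the names update_probabilities() and
unknown_word_prob are borrowed from the description here.  The point is just
that if per-word probabilities are cached and training touches only the
counts, rescoring before the refresh returns exactly the old score.

```python
class ToyClassifier:
    """Hypothetical stand-in for illustration; not the Spambayes classifier."""

    unknown_word_prob = 0.5

    def __init__(self):
        self.counts = {}   # word -> [spam_count, ham_count]
        self.probs = {}    # word -> cached spam probability
        self.nspam = 0
        self.nham = 0

    def learn(self, words, is_spam):
        # Training updates the raw counts only; the cached probs go stale.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for w in set(words):
            rec = self.counts.setdefault(w, [0, 0])
            rec[0 if is_spam else 1] += 1

    def update_probabilities(self):
        # Recompute every cached probability from the current counts.
        for w, (s, h) in self.counts.items():
            spam_ratio = s / max(self.nspam, 1)
            ham_ratio = h / max(self.nham, 1)
            self.probs[w] = spam_ratio / (spam_ratio + ham_ratio)

    def score(self, words):
        # Assumption: a plain average, far simpler than the real combining.
        ps = [self.probs.get(w, self.unknown_word_prob) for w in set(words)]
        return sum(ps) / len(ps) if ps else self.unknown_word_prob


c = ToyClassifier()
c.learn(["meeting", "agenda"], is_spam=False)
c.learn(["cheap", "pills"], is_spam=True)
c.update_probabilities()

msg = ["meeting", "agenda", "refinance"]
before = c.score(msg)          # looks like ham

c.learn(msg, is_spam=True)     # user reclassifies it as spam
stale = c.score(msg)           # the bug: probabilities not refreshed yet

c.update_probabilities()       # the fix: refresh before rescoring
fresh = c.score(msg)

print(before, stale, fresh)
```

In this toy, stale comes out identical to before, and the score moves only
once update_probabilities() has run.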

BTW, another buglet here looks harder to fix:  if you do a retrain from
scratch in the client, all email that comes in *while* training is in
progress gets scored at exactly 50.  That's because the half-built database
is already used for scoring during the rebuild, but isn't useful until the
rebuild finishes.  It won't blow up, but every word has unknown_word_prob
before .update_probabilities() gets called at the end.

So it would be good to retain the old database for concurrent scoring
purposes until the new one is ready to use, or it would be good to delay
scoring msgs until training is complete.  I've refrained from "doing
something" about this because it seems like it would be easy to do after
some mechanism is in place for scanning for unrated msgs at startup (i.e.,
folder events could be disabled for the duration of from-scratch training,
then re-enabled after, and the scan-for-unrated machinery kicked into action
again).
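
The first option (keep scoring against the old database until the new one is
ready) can be sketched as follows.  This is only an illustration under
stated assumptions:  ToyDatabase, Scorer, retrain_from_scratch() and the
on_progress hook are hypothetical names, not anything in the Outlook client,
and scores run 0.0-1.0 rather than 0-100.  The new database is built off to
the side and swapped in only after its probabilities have been computed.

```python
import threading


class ToyDatabase:
    """Hypothetical word database; not the Spambayes one."""

    unknown_word_prob = 0.5

    def __init__(self):
        self.counts = {}   # word -> [spam_count, ham_count]
        self.probs = {}    # word -> cached spam probability

    def learn(self, words, is_spam):
        for w in set(words):
            rec = self.counts.setdefault(w, [0, 0])
            rec[0 if is_spam else 1] += 1

    def update_probabilities(self):
        for w, (s, h) in self.counts.items():
            self.probs[w] = s / (s + h)

    def score(self, words):
        ps = [self.probs.get(w, self.unknown_word_prob) for w in set(words)]
        return sum(ps) / len(ps) if ps else self.unknown_word_prob


class Scorer:
    """Scores against the active database; retraining never exposes a
    half-built one."""

    def __init__(self, db):
        self._db = db
        self._lock = threading.Lock()

    def score(self, words):
        with self._lock:
            db = self._db          # grab whichever database is active
        return db.score(words)

    def retrain_from_scratch(self, corpus, on_progress=None):
        fresh = ToyDatabase()      # built off to the side
        for words, is_spam in corpus:
            fresh.learn(words, is_spam)
            if on_progress:
                on_progress()      # stands in for mail arriving mid-training
        fresh.update_probabilities()
        with self._lock:
            self._db = fresh       # swap only once it is usable


corpus = [(["cheap", "pills"], True), (["meeting", "agenda"], False)]

old = ToyDatabase()
for words, is_spam in corpus:
    old.learn(words, is_spam)
old.update_probabilities()

# What the buglet looks like: a database with counts but no probabilities.
half_built = ToyDatabase()
half_built.learn(["cheap", "pills"], True)
neutral = half_built.score(["cheap", "pills"])

scorer = Scorer(old)
during = []
scorer.retrain_from_scratch(
    corpus + [(["cheap", "refinance"], True)],
    on_progress=lambda: during.append(scorer.score(["cheap", "pills"])),
)
after = scorer.score(["refinance"])
print(neutral, during, after)
```

Here the half-built database scores everything at the neutral value, while
messages scored through Scorer during the rebuild still get the old
database's answer.  The other option, delaying scoring until training is
complete, is not shown.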



