Tim Peters tim.one@comcast.net
Mon Nov 4 21:06:53 2002

[Jeremy Hylton]
> I've been training, of late, on a growing sample of my incoming
> email.

Good -- I knew I could browbeat you into that <wink>.

> At the moment just a few hundred of each ham and spam.  It has done
> moderately well.  Apparently the Carter spam used to trigger on
> words in the old archives I was using -- and the new smaller training
> database just doesn't have many occurrences of those words.

Expiring words over time is something that should be done with ongoing
training too ("database pruning").  There's been no progress on that.

> The osaf lists are for Kapor et al.'s new PIM.  I've got 24 messages
> from those lists in my ham training set, but it hasn't been enough to
> get the scores reliably below 0.1.

With just a few hundred training msgs, that's very surprising to me, and
especially since the one example I've seen scored very solidly as ham under
my classifier (which had not been trained on any of these things).  Could
there be a persistence glitch such that training isn't "taking hold"?

I just looked, and noticed that _remove_msg() didn't do the

    self.wordinfo[word] = record

bit at the end which may be needed to tell a persistent DB that the content
of *record* changed.  Then untraining a msg would screw things up, by
decrementing the nspam or nham count but not reducing the word counts to
match.  I'll check in a fix for that now.  Maybe there are other places
"like that".

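A minimal sketch of that failure mode, for anyone who wants to see it
happen.  It uses the standard shelve module as a stand-in for the
persistent DB, which is an assumption; untrain_word(), demo(), the sample
word, and the dict-shaped record are all made up for illustration and are
not the classifier's real code.  The point is only that mutating a fetched
record changes a copy, and the store sees nothing until the record is
assigned back.

```python
import os
import shelve
import tempfile


def untrain_word(db, word, write_back):
    """Decrement a word's spam count (hypothetical helper, not the real _remove_msg)."""
    record = db[word]            # a plain shelf hands back an unpickled copy
    record["spamcount"] -= 1     # so this mutates the copy only
    if write_back:
        db[word] = record        # assigning back is what stores the change


def demo():
    """Return the stored spam count after each flavor of untraining."""
    path = os.path.join(tempfile.mkdtemp(), "wordinfo")
    with shelve.open(path) as db:    # writeback defaults to False
        db["example"] = {"spamcount": 3, "hamcount": 0}

        untrain_word(db, "example", write_back=False)
        without_assignment = db["example"]["spamcount"]

        untrain_word(db, "example", write_back=True)
        with_assignment = db["example"]["spamcount"]
    return without_assignment, with_assignment


if __name__ == "__main__":
    print(demo())
```

With a plain in-memory dict the assignment is redundant, which is how an
omission like this can go unnoticed until a persistent store is plugged in.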
