[spambayes-dev] Re: [Spambayes] Database cleaning?

Tim Peters tim.one at comcast.net
Sun Jun 1 01:39:45 EDT 2003


[Matthew Dixon Cowles]
> ...
> With an eye toward reducing the size of the database, I instrumented
> the classifier a while ago and found a very strong indication that
> that's true. Indeed, hapaxes often figured in scoring. I didn't
> bother to calculate exact numbers because the results were strong
> enough to persuade me that removing hapaxes wasn't a useful strategy.

The original spambayes code saved a time-of-last-access stamp in each
WordInfo record.  That was to support research into database cleaning
strategies.  The research never happened, though, and several WordInfo
members got tossed to reduce the database size.  If people want to start
research on this again, an official patch set to maintain this kind of info
in researchers' databases would be a real help.
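
For anyone picking this up, here is a minimal sketch of what keeping a
last-access stamp per token could look like. Everything in it is an
assumption made for illustration: TokenRecord, touch() and prune_stale() are
hypothetical names, the database is a plain dict, and none of this is the
actual spambayes WordInfo code or storage format.

```python
from dataclasses import dataclass


@dataclass
class TokenRecord:
    """Hypothetical per-token record; not the spambayes WordInfo class."""
    spamcount: int = 0
    hamcount: int = 0
    atime: float = 0.0  # time the token last took part in scoring


def touch(db, token, now):
    """Record that `token` was used in scoring at time `now`."""
    rec = db.get(token)
    if rec is not None:
        rec.atime = now


def prune_stale(db, cutoff):
    """Delete tokens not used since `cutoff`; return how many were removed."""
    stale = [tok for tok, rec in db.items() if rec.atime < cutoff]
    for tok in stale:
        del db[tok]
    return len(stale)
```

A researcher would call touch() from the scoring path and then compare
error rates before and after prune_stale() at various cutoffs.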

Earlier experiments showed that removing hapaxes was fine *if* you had
trained carefully on many thousands of messages chosen at random.  They also
showed that removing hapaxes was a disaster if you engaged in mistake-based
training alone (that is, never training on anything except misclassified
msgs, and possibly also unsures).  Mistake-based training leaves you with a
very small, and also very brittle (prone to major ongoing surprises),
database.
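
To make "removing hapaxes" concrete, here is a rough sketch that treats a
hapax as a token seen exactly once across all training. The dict layout
(token -> (spam_count, ham_count)) and both function names are assumptions
for illustration, not the spambayes database format. Checking what share of
the database is hapaxes before pruning is one way to judge how much a given
database leans on them.

```python
def remove_hapaxes(counts):
    """Return a copy of `counts` without tokens seen exactly once in training.

    `counts` maps token -> (spam_count, ham_count); this layout is assumed.
    """
    return {tok: (s, h) for tok, (s, h) in counts.items() if s + h != 1}


def hapax_fraction(counts):
    """Fraction of tokens in `counts` that were seen exactly once."""
    if not counts:
        return 0.0
    hapaxes = sum(1 for s, h in counts.values() if s + h == 1)
    return hapaxes / len(counts)
```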

In hindsight, I'd rephrase this to say that hapax-driven databases need
their hapaxes <wink>.



