[spambayes-dev] Re: [Spambayes] Database cleaning?

Matthew Dixon Cowles matt at mondoinfo.com
Tue Jun 3 16:01:56 EDT 2003

[Alex Popiel]
> Interesting.  If you plot this (and your other data, scaled
> suitably to reflect the different bucket sizes) on log-log
> axes, then you get a straight line (up to the point that the
> data becomes too sparse to be useful).

The data seems remarkably uniform to me.

> I really ought to instrument my own test db; the next question
> I have is "What do the numbers become if you only count uses
> where the word prob was outside .4-.6?"

My data is a record of tokens that were actually used in scoring so I
think they were outside [.4-.6]. At least, I haven't fiddled with

> I think I'm trying to narrow in on a pruning criterion along the
> lines of "If it hasn't contributed to classification more than once
> every N days (on average), then it's safe to drop it."
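
A minimal sketch of that criterion, in Python. None of this is SpamBayes
code: the function name, the two dictionaries (use counts and first-seen
times per token), and the rule that tokens younger than N days are left
alone are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def prunable_tokens(use_count, first_seen, now, n_days):
    """Return tokens used in scoring no more than once per n_days on average.

    use_count:  dict token -> times the token figured in scoring (hypothetical)
    first_seen: dict token -> datetime the token entered the db (hypothetical)
    """
    prunable = []
    for token, first in first_seen.items():
        age_days = (now - first).total_seconds() / 86400.0
        # Assumption: a token younger than n_days is too new to judge.
        if age_days < n_days:
            continue
        uses = use_count.get(token, 0)
        # "More than once every N days" means uses / age_days > 1 / n_days.
        # Anything at or below that rate is a candidate for dropping.
        if uses * n_days <= age_days:
            prunable.append(token)
    return sorted(prunable)


if __name__ == "__main__":
    now = datetime(2003, 6, 3)
    first_seen = {
        "viagra": now - timedelta(days=30),
        "rare": now - timedelta(days=30),
        "new": now - timedelta(days=2),
    }
    use_count = {"viagra": 10, "rare": 1}
    print(prunable_tokens(use_count, first_seen, now, n_days=7))
```

With N = 7, a token seen thirty days ago and used once is a candidate,
one used ten times is kept, and a two-day-old token is skipped.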

I fiddled with my classifier last night so that, whenever it computes a
score, it also computes several alternative scores and logs them. The
alternative scores ignore words that haven't figured in scoring in the
last day, week, two weeks, and thirty days, respectively.
A random scroll through the results suggests that they look pretty
promising so far. Of course, given that people report good results
even with minimal training, I guess that's not too surprising.
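
Roughly, the experiment has the following shape. This is a sketch under
stated assumptions, not the actual modification: the names are made up,
the last-used timestamps are assumed to be kept in a plain dictionary,
and a simple Graham-style product stands in for the classifier's real
combining function.

```python
from datetime import datetime, timedelta

# The four windows mentioned above.
CUTOFFS = {
    "1 day": timedelta(days=1),
    "1 week": timedelta(weeks=1),
    "2 weeks": timedelta(weeks=2),
    "30 days": timedelta(days=30),
}


def combine(probs):
    """Stand-in combiner (Graham-style product), not the real scoring code."""
    if not probs:
        return 0.5  # no evidence either way
    spam = ham = 1.0
    for p in probs:
        p = min(max(p, 0.01), 0.99)  # clamp so neither product hits zero
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


def scores_with_cutoffs(tokens, word_prob, last_used, now):
    """Return the usual score plus one score per cutoff window.

    word_prob: dict token -> spam probability (hypothetical)
    last_used: dict token -> datetime it last figured in scoring (hypothetical)
    """
    known = [t for t in tokens if t in word_prob]
    results = {"all": combine([word_prob[t] for t in known])}
    for label, window in CUTOFFS.items():
        # Ignore tokens that have not figured in scoring within the window.
        fresh = [
            word_prob[t]
            for t in known
            if now - last_used.get(t, datetime.min) <= window
        ]
        results[label] = combine(fresh)
    return results


if __name__ == "__main__":
    now = datetime(2003, 6, 3, 16, 0)
    word_prob = {"free": 0.9, "meeting": 0.1}
    last_used = {
        "free": now - timedelta(hours=2),
        "meeting": now - timedelta(days=10),
    }
    for label, score in scores_with_cutoffs(
        ["free", "meeting"], word_prob, last_used, now
    ).items():
        print(label, round(score, 3))
```

Logging all five numbers side by side for each message makes it easy to
see how often dropping stale words would have changed the outcome.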


