[spambayes-dev] Re: [Spambayes] Database cleaning?

T. Alexander Popiel popiel at wolfskeep.com
Mon Jun 2 23:12:19 EDT 2003

In message:  <1054609440.76.1032 at sake.mondoinfo.com>
             Matthew Dixon Cowles <matt at mondoinfo.com> writes:
>>      0-500 uses 62929
>[Alex Popiel]
>> Could you give more detail on this bucket?  Over 99% of your
>> tokens are here.
>Sure. It really should have said 0-499 but I'm sure that everyone
>figured that out. Here it is by 50s. The total is slightly larger
>since I've gotten some mail since the last count.
>   0-49 uses 60444
>  50-99 uses 1403
>100-149 uses 466
>150-199 uses 217
>200-249 uses 141
>250-299 uses 78
>300-349 uses 78
>350-399 uses 50
>400-449 uses 38
>450-499 uses 28

Interesting.  If you plot this (and your other data, scaled
suitably to reflect the different bucket sizes) on log-log
axes, then you get a straight line (up to the point that the
data becomes too sparse to be useful).
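
One rough way to check the straight-line claim without plotting is to fit a least-squares line to the log-log points. The sketch below uses only the by-50s counts quoted above; the bucket midpoints, the per-unit-width scaling, and the helper names (loglog_points, fit_line) are assumptions made for illustration, not anything from SpamBayes.

```python
import math

# Counts quoted above: (low, high, number of tokens with that many uses).
BUCKETS = [
    (0, 49, 60444), (50, 99, 1403), (100, 149, 466), (150, 199, 217),
    (200, 249, 141), (250, 299, 78), (300, 349, 78), (350, 399, 50),
    (400, 449, 38), (450, 499, 28),
]


def loglog_points(buckets):
    """Return (log10 midpoint, log10 tokens-per-unit-width) for each bucket.

    Dividing by the bucket width is the "scaled suitably" step: it lets
    buckets of different sizes share one set of axes.
    """
    points = []
    for low, high, count in buckets:
        width = high - low + 1
        midpoint = (low + high) / 2
        points.append((math.log10(midpoint), math.log10(count / width)))
    return points


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = fit_line(loglog_points(BUCKETS))
print("slope=%.2f intercept=%.2f r^2=%.3f" % (slope, intercept, r_squared))
```

An r^2 close to 1 would support the straight-line reading; the sparse high-use buckets are where the fit should be expected to degrade.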

I really ought to instrument my own test db; the next question
I have is "What do the numbers become if you only count uses
where the word prob was outside .4-.6?"  Hrm.  I think I'm
trying to narrow in on a pruning criterion along the lines of
"If it hasn't contributed to classification more than once
every N days (on average), then it's safe to drop it."
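
A minimal sketch of what such a criterion might look like follows. Everything in it is hypothetical: TokenStats, record_use, and is_prunable are illustrative names rather than SpamBayes APIs, and it assumes the database could keep a per-token counter of uses made while the word prob was outside .4-.6, plus the number of days the token has been tracked.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    """Hypothetical per-token bookkeeping; not a SpamBayes structure."""
    contributing_uses: int = 0   # uses made while prob was outside the band
    days_tracked: float = 0.0    # how long this token has been in the db


def record_use(stats, prob, low=0.4, high=0.6):
    """Count a use only if the word prob was outside low..high at the time."""
    if prob < low or prob > high:
        stats.contributing_uses += 1


def is_prunable(stats, n_days):
    """True if the token has contributed no more than once per n_days on average.

    contributing_uses / days_tracked <= 1 / n_days, rearranged to avoid
    dividing by zero for a token that was only just added.
    """
    return stats.contributing_uses * n_days <= stats.days_tracked


# Example: a token seen twice in 30 days, only once with a decisive prob.
stats = TokenStats(days_tracked=30)
record_use(stats, 0.50)   # inside .4-.6, ignored
record_use(stats, 0.95)   # outside .4-.6, counted
print(is_prunable(stats, n_days=7))
```

Note that under this rule a token added moments ago with no contributing uses counts as prunable, so a real implementation would probably want a minimum age before pruning.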

- Alex

