[spambayes-dev] Re: [Spambayes] Database cleaning?
T. Alexander Popiel
popiel at wolfskeep.com
Mon Jun 2 23:12:19 EDT 2003
In message: <1054609440.76.1032 at sake.mondoinfo.com>
Matthew Dixon Cowles <matt at mondoinfo.com> writes:
>[me]
>> 0-500 uses 62929
>
>[Alex Popiel]
>> Could you give more detail on this bucket? Over 99% of your
>> tokens are here.
>
>Sure. It really should have said 0-499 but I'm sure that everyone
>figured that out. Here it is by 50s. The total is slightly larger
>since I've gotten some mail since the last count.
>
>  0-49  uses 60444
> 50-99  uses  1403
>100-149 uses   466
>150-199 uses   217
>200-249 uses   141
>250-299 uses    78
>300-349 uses    78
>350-399 uses    50
>400-449 uses    38
>450-499 uses    28
Interesting. If you plot this (and your other data, scaled
suitably to reflect the different bucket sizes) on log-log
axes, then you get a straight line, i.e. roughly a power law
(up to the point that the data become too sparse to be useful).
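
A rough way to check the log-log observation against the counts quoted above is to divide each bucket's count by its width and fit a least-squares line to the logs. This is only a sketch: using the arithmetic midpoint of each bucket as its x value is an assumption, and the helper names `loglog_points` and `fit_line` are made up for illustration.

```python
import math

# Bucket counts quoted above: (lowest use count, highest use count, tokens).
buckets = [
    (0, 49, 60444), (50, 99, 1403), (100, 149, 466), (150, 199, 217),
    (200, 249, 141), (250, 299, 78), (300, 349, 78), (350, 399, 50),
    (400, 449, 38), (450, 499, 28),
]


def loglog_points(buckets):
    """Return (log10 midpoint, log10 tokens-per-unit-width) for each bucket."""
    points = []
    for low, high, count in buckets:
        width = high - low + 1          # scale by bucket size
        mid = (low + high) / 2.0        # assumption: midpoint stands for the bucket
        density = count / width
        points.append((math.log10(mid), math.log10(density)))
    return points


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


points = loglog_points(buckets)
slope, intercept = fit_line(points)
print("log-log slope %.2f, intercept %.2f" % (slope, intercept))
```

Dividing by the bucket width is what lets buckets of different sizes share one plot; a clearly negative slope with small residuals would be consistent with the straight line described above.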
I really ought to instrument my own test db; the next question
I have is "What do the numbers become if you only count uses
where the word prob was outside .4-.6?" Hrm. I think I'm
trying to narrow in on a pruning criterion along the lines of
"If it hasn't contributed to classification more than once
every N days (on average), then it's safe to drop it."
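
To make the proposed criterion concrete, here is a minimal sketch of how it might be tested. Nothing here is spambayes code: `TokenStats`, `contributes`, `record_use` and `safe_to_drop` are hypothetical names, and it assumes each token keeps a count of uses where its prob fell outside .4-.6 together with its age in days.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    """Hypothetical per-token bookkeeping; not a real spambayes structure."""
    contributing_uses: int = 0   # uses where the prob was outside .4-.6
    age_days: float = 0.0        # days since the token was first seen


def contributes(prob, low=0.4, high=0.6):
    """True if the word prob is outside the uninformative .4-.6 band."""
    return prob < low or prob > high


def record_use(stats, prob):
    """Count a use only when the token could have affected the score."""
    if contributes(prob):
        stats.contributing_uses += 1


def safe_to_drop(stats, n_days):
    """Drop if the token has not contributed more than once per n_days on average."""
    if stats.age_days <= 0:
        return False             # assumption: never prune a brand-new token
    # contributing_uses / age_days <= 1 / n_days, written without division
    return stats.contributing_uses * n_days <= stats.age_days


stats = TokenStats(age_days=100)
for prob in (0.5, 0.95, 0.1):
    record_use(stats, prob)
print(stats.contributing_uses, safe_to_drop(stats, 30), safe_to_drop(stats, 60))
```

Whether the bounds .4 and .6 themselves count as "outside" is a judgment call; this sketch treats them as inside the band.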
- Alex