[spambayes-dev] Re: [Spambayes] Database cleaning?
Matthew Dixon Cowles
matt at mondoinfo.com
Mon Jun 2 21:36:02 EDT 2003
> What I'm suggesting is having each token keep track of its usage
> frequency, and then building a histogram of token vs. frequency,
> with each token only contributing once to the chart. This would
> give an idea of what percentage of tokens are used a lot, as opposed
> to what you've got now (which says that for tokens that are used,
> most will be used again soon).
Here you go. Though this one doesn't seem to be worth a histogram:
Over 30.0 days, 63209 tokens were used in scoring a total of 1107800 times
Largest number of uses 11144, smallest 1
0-500       uses 62929
500-1000    uses   145
1000-1500   uses    36
1500-2000   uses    26
2000-2500   uses    27
2500-3000   uses    10
3000-3500   uses     3
3500-4000   uses     3
4000-4500   uses    18
4500-5000   uses     4
5000-5500   uses     2
5500-6000   uses     1
6000-6500   uses     2
6500-7000   uses     1
7000-7500   uses     0
7500-8000   uses     1
8000-8500   uses     0
8500-9000   uses     0
9000-9500   uses     0
9500-10000  uses     0
10000-10500 uses     0
10500-11000 uses     0
11000-11500 uses     1
The token that was used 11144 times was "content-type:text/plain",
and the next most commonly used one was "subject:: ".
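
For anyone who wants to reproduce this kind of tally, here is a minimal
sketch in Python. It is not the script that produced the numbers above.
It assumes the per-token use counts are already available as a plain
dict mapping token to number of uses, and it assumes each bin is
half-open (a token used exactly 500 times lands in 500-1000). The names
bin_usage and report are made up for illustration.

```python
from collections import Counter


def bin_usage(use_counts, width=500):
    """Count how many tokens fall into each bin of `width` uses.

    use_counts: dict mapping token -> number of times it was used.
    Returns a Counter mapping bin start -> number of tokens.
    Assumption: bins are half-open, [start, start + width).
    """
    bins = Counter()
    for uses in use_counts.values():
        bins[(uses // width) * width] += 1
    return bins


def report(use_counts, width=500):
    """Return one line per bin, including empty bins, lowest bin first."""
    bins = bin_usage(use_counts, width)
    if not bins:
        return []
    lines = []
    # Walk every bin up to the highest occupied one so gaps show as 0.
    for start in range(0, max(bins) + width, width):
        lines.append("%d-%d uses %d" % (start, start + width, bins.get(start, 0)))
    return lines
```

Each token contributes exactly once, to the bin matching its total use
count, which is what the quoted suggestion asks for.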
Regards,
Matt