[Spambayes] Frequency distribution for wordinfo counts?
Brad Clements
bkc at murkworks.com
Sat Feb 14 15:24:27 EST 2004
I'd like to get feedback from folks on the distribution of nham and nspam counts in their
wordinfo databases.
For example, I used sb_dbexpimp to dump my dbm-based storage, then loaded it into
Excel and built a histogram of nham and nspam.
Here's my nspam distribution, first in bin order:

Bin    Frequency   Percent of Total   Cumulative %
0          13272             39.36%         39.36%
1          14834             43.99%         83.36%
2           2534              7.52%         90.87%
3            957              2.84%         93.71%
4            535              1.59%         95.30%
5            310              0.92%         96.22%
10           655              1.94%         98.16%
20           323              0.96%         99.12%
40           166              0.49%         99.61%
80            79              0.23%         99.84%
160           23              0.07%         99.91%
320           23              0.07%         99.98%
640            7              0.02%        100.00%
More           0              0.00%        100.00%

And the same bins sorted by frequency:

Bin    Frequency   Cumulative %
1          14834         43.99%
0          13272         83.36%
2           2534         90.87%
3            957         93.71%
10           655         95.65%
4            535         97.24%
20           323         98.20%
5            310         99.12%
40           166         99.61%
80            79         99.84%
160           23         99.91%
320           23         99.98%
640            7        100.00%
More           0        100.00%
So, 44% of the spam tokens are hapaxes, for example.
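For anyone who would rather not go through Excel, here is a rough Python sketch of the same kind of binning. It is not part of Spambayes. The function name bin_counts is made up for illustration, and it assumes each bin value is an inclusive upper bound (which is how I understand Excel's histogram tool to behave). It takes a plain sequence of integer counts, so how you get them out of the sb_dbexpimp dump is left to you.

```python
from bisect import bisect_left
from collections import Counter

# Upper bin edges copied from the table above; anything larger lands in "More".
BIN_EDGES = [0, 1, 2, 3, 4, 5, 10, 20, 40, 80, 160, 320, 640]


def bin_counts(counts, edges=BIN_EDGES):
    """Return rows of (bin label, frequency, percent of total, cumulative percent).

    Assumption: a value v is counted in the first bin whose edge is >= v.
    """
    counts = list(counts)
    total = len(counts)
    # bisect_left gives the index of the first edge >= value,
    # or len(edges) when the value exceeds every edge ("More").
    tally = Counter(bisect_left(edges, c) for c in counts)
    labels = [str(e) for e in edges] + ["More"]

    rows = []
    running = 0
    for i, label in enumerate(labels):
        freq = tally.get(i, 0)
        running += freq
        pct = 100.0 * freq / total if total else 0.0
        cum = 100.0 * running / total if total else 0.0
        rows.append((label, freq, pct, cum))
    return rows


if __name__ == "__main__":
    # Tiny made-up sample, not real wordinfo data.
    for label, freq, pct, cum in bin_counts([0, 0, 1, 2, 7, 10, 11, 700]):
        print(f"{label:<6}{freq:>6}{pct:>9.2f}%{cum:>9.2f}%")
```

Feeding it the nspam column and then the nham column should give two tables in the same layout as the first one above.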
Anyway, what I'm interested in is how many tokens have an nspam or nham count
greater than 255, relative to the total number of tokens and the number of ham and
spam messages trained on.
In my case, only about 30 tokens (out of 33718) have either an nham or nspam count >
255.
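If it helps, the number I'm after could be computed with something like the sketch below. The function name count_over_threshold and the (token, nham, nspam) record layout are assumptions for illustration, not a Spambayes API; adapt the input to however you read your own database or dump.

```python
def count_over_threshold(records, threshold=255):
    """Return (tokens over threshold, total tokens).

    records is assumed to be an iterable of (token, nham, nspam) tuples.
    A token counts as "over" when either of its counts exceeds the threshold.
    """
    total = 0
    over = 0
    for _token, nham, nspam in records:
        total += 1
        if nham > threshold or nspam > threshold:
            over += 1
    return over, total


if __name__ == "__main__":
    # Made-up records purely to show the call; not real wordinfo data.
    sample = [("a", 0, 1), ("b", 300, 0), ("c", 255, 255), ("d", 1, 256)]
    over, total = count_over_threshold(sample)
    print(f"{over} of {total} tokens exceed 255 ({100.0 * over / total:.2f}%)")
```

For my database that works out to about 30 of 33718 tokens, or roughly 0.09%.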
I've trained on 410 spam and 133 ham.
Can anyone else provide some numbers for me? I'm also interested in the total byte
size and type of storage.
In my case, the DB storage of 33718 tokens takes 1,318,912 bytes, or roughly 39 bytes
per token.
--
Brad Clements, bkc at murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
http://www.wecanstopspam.org/ AOL-IM: BKClements