[Spambayes] Frequency distribution for wordinfo counts?

Brad Clements bkc at murkworks.com
Sat Feb 14 15:24:27 EST 2004


I'd like to get feedback from folks on the distribution of nham and nspam counts in their
wordinfo databases.

For example, I used sb_dbexpimp to dump my dbm-based storage, then loaded it into
Excel and did a histogram on nham and nspam.

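Here's my nspam distribution. Each bin is an upper bound (so bin 10 covers counts 6
through 10), and the right-hand columns are the same bins sorted by descending frequency: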


BINS  Frequency  Percent Total  Cumulative %    BINS  Frequency  Cumulative %
0         13272         39.36%        39.36%    1         14834        43.99%
1         14834         43.99%        83.36%    0         13272        83.36%
2          2534          7.52%        90.87%    2          2534        90.87%
3           957          2.84%        93.71%    3           957        93.71%
4           535          1.59%        95.30%    10          655        95.65%
5           310          0.92%        96.22%    4           535        97.24%
10          655          1.94%        98.16%    20          323        98.20%
20          323          0.96%        99.12%    5           310        99.12%
40          166          0.49%        99.61%    40          166        99.61%
80           79          0.23%        99.84%    80           79        99.84%
160          23          0.07%        99.91%    160          23        99.91%
320          23          0.07%        99.98%    320          23        99.98%
640           7          0.02%       100.00%    640           7       100.00%
More          0          0.00%       100.00%    More          0       100.00%


So, for example, 44% of the tokens have an nspam count of exactly 1, i.e. they are spam hapaxes.
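
For anyone who would rather not go through Excel, the binning can be sketched in a few lines of Python. This is only an illustration, not part of Spambayes: `bin_counts` and `read_wordinfo` are hypothetical helper names, the bin edges are copied from the table above, and the CSV layout (a first row holding the global ham and spam totals, then one token,nham,nspam row per token) is an assumption about the sb_dbexpimp dump that should be checked against your own file.

```python
import bisect
import csv

# Upper bin edges copied from the table; larger values fall into "More".
BINS = [0, 1, 2, 3, 4, 5, 10, 20, 40, 80, 160, 320, 640]


def bin_counts(counts, bins=BINS):
    """Return (label, frequency) pairs, where bin b holds values that are
    greater than the previous edge and less than or equal to b."""
    freq = [0] * (len(bins) + 1)
    for c in counts:
        # bisect_left finds the first edge >= c; past the last edge means "More".
        freq[bisect.bisect_left(bins, c)] += 1
    labels = [str(b) for b in bins] + ["More"]
    return list(zip(labels, freq))


def read_wordinfo(path):
    """Yield (token, nham, nspam) tuples from a CSV dump.

    Assumed layout: the first row carries the global totals and is skipped;
    every later row is token,nham,nspam.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the assumed totals row
        for row in reader:
            yield row[0], int(row[1]), int(row[2])


# Small in-memory example, so the sketch runs without a dump file.
sample_nspam = [0, 1, 1, 6, 10, 11, 700]
histogram = dict(bin_counts(sample_nspam))
```

With a real dump, something like `bin_counts(nspam for _, _, nspam in read_wordinfo("wordinfo.csv"))` would give the nspam column, where the file name is again just a placeholder.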

Anyway, what I'm interested in is the number of tokens whose nspam or nham count is
greater than 255, compared with the total number of tokens and the number of ham and
spam messages trained.


In my case, only about 30 tokens (out of 33718) have either an nham or nspam count > 
255.
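
Counting the tokens above a threshold is a single pass over the dump. The sketch below assumes the data is already available as (token, nham, nspam) tuples; `count_over` is a hypothetical helper name and the sample rows are made up purely to show the call.

```python
def count_over(rows, limit=255):
    """Return (n_over, n_total), where n_over is the number of tokens whose
    nham or nspam count is strictly greater than limit."""
    n_over = n_total = 0
    for _token, nham, nspam in rows:
        n_total += 1
        if nham > limit or nspam > limit:
            n_over += 1
    return n_over, n_total


# Made-up rows for illustration only: (token, nham, nspam).
sample_rows = [
    ("alpha", 0, 1),
    ("beta", 300, 2),
    ("gamma", 255, 255),   # exactly 255 does not count as "greater than 255"
    ("delta", 10, 640),
]
n_over, n_total = count_over(sample_rows)
```

Reporting n_over alongside n_total and the ham and spam message totals gives the numbers asked for here.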

I've trained on 410 spam and 133 ham.

Can anyone else provide some numbers for me?  I'm also interested in the total byte 
size and type of storage. 

In my case, the DB storage of 33718 tokens takes 1,318,912 bytes.
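
To compare storage types it may help to normalize the size by token count. A quick calculation with the two figures above comes to roughly 39 bytes per token; the variable names are only illustrative.

```python
db_bytes = 1_318_912   # on-disk size of the DB storage, from the post
n_tokens = 33_718      # number of tokens in the database, from the post

bytes_per_token = db_bytes / n_tokens
print(f"{bytes_per_token:.1f} bytes per token")
```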

-- 
Brad Clements,                bkc at murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
http://www.wecanstopspam.org/                   AOL-IM: BKClements



