[Spambayes] Size of Spam-Bayes DB in user profile

Jeff Epler jepler at unpythonic.net
Wed Jun 23 10:13:27 EDT 2004


On Fri, Jun 18, 2004 at 08:25:37PM -0400, Lewis, Robert wrote:
> My Spam-Bayes DB file is over 10 MB. This is not the only problem with
> using Spam-Bayes with a roaming profile, but it does trouble me. Is
> there a way to reconfigure the product to run on Windows XP without
> loading up my profile. I log on to many systems in my office, so I don't
> want to carry so much overhead.

My database is about 3 megabytes.  It's in "pickle" format.  I think
that "dbm" format is somewhat larger, so this may not be unusual.  My
database has 1477 ham and 2420 spam, and contains 96835 words.  The
database probably contains training data going back to about January of
this year.  A little experimentation indicates that using a compressed
pickle would decrease this by an additional 50% to 70%, to well under a
megabyte.  For some users, this might be a worthwhile feature to pursue,
but I don't spend much time copying my database over the network.
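
As a rough illustration of the compressed-pickle idea, here is a minimal
sketch using only the Python standard library.  It is not SpamBayes's
actual storage code: the function names and the toy word-count dictionary
are assumptions made up for this example, and the savings on a real
database will differ.

```python
import gzip
import pickle


def save_compressed(obj, path):
    """Pickle obj and write it to path through a gzip stream."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Read back an object written by save_compressed()."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


def compare_sizes(obj):
    """Return (plain_pickle_bytes, gzipped_pickle_bytes) for obj."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(raw), len(gzip.compress(raw))


# Hypothetical stand-in for a token database: word -> (spam count, ham count)
toy_db = {"word%d" % i: (i % 7, i % 11) for i in range(5000)}
```

Calling compare_sizes(toy_db) gives a quick feel for the ratio before
committing to a change in file format.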

> I am also troubled by a surprising amount of Spam that gets past
> Spam-Bayes. I receive about 60 Spams/day. Spam-Bayes stops 30,
> suspects another 15, and lets through about 15. I guess it's better
> than nothing, but I was frankly expecting more protection.

My experience is much better than yours.

Since March 7, I've received 76003 messages.  That's something like 700
per day.  I train on errors, and occasionally train on additional ham in
an attempt to equalize the spam and ham counts in the database.

49980 (66%, about 460 per day) were detected as spam, and at most 10 of
those (.02%) were ham that I subsequently identified and retrieved.
This includes messages with virus payloads (detected by spambayes, not
another tool), for which I don't have a separate count.  In the last
52000 messages, I gathered an additional statistic: 29555 (56%) of those
messages scored "1.00 (1)" or higher.

1067 (1.4%, about 10 per day) were tagged as "suspect".  If I had to guess,
I'd say these are at most 1/3 ham and at least 2/3 spam.

The rest (about 25000 messages, 33%) were largely classified correctly.
In most cases, the spam that was not correctly classified had been sent
to mailing lists.  SpamBayes can discover a lot of list-related clues in
messages, and so what would otherwise have been a clear spam or a very
high-scoring suspect message can get its rating down into the "ham"
range.
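
The shares quoted above can be recomputed from the raw counts.  The short
sketch below does only that arithmetic; the variable names are invented
for the example, and the only inputs are the totals given in this message.

```python
# Counts taken from the message; "rest" is whatever was neither
# classified as spam nor tagged as suspect.
total = 76003
spam = 49980
suspect = 1067
rest = total - spam - suspect


def share(count, whole):
    """Percentage of whole represented by count, to one decimal place."""
    return round(100.0 * count / whole, 1)


for label, count in [("spam", spam), ("suspect", suspect), ("rest", rest)]:
    print("%-8s %6d  %5.1f%%" % (label, count, share(count, total)))
```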

I'm a subscriber to several large-volume mailing lists, and also receive
a large number of automated mailings from systems I administer.  At most
1000 of the messages (about 10 per day) are non-spam non-mailing-list
human-originated messages.  SpamBayes misclassified 2 of these messages
(forwarded copies of not-funny jokes), and has sent a handful more to my
"suspect" folder, where they're very easy to find.  These messages
usually contained substantial forwarded content.

Jeff