[Spambayes] Many users on domain coming up as "possibly spam"

Coe, Bob rcoe at CambridgeMA.GOV
Wed Oct 20 14:34:26 CEST 2004

I don't understand why this very simple and well understood problem is so resistant to solution. Since Spambayes is obviously aware of the imbalance (or it couldn't have quoted the numbers in the log), why can't it simply discount each spam token by a factor of 64? (Make it optional so that those of us who don't experience the problem don't have to get involved.)
The most straightforward solution would be to start throwing messages away when a certain imbalance is reached, but I accept that this may be impossible because the data from all messages get munged together. But simply restating the obvious (that equal numbers are better) doesn't seem very helpful, because most users have little or no control over the available corpus of messages. Note that training on misclassified messages is hardly a solution, because if Spambayes is configured correctly, almost all misclassified messages will be false negatives. (OK, that's not true for this user, but he's an exception. Most people reporting an imbalance problem have too many spams.) 


-----Original Message-----
From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org]On Behalf Of Kenny Pitt
Sent: Wednesday, October 13, 2004 4:02 PM
To: 'Mark Vovchuk'; spambayes at python.org
Subject: RE: [Spambayes] Many users on domain coming up as "possibly spam"

Your problem almost certainly lies here:
# ham trained on: 23319
# spam trained on: 370

Based on the imbalance in the number of messages that you have trained, a single spam token will have approximately 63 times as much influence on the overall score as a single ham token.
For best results, you should train on roughly equal numbers of spam and ham messages. An imbalance of 5x to 10x is probably OK for most people, but 63x is definitely pushing the limits. Your best bet is probably to delete your training database and start over from scratch. If you train only by using the toolbar buttons when messages are misclassified, instead of training on a bunch of existing messages up front, you'll probably get better results.
Kenny Pitt
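
To make the 63x figure concrete, here is a minimal Python sketch of a per-token spam probability. It is not copied from the Spambayes source. It assumes a Robinson-style estimate in which each count is normalized by the number of messages trained in its class and then pulled toward a prior of 0.5 with strength 0.45. Both constants and the function name token_spamprob are assumptions for illustration, although these values do reproduce the spamprob column in the clues quoted in this thread.

```python
NHAM = 23319   # ham messages trained on (from the clues in this thread)
NSPAM = 370    # spam messages trained on (from the clues in this thread)


def token_spamprob(ham_count, spam_count, nham=NHAM, nspam=NSPAM,
                   strength=0.45, prior=0.5):
    """Estimate a token's spam probability (illustrative sketch only).

    strength and prior are assumed values, not confirmed settings.
    """
    n = ham_count + spam_count
    if n == 0 or nham == 0 or nspam == 0:
        return prior  # nothing to go on, fall back to the prior
    # Normalize by class size: one spam occurrence counts as 1/nspam,
    # one ham occurrence as 1/nham.
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate toward the prior when evidence is thin.
    return (strength * prior + n * raw) / (strength + n)


# Relative weight of one spam occurrence versus one ham occurrence.
imbalance = NHAM / NSPAM

print(f"imbalance: {imbalance:.1f}x")
print(f"'subject:odd' (1 ham, 0 spam): {token_spamprob(1, 0):.6f}")
print(f"'url:en'      (2 ham, 5 spam): {token_spamprob(2, 5):.6f}")
print(f"seen once in each class:       {token_spamprob(1, 1):.6f}")
```

Under these assumptions, a token seen exactly once in ham and once in spam still leans strongly toward spam, because the single spam occurrence is divided by 370 while the single ham occurrence is divided by 23319.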


From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org] On Behalf Of Mark Vovchuk
Sent: Wednesday, October 13, 2004 3:18 PM
To: spambayes at python.org
Subject: [Spambayes] Many users on domain coming up as "possibly spam"

Including myself. Many people in my organization are coming up as either spam or maybe spam. I have been trying out Spambayes as a way to get off of another product, and this is the last hurdle that I cannot overcome. I have had them keep moving each other's messages, and mine, out using the "recover" button, but to no avail. This is one of the clues messages that someone got for an email I sent:

Combined Score: 69% (0.686078)

Internal ham score (*H*): 0.229281
Internal spam score (*S*): 0.601437

# ham trained on: 23319
# spam trained on: 370

17 Significant Tokens

token                               spamprob         #ham  #spam
'subject:odd'                       0.155172            1      0
'url:105957'                        0.155172            1      0
'url:indymedia'                     0.155172            1      0
'url:sandiego'                      0.155172            1      0
'from:none'                         0.3267           1559     12
'to:addr:rob'                       0.334402          753      6
'message-id:invalid'                0.37662          1565     15
'reply-to:none'                     0.397052        22874    239
'header:To:1'                       0.608344        14607    360
'url:shtml'                         0.694677           55      2
'url:org'                           0.709459          619     24
'to:2**0'                           0.744606         7133    330
'to:no real name:2**0'              0.804451         3722    243
'proto:http'                        0.825724         3963    298
'url:10'                            0.850336           21      2
'url:2004'                          0.858892            9      1
'url:en'                            0.963873            2      5
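
The combined score in these clues appears to follow from the two internal scores. The short Python sketch below assumes the relationship (S - H + 1) / 2, which is consistent with the numbers reported above. The function name combined_score is hypothetical, and the sketch does not show how *H* and *S* themselves are derived from the token probabilities.

```python
def combined_score(ham_score, spam_score):
    """Fold the internal ham (*H*) and spam (*S*) scores into one value.

    Assumed relationship: 0.0 means clearly ham, 1.0 clearly spam,
    and 0.5 means the two kinds of evidence cancel out.
    """
    return (spam_score - ham_score + 1.0) / 2.0


# Internal scores taken from the clues above.
score = combined_score(ham_score=0.229281, spam_score=0.601437)
print(f"Combined Score: {score:.0%} ({score:.6f})")
```

Neither internal score is close to 1.0 here, so the message ends up in the middle of the range rather than clearly on either side.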

