[spambayes-bugs] [ spambayes-Feature Requests-802341 ] Auto-balancing of ham & spam numbers

SourceForge.net noreply at sourceforge.net
Mon Sep 15 08:35:22 EDT 2003


Feature Requests item #802341, was opened at 2003-09-08 09:20
Message generated for change (Comment added) made by rmalayter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=802341&group_id=61702

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Tony Meyer (anadelonbrin)
Assigned to: Tony Meyer (anadelonbrin)
Summary: Auto-balancing of ham & spam numbers

Initial Comment:
From spambayes at python.org



"""

What about adding a feature to the plug-in that would 

could the number of messages in each training folder, 

then use a random subsample of each folder (spam or 

ham) as necessary to create a balanced training corpus?

"""



This seems like a reasonable idea (as an option), and might work better than the experimental imbalance adjustment, which has caused various people difficulties (because they are *very* imbalanced).  What do you think?
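A minimal sketch of the quoted idea, assuming the two corpora are available as simple Python lists of messages (the function name and signature here are illustrative, not part of the SpamBayes API):

```python
import random

def balanced_subsample(ham, spam, rng=random):
    """Return (ham_subset, spam_subset) of equal size by randomly
    downsampling the larger corpus to the size of the smaller one."""
    n = min(len(ham), len(spam))
    return rng.sample(ham, n), rng.sample(spam, n)
```

Passing an explicit `rng` (e.g. a seeded `random.Random`) makes the subsample reproducible between training runs.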

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-15 12:35

Message:
Logged In: YES 
user_id=731834

Since I initially came up with this possible feature on the mailing list, let me add my two cents. I don't think throwing out any "super-spam" is the right approach, since there might be some useful "almost-spam" information in there. A spam might score 100% because it contains 'viagra' and 'lowest' and 'price'; fine, we already know about those tokens. But the same "super-spammy" message might contain a new domain name, or a new word like "silagra" - basically any other information that is useful in the training database.



That said, I think a good algorithm might be based on dates, to make sure the sampling is representative. I suggest looking at the received date of the oldest message in each corpus, and choosing the most recent of these dates. Then we can count all messages from each corpus that are newer than this date, and finally take a random subsample of the messages from the corpus which has "more" new messages. The subsampling can be done on the fly using an RNG; you might get an error of a few messages in each direction, but it won't affect the statistics materially, and it will be easier to implement than keeping track of a bunch of message-ids.



An example of my proposal:

1) Spam corpus: 1342 messages, oldest dated 5/13/2003; ham corpus: 6203 messages, oldest dated 6/19/2002. So we choose our cutoff date to be 5/13/2003.

2) We already know there are 1342 messages in the spam corpus newer than this date. We also count up 2987 messages in the ham corpus newer than this date. So we want to choose 1342/2987 = 44.93% of the messages from the ham corpus newer than 5/13/2003.

3) We tokenize and train on the whole spam corpus. Then we start through the ham corpus, skipping all messages older than 5/13/2003. If we come across a message newer than that, we choose a random number between 0 and 1. If the random number is less than 0.4493, we train with the message. At most we should be off by a few dozen messages from the desired 1342 trained ham.



This method gives us a balanced training set, with representative spam and ham messages from the same time-frame. What do you think?
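A rough sketch of those steps, assuming each corpus is a list of messages, and with `get_date` (returning a message's received date) and `train` (feeding one message to the classifier) standing in for whatever the real interface provides - all names here are hypothetical, not the actual SpamBayes API:

```python
import random

def train_balanced(spam, ham, get_date, train, rng=random):
    """Date-based balanced training, assuming spam is the smaller corpus.

    spam, ham -- lists of messages
    get_date  -- function returning a message's received date
    train     -- function(message, is_spam) that trains the classifier
    """
    # Step 1: cutoff = the more recent of the two corpora's oldest dates.
    cutoff = max(min(get_date(m) for m in spam),
                 min(get_date(m) for m in ham))

    # Step 2: count messages in each corpus newer than the cutoff and
    # compute the keep probability (e.g. 1342/2987 = 0.4493).
    new_spam = [m for m in spam if get_date(m) >= cutoff]
    new_ham = [m for m in ham if get_date(m) >= cutoff]
    keep = len(new_spam) / len(new_ham)

    # Step 3: train on the whole spam side, then stream through the ham
    # side, keeping each new-enough message with probability `keep`.
    for m in new_spam:
        train(m, True)
    trained_ham = 0
    for m in new_ham:
        if rng.random() < keep:
            train(m, False)
            trained_ham += 1
    return trained_ham
```

If ham were the smaller corpus the two roles would simply swap; the on-the-fly RNG test is what avoids tracking message-ids.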



Regards,

   -Ryan-

 







----------------------------------------------------------------------

Comment By: Leonid (leobru)
Date: 2003-09-13 03:02

Message:
Logged In: YES 
user_id=790676

I don't know if it is a generally good idea or not, but I forward everything that scores as 1.00 spam directly to /dev/null (this way there is no way to train on it). This effectively implements the idea "do not train on VERY spammy spam". It works for me; about 80% of all messages (or 90% of all spam) are immediately thrown away, and the ham/spam numbers do not get skewed. Three months, and not a single non-spam mass mailing in my spam box (at worst in "unsure").

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-09-08 13:09

Message:
Logged In: YES 
user_id=14198

This isn't Outlook specific, so you can have it back :)  The big problem I see is *which* ones to choose. Skipping spam may be possible, but skipping a single ham to train on could be a huge problem.

Maybe we could train on all spam, then score all spam, then re-train using only the least spammy spam - but I think the answer to
http://spambayes.sourceforge.net/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x
may be relevant <wink>
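That two-pass idea could be sketched as follows, with `make_classifier` producing an object exposing `train` and `score` methods (a hypothetical interface for illustration, not the real SpamBayes classifier API):

```python
def retrain_on_least_spammy(spam, ham, n_keep, make_classifier):
    """Two-pass training: fit on everything, score the spam with the
    first classifier, then refit keeping only the n_keep least spammy
    spam messages."""
    # Pass 1: train on the full corpora.
    clf = make_classifier()
    for m in spam:
        clf.train(m, True)
    for m in ham:
        clf.train(m, False)

    # Keep the spam the first classifier found least spammy.
    least_spammy = sorted(spam, key=clf.score)[:n_keep]

    # Pass 2: retrain from scratch on the reduced spam set.
    clf = make_classifier()
    for m in least_spammy:
        clf.train(m, True)
    for m in ham:
        clf.train(m, False)
    return clf
```

The cost is training twice over the full corpora, which is presumably part of why the FAQ answer linked above applies.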

----------------------------------------------------------------------
