[Spambayes] SpamBayes feedback

Sun Oct 22 11:08:06 CEST 2006

> [-] I don't like having a "Junk Suspects" folder. To me, this just means
> that I have two spam folders to manage instead of one. As a workaround, I
> have just set both spam and spam suspects to go to the same Junk folder. I
> just made this change, so I don't yet know if this will cause any problems,
> but I don't see why it should.

If you've left the thresholds at the defaults (which later messages
indicate you have), you have effectively set both thresholds to
0.15.  IOW, any message that scores more than 0.15 is treated as
spam.  Any token that hasn't been seen before scores 0.5, so a message
made up of unknown tokens scores much, much higher than your
threshold.  You'll end up with a *lot* of false positives.  If you're
certain you don't want an unsure range, either (a) consider a filter
that isn't designed around one, or (b) consider a single threshold
somewhere around 0.4 to 0.6.
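
To make the threshold point concrete, here is a minimal sketch (not
SpamBayes' own code).  The names classify and destination are made up
for illustration; 0.15 is the unsure threshold discussed above, and
the 0.90 spam threshold is an assumption about the other default.

```python
def classify(score, ham_cutoff=0.15, spam_cutoff=0.90):
    """Three-way split: below ham_cutoff is ham, at or above spam_cutoff
    is spam, anything in between is unsure."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def destination(score, merge_unsure_into_junk):
    """Folder a message lands in (hypothetical folder names).
    Merging sends unsure mail to Junk along with the spam."""
    label = classify(score)
    if label == "ham":
        return "Inbox"
    if label == "unsure" and not merge_unsure_into_junk:
        return "Junk Suspects"
    return "Junk"


# A message made up entirely of never-seen tokens scores 0.5.
print(destination(0.5, merge_unsure_into_junk=False))  # Junk Suspects
print(destination(0.5, merge_unsure_into_junk=True))   # Junk
```

With the two folders merged, everything from 0.15 upwards lands in
Junk, which is the same as running with a single cutoff of 0.15.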

As to whether there should be an unsure range at all:

Most users experience such a low false positive rate that they have  
no need to check messages classified as spam.  This reduces the  
practical “spam workload”, defined as the percentage of messages
needing manual checking, to around 2-5% of the total mail stream.  Any
system
that exhibits a non-trivial false positive rate requires the user to  
check all messages classified as spam to ensure that valuable mail is  
not lost, dramatically reducing the value of the spam filtering  
technology. SpamBayes allows the user to configure the size and
position of the unsure range so that the number of messages
classified as unsure is consistent with the user’s comfort level,
training database and tolerance for false positives.

A remarkable property of chi-combining is that people have generally  
been sympathetic to its ‘unsure’ ratings: people usually agree that  
messages classed unsure really are hard to categorize. For example,  
commercial HTML email from a company you do business with is quite  
likely to score as unsure the first time the classifier sees such a  
message from a particular company. Spam and commercial email both use  
the language and devices of advertising heavily, so it is hard to  
tell them apart.
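
For anyone curious why such messages end up in the middle, here is a
rough sketch of chi-squared combining as I understand the technique.
It is an illustration, not the SpamBayes source: the function names
are made up, and the per-token probabilities are invented inputs that
must lie strictly between 0 and 1.

```python
import math


def chi2q(x2, v):
    """Probability that a chi-squared variable with v degrees of freedom
    (v must be even) is at least x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Evidence for spam and for ham are measured separately ...
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # ... and the score sits in the middle when they cancel out.
    return (spam_evidence - ham_evidence + 1.0) / 2.0


print(chi_combine([0.99] * 5))               # strong spam clues only: close to 1
print(chi_combine([0.5] * 10))               # unknown tokens only: 0.5
print(chi_combine([0.99] * 5 + [0.01] * 5))  # strong clues both ways: close to 0.5
```

The last case is the commercial-email situation: plenty of strong
clues in both directions, so the score lands in the unsure range
instead of being forced to one side.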

SpamBayes users typically experience no false positives; this is not  
from an inherent strength of SpamBayes over similar statistical (or  
other) filters, but as a result of the unsure range.  Essentially,  
the messages that would otherwise have been false positives are  
classified as unsure.  The advantage of this system is that the  
volume of mail that the user must scan to find errors (both false  
positives and false negatives) is greatly reduced; typically between  
one and five percent of messages are classified as unsure, which is  
generally much lower than the percentage of mail that is spam.

As a result, users are more likely to take the time to scan the  
unsure folder than they would be to scan the entire spam folder, more  
able to identify the correct classification (rather than missing a  
false positive in a crowded spam folder) and more likely to  
appropriately train messages therein.  The disadvantage of this  
system is that the percentage of messages that are classified as  
unsure is typically higher than the combined percentage of false  
negative and false positive messages obtained when using a classifier  
that does not include an unsure range.  In simple terms, more  
messages must be manually corrected, but fewer messages must be  
manually examined.
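
A back-of-the-envelope illustration of that trade-off.  All of the
figures are hypothetical, apart from the unsure percentage, which is
picked from the one-to-five percent range above; the size of the spam
folder is approximated by the share of mail that is spam.

```python
total_messages = 1000     # hypothetical mail volume
spam_percent = 60         # hypothetical share of mail that is spam
unsure_percent = 3        # within the 1-5% range quoted above
two_way_error_percent = 1 # hypothetical combined FP + FN rate with no unsure range

# With an unsure range: only the unsure folder is examined, and every
# message in it needs a manual decision.
examined_with_unsure = total_messages * unsure_percent // 100
corrected_with_unsure = examined_with_unsure

# Without one: the whole spam folder is examined to catch false
# positives, but only the misclassified messages need correcting.
examined_without_unsure = total_messages * spam_percent // 100
corrected_without_unsure = total_messages * two_way_error_percent // 100

print(examined_with_unsure, corrected_with_unsure)        # 30 30
print(examined_without_unsure, corrected_without_unsure)  # 600 10
```

Under these assumptions there are three times as many messages to
correct, but twenty times fewer to look at.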

[Stolen, with minor adaptation, from my papers:
http://www.ceas.cc/papers-2004/136.pdf
http://www.massey.ac.nz/~tameyer/research/spambayes/tameyer_trec_2005.pdf]

If you are concerned with possible false positives, then you will  
still scan your spam folder.  However, there's scanning, and there's  
scanning.  If there weren't an unsure range, and the false positive
rate were around 1%, then you'd have to carefully scan the spam
folder.  With a false positive rate close to 0%, you can quickly  
flick through the folder, probably just glancing at senders &  
subjects.  Scanning through the unsure folder, since it contains many  
fewer messages than the spam/ham folders, is quick.

Would you rather spend 20 minutes scanning through the spam folder  
daily, or 5 minutes scanning the spam folder each week and 2 minutes  
scanning the unsure folder each day?
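
Spelled out per week, using only the figures in the question above
(the variable names are just for illustration):

```python
DAYS_PER_WEEK = 7

# No unsure range: 20 minutes in the spam folder every day.
without_unsure = 20 * DAYS_PER_WEEK

# Unsure range: 5 minutes in the spam folder once a week,
# plus 2 minutes in the unsure folder every day.
with_unsure = 5 + 2 * DAYS_PER_WEEK

print(without_unsure, with_unsure)  # 140 19
```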

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.