[Spambayes] SpamBayes feedback
Tony Meyer
tameyer at ihug.co.nz
Sun Oct 22 11:08:06 CEST 2006
> [-] I don't like having a "Junk Suspects" folder. To me, this just means
> that I have two spam folders to manage instead of one. As a work around, I
> have just set both spam and spam suspects to go to the same Junk folder. I
> just made this change, so I don't yet know if this will cause any problems,
> but I don't see why it should.
If you've left the thresholds at the defaults (which later messages
indicate you have), you have effectively set both thresholds to
0.15. IOW, any message that scores more than 0.15 is spam. Any
token that hasn't been seen before scores 0.5, so a message made up
mostly of unfamiliar tokens scores far above your threshold. You'll
end up with a *lot* of false positives. If you're certain you don't
want an unsure range, either (a) consider a filter that isn't designed
around one, or (b) set the threshold somewhere around 0.4 to 0.6.
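
To make the threshold arithmetic concrete, here is a minimal sketch of a
three-way classification. The function name `classify`, the parameter
names, and the 0.90 spam cutoff are assumptions for illustration only;
this is not the actual SpamBayes code.

```python
def classify(score, ham_cutoff=0.15, spam_cutoff=0.90):
    """Return 'ham', 'unsure' or 'spam' for a score between 0 and 1.

    The cutoff values are illustrative assumptions: 0.15 is the figure
    discussed above, 0.90 is an assumed spam threshold.
    """
    if score > spam_cutoff:
        return "spam"
    if score > ham_cutoff:
        return "unsure"
    return "ham"


unknown_message = 0.5  # a message made of tokens never seen before

# With a separate unsure range, the unknown message is held for review.
print(classify(unknown_message))                    # unsure

# With both thresholds collapsed to 0.15, the same message becomes spam.
print(classify(unknown_message, spam_cutoff=0.15))  # spam
```

Collapsing the two cutoffs removes the middle branch entirely, which is
why the position of the single remaining threshold matters so much.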
As to whether there should be one or not:
Most users experience such a low false positive rate that they have
no need to check messages classified as spam. This reduces the
practical “spam workload”, defined as percentage of messages needing
manual checking, to around 2-5% of the total mail stream. Any system
that exhibits a non-trivial false positive rate requires the user to
check all messages classified as spam to ensure that valuable mail is
not lost, dramatically reducing the value of the spam filtering
technology. SpamBayes allows the user to configure the size and
position of the unsure range so that the number of messages
classified as unsure is consistent with the user’s comfort level,
training database and tolerance for the risk of false positives.
A remarkable property of chi-combining is that people have generally
been sympathetic to its ‘unsure’ ratings: people usually agree that
messages classed unsure really are hard to categorize. For example,
commercial HTML email from a company you do business with is quite
likely to score as unsure the first time the classifier sees such a
message from a particular company. Spam and commercial email both use
the language and devices of advertising heavily, so it is hard to
tell them apart.
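
For anyone curious why mixed evidence lands in the middle, here is a
rough sketch of chi-squared combining as I understand it. The function
names `chi2q` and `chi_combine` are mine, and the sketch leaves out
things a real classifier needs (token selection, underflow protection),
so treat it as an illustration under those assumptions rather than the
SpamBayes implementation.

```python
import math


def chi2q(x2, dof):
    """Chi-squared survival function P(X >= x2), valid for even dof only."""
    assert dof % 2 == 0, "this shortcut only works for even degrees of freedom"
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair over 1.0.
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)  # strength of spam evidence
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)      # strength of ham evidence
    return (s - h + 1.0) / 2.0


spammy = chi_combine([0.99] * 10)              # close to 1
hammy = chi_combine([0.01] * 10)               # close to 0
mixed = chi_combine([0.99] * 5 + [0.01] * 5)   # close to 0.5
print(spammy, hammy, mixed)
```

The point of the last case is that strong evidence in both directions
cancels out: both indicators come out high, so the score sits near 0.5
instead of being pushed to one extreme.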
SpamBayes users typically experience no false positives; this is not
from an inherent strength of SpamBayes over similar statistical (or
other) filters, but as a result of the unsure range. Essentially,
the messages that would otherwise have been false positives are
classified as unsure. The advantage of this system is that the
volume of mail that the user must scan to find errors (both false
positives and false negatives) is greatly reduced; typically between
one and five percent of messages are classified as unsure, which is
generally much lower than the percentage of mail that is spam.
As a result, users are more likely to take the time to scan the
unsure folder than they would be to scan the entire spam folder, more
able to identify the correct classification (rather than missing a
false positive in a crowded spam folder) and more likely to
appropriately train messages therein. The disadvantage of this
system is that the percentage of messages that are classified as
unsure is typically higher than the combined percentage of false
negative and false positive messages obtained when using a classifier
that does not include an unsure range. In simple terms, more
messages must be manually corrected, but fewer messages must be
manually examined.
[Stolen, with minor adaption, from my papers:
http://www.ceas.cc/papers-2004/136.pdf
http://www.massey.ac.nz/~tameyer/research/spambayes/tameyer_trec_2005.pdf]
If you are concerned with possible false positives, then you will
still scan your spam folder. However, there's scanning, and there's
scanning. If there weren't an unsure range, and the false positive
rate were around 1%, then you'd have to scan the spam folder
carefully. With a false positive rate close to 0%, you can quickly
flick through the folder, probably just glancing at senders and
subjects. Scanning the unsure folder is quick, since it contains far
fewer messages than the spam or ham folders.
Would you rather spend 20 minutes scanning through the spam folder
daily, or 5 minutes scanning the spam folder each week and 2 minutes
scanning the unsure folder each day?
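
Put on a weekly footing, the comparison looks like this. The minute
figures are just the illustrative ones from the question above, not
measurements, and the variable names are mine.

```python
DAYS_PER_WEEK = 7

# No unsure range: 20 minutes on the spam folder every day.
without_unsure = 20 * DAYS_PER_WEEK

# Unsure range: 5 minutes on the spam folder once a week,
# plus 2 minutes on the unsure folder every day.
with_unsure = 5 + 2 * DAYS_PER_WEEK

print(without_unsure, with_unsure)  # 140 19
```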
=Tony.Meyer
--
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.