[Spambayes] understanding high false negative rate
Jeremy Hylton
jeremy@alum.mit.edu
Fri, 6 Sep 2002 20:00:14 -0400
>>>>> "TP" == Tim Peters <tim.one@comcast.net> writes:
>> The false positive rate is 0-3%. (Finally! I had to scrub a
>> bunch of previously unnoticed spam from my inbox.) Both
>> collections have about 1100 messages.
TP> Does this mean you trained on about 1100 of each?
The total collections are 1100 messages. I trained with 1100/5
messages.
TP> Can't guess. You're in a good position to start adding more
TP> headers into the analysis, though. For example, an easy start
TP> would be to uncomment the header-counting lines in tokenize()
TP> (look for "Anthony"). Likely the most valuable thing it's
TP> missing then is some special parsing and tagging of Received
TP> headers.
I tried the "Anthony" stuff, but it didn't make any appreciable
difference that I could see from staring at the false negative rate.
The numbers are big enough that a quick eyeball suffices.
Then I tried a dirt simple tokenizer for the headers that tokenize the
words in the header and emitted like this "%s: %s" % (hdr, word).
That worked too well :-). The received and date headers helped the
classifier discover that most of my spam is old and most of my ham is
new.
So I tried a slightly more complex one that skipped received, data,
and x-from_, which all contained timestamps. I also skipped the X-VM-
headers that my mail reader added:
class MyTokenizer(Tokenizer):
skip = {'received': 1,
'date': 1,
'x-from_': 1,
}
def tokenize_headers(self, msg):
for k, v in msg.items():
k = k.lower()
if k in self.skip or k.startswith('x-vm'):
continue
for w in subject_word_re.findall(v):
for t in tokenize_word(w):
yield "%s:%s" % (k, t)
This did moderately better. The false negative rate is 7-21% over the
tests performed so far. This is versus 11-28% for the previous test
run that used the timtest header tokenizer.
It's interesting to see that the best descriminators are all ham
discriminators. There's not a single spam-indicator in the list.
Most of the discriminators are header fields. One thing to note is
that the presence of Mailman-generated headers is a strong non-spam
indicator. That matches my intuition: I got an awful lot of
Mailman-generated mail, and those lists are pretty good at surpressing
spam. The other thing is that I get a lot of ham from people who use
XEmacs. That's probably Barry, Guido, Fred, and me :-).
One final note. It looks like many of the false positives are from
people I've never met with questions about Shakespeare. They often
start with stuff like:
> Dear Sir/Madam,
>
> May I please take some of your precious time to ask you to help me to find a
> solution to a problem that is worrying me greatly. I am old science student
I guess that reads a lot like spam :-(.
Jeremy
238 hams & 221 spams
false positive: 2.10084033613
false negative: 9.50226244344
new false positives: []
new false negatives: []
best discriminators:
'x-mailscanner:clean' 671 0.0483425
'x-spam-status:IN_REP_TO' 679 0.01
'delivered-to:skip:s 10' 691 0.0829876
'x-mailer:Lucid' 699 0.01
'x-mailer:XEmacs' 699 0.01
'x-mailer:patch' 699 0.01
'x-mailer:under' 709 0.01
'x-mailscanner:Found' 716 0.0479124
'cc:zope.com' 718 0.01
"i'll" 750 0.01
'references:skip:1 20' 767 0.01
'rossum' 795 0.01
'x-spam-status:skip:S 10' 825 0.01
'van' 850 0.01
'http0:zope' 869 0.01
'email addr:zope' 883 0.01
'from:python.org' 895 0.01
'to:jeremy' 902 0.185401
'zope' 984 0.01
'list-archive:skip:m 10' 1058 0.01
'list-subscribe:skip:m 10' 1058 0.01
'list-unsubscribe:skip:m 10' 1058 0.01
'from:zope.com' 1098 0.01
'return-path:zope.com' 1115 0.01
'wrote:' 1129 0.01
'jeremy' 1150 0.01
'email addr:python' 1257 0.01
'x-mailman-version:2.0.13' 1311 0.01
'x-mailman-version:101270' 1395 0.01
'python' 1401 0.01