[Spambayes] Large false negative ...
Skip Montanaro
skip@pobox.com
Sat, 14 Sep 2002 18:51:35 -0500
>> Here's the set of best discriminators from my latest run:
Tim> Very important: such a list gets printed 5 times during an -n5 run.
Tim> Are you showing the first such list, the last such list, or ...?
Sorry, the last.
>> '8bit%:92' 101 0.99
>> '8bit%:100' 181 0.99
Tim> Do you have a lot of Asian spam?
Oodles and oodles of the crap. In my training set, 189 messages alone
mention "gb2312" in either the subject or as the charset. Of those that
mention a charset, 638 mention something other than ascii or one of the
iso-8859 charsets.
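For what it's worth, a tally like that is easy to reproduce. Here's roughly
how (a sketch, not necessarily how I counted; it assumes the training set is
a Unix mbox and leans on the stdlib mailbox module):

    import mailbox

    def count_charset_mentions(mbox_path, token):
        """Count messages whose Subject or Content-Type header mentions
        'token' (e.g. "gb2312"), case-insensitively."""
        token = token.lower()
        n = 0
        for msg in mailbox.mbox(mbox_path):
            subject = (msg.get("Subject") or "").lower()
            ctype = (msg.get("Content-Type") or "").lower()
            if token in subject or token in ctype:
                n += 1
        return n

    if __name__ == "__main__":
        print(count_charset_mentions("spam.mbox", "gb2312"))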
>> 'x-mailer:none' 96 0.481519
>> 'header:Subject:1' 104 0.497099
>> 'content-type:text/plain' 118 0.405034
>> 'header:To:1' 126 0.495146
>> 'header:Message-Id:1' 146 0.60187
Tim> It's as if it's not finding a message *body* in these cases (else
Tim> it would find *something* better than the mere presence of a "To:"
Tim> line to pick on!). It's also curious that the presence of a
Tim> Message-Id line is a (weak) spam indicator in your data.
Don't know what to make of that. There are some spams with message bodies
consisting of just one or two words.
>> 'from:email addr:aol.com' 135 0.0148768
Tim> You may have the world's only corpus where email from AOL is a Good
Tim> Thing <wink>.
Yes, many of the CEDU and Musi-Cal concert site correspondents are AOLers.
>> 'charset:gb2312' 111 0.99
Tim> Fascinating!
>> 'subject:GB2312' 97 0.99
Tim> Weird.
>> 'idaho' 100 0.01
Tim> Disturbing <wink>.
Tim> The lack of 'url:gif' and 'url:remove' as spam indicators is surprising to
Tim> me.
Tim> The top 4:
>> 'url:cedu' 570 0.01
>> 'email name:cedu' 600 0.01
>> 'subject:cedu' 685 0.01
>> 'cedu' 694 0.01
Yup, there are a lot of CEDU messages in my ham corpus.
>> * There are very few 0.99's. I would have thought some of the
>> obvious spam tripwords would have made it into the set.
Tim> Me too. It's possible that you have a lot of spam of the form
Tim> Anthony warned about:
Tim> multipart/alternative
Tim> text/html
Tim> The real spam is here.
Tim> text/plain
Tim> Something innocuous is here.
Tim> In that case, the tokenizer ignores the text/html part, and only
Tim> looks at the text/plain part.
Maybe the relative size of the two parts could be used to decide if the
text/html section should be retained? Or only emit words from the text/html
section if they don't occur in the text/plain section? (Just thinking out
loud here.)
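To make the second idea concrete, something along these lines is what I have
in mind (a sketch only; the helper names are invented, not the tokenizer's
actual API):

    def tokens_from_alternative(plain_text, html_text, tokenize):
        """Emit the text/plain tokens as usual, plus any token that appears
        only in the text/html sibling of a multipart/alternative message.
        'tokenize' stands in for whatever word-splitter the tokenizer uses."""
        plain_tokens = set(tokenize(plain_text))
        for tok in plain_tokens:
            yield tok
        for tok in set(tokenize(html_text)) - plain_tokens:
            # Words the HTML part adds beyond the innocuous plain part;
            # in the pattern Tim describes, that's where the real spam lives.
            yield tok

The relative-size idea would have the same shape: compare the lengths of the
two parts before deciding whether the text/html section gets dropped.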
Tim> You could do a gross test to see whether this is important in your
Tim> data by changing the last line of tokenizer.textparts() from
Tim> return text - redundant_html
Tim> to
Tim> return text + redundant_html
I will.
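One nit while I'm at it: if text and redundant_html are set-like objects (I'm
guessing from the subtraction), then "adding" the redundant HTML back really
means a union. A toy illustration of the two behaviours, with the message
parts reduced to plain labels:

    def textparts_experiment():
        """Toy illustration only; the real textparts() walks an email message
        tree, but plain strings make the set arithmetic easy to see."""
        text = {"text/plain A", "text/html A", "text/plain B"}
        redundant_html = {"text/html A"}    # html halves of multipart/alternative

        stock = text - redundant_html       # current behaviour: drop them
        experiment = text | redundant_html  # gross test: keep everything
        return stock, experiment

    print(textparts_experiment())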
>> I have a ton of unread CEDU mail I dumped into the training set,
Tim> If you haven't read it, how do you know whether it's ham or spam?
It's in my CEDU mailbox, put there by procmail based upon seeing a
Sender: cedu-admin@manatee.mojam.com
header. The CEDU list is a closed (only subscribers can post) list. Also,
I'm the list moderator, so even though I haven't read every message, I have
at least scanned the headers.
Tim> Can you set
Tim> show_false_positives = False
Tim> show_false_negatives = True
Tim> show_charlimit = 500000
Tim> and stick full output from a run on the web somewhere?
I'll give it a go.
Skip