[Spambayes] Large false negative ...

Skip Montanaro skip@pobox.com
Sat, 14 Sep 2002 18:51:35 -0500


    >> Here's the set of best discriminators from my latest run:

    Tim> Very important: such a list gets printed 5 times during an -n5 run.
    Tim> Are you showing the first such list, the last such list, or ...? 

Sorry, the last.

    >> '8bit%:92' 101 0.99
    >> '8bit%:100' 181 0.99

    Tim> Do you have a lot of Asian spam?

Oodles and oodles of the crap.  In my training set alone, 189 messages
mention "gb2312" in either the subject or as the charset.  Of those that
mention a charset at all, 638 mention something other than ascii or one of
the iso-8859 charsets.
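
(If anyone wants to reproduce that sort of count, something like this rough
sketch would do it -- the mbox filename is made up, so adjust to taste:)

    import mailbox

    gb2312 = other = 0
    for msg in mailbox.mbox('spam.mbox'):        # hypothetical filename
        subject = (msg.get('subject') or '').lower()
        # get_charsets() yields one entry per MIME part; drop the Nones.
        charsets = [c.lower() for c in msg.get_charsets() if c]
        if 'gb2312' in subject or 'gb2312' in charsets:
            gb2312 += 1
        if any(c != 'us-ascii' and not c.startswith('iso-8859')
               for c in charsets):
            other += 1

    print(gb2312, 'mention gb2312;', other,
          'mention some other non-ascii, non-iso-8859 charset')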

    >> 'x-mailer:none' 96 0.481519
    >> 'header:Subject:1' 104 0.497099
    >> 'content-type:text/plain' 118 0.405034
    >> 'header:To:1' 126 0.495146
    >> 'header:Message-Id:1' 146 0.60187

    Tim> It's as if it's not finding a message *body* in these cases (else
    Tim> it would find *something* better than the mere presence of a "To:"
    Tim> line to pick on!).  It's also curious that the presence of a
    Tim> Message-Id line is a (weak) spam indicator in your data.

Don't know what to make of that.  There are some spams with message bodies
consisting of just one or two words.

    >> 'from:email addr:aol.com' 135 0.0148768

    Tim> You may have the world's only corpus where email from AOL is a Good
    Tim> Thing <wink>.

Yes, many of the CEDU and Musi-Cal concert site correspondents are AOLers.

    >> 'charset:gb2312' 111 0.99

    Tim> Fascinating!

    >> 'subject:GB2312' 97 0.99

    Tim> Weird.

    >> 'idaho' 100 0.01

    Tim> Disturbing <wink>.

    Tim> The lack of 'url:gif' and 'url:remove' as spam indicators is surprising to
    Tim> me.

    Tim> The top 4:

    >> 'url:cedu' 570 0.01
    >> 'email name:cedu' 600 0.01
    >> 'subject:cedu' 685 0.01
    >> 'cedu' 694 0.01

Yup, there are a lot of CEDU messages in my ham corpus.

    >> * There are very few 0.99's.  I would have thought some of the
    >> obvious spam tripwords would have made it into the set.

    Tim> Me too.  It's possible that you have a lot of spam of the form
    Tim> Anthony warned about:

    Tim>     multipart/alternative
    Tim>         text/html
    Tim>             The real spam is here.
    Tim>         text/plain
    Tim>             Something innocuous is here.

    Tim> In that case, the tokenizer ignores the text/html part, and only
    Tim> looks at the text/plain part. 

Maybe the relative size of the two parts could be used to decide if the
text/html section should be retained?  Or only emit words from the text/html
section if they don't occur in the text/plain section?  (Just thinking out
loud here.)
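
Very roughly, I'm picturing something like this -- split_words() and the
5x threshold are stand-ins, not anything the tokenizer actually does:

    import re

    def split_words(text):
        # Stand-in for the tokenizer's real word splitting.
        return set(re.findall(r"[\w$'.-]+", text))

    def novel_html_words(plain_text, html_text):
        # Idea 2: only emit words from the text/html part that the
        # text/plain sibling doesn't already contain.
        return split_words(html_text) - split_words(plain_text)

    def keep_html_part(plain_text, html_text, ratio=5):
        # Idea 1: keep the text/html part when it's much bigger than its
        # text/plain sibling -- a hint the real content is hiding there.
        return len(html_text) > ratio * len(plain_text)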

    Tim> You could do a gross test to see whether this is important in your
    Tim> data by changing the last line of tokenizer.textparts() from

    Tim>     return text - redundant_html

    Tim> to

    Tim>     return text + redundant_html

I will.
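
For my own benefit, here's a toy picture of what I understand the change to
do; this is not the real textparts() code, just an illustration of the shape
of it:

    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    msg = MIMEMultipart('alternative')
    msg.attach(MIMEText('Something innocuous is here.', 'plain'))
    msg.attach(MIMEText('<p>The real spam is here.</p>', 'html'))

    text = [p for p in msg.walk() if p.get_content_maintype() == 'text']
    redundant_html = [p for p in text
                      if p.get_content_type() == 'text/html']

    # "text - redundant_html": only the innocuous text/plain part survives.
    minus = [p for p in text if p not in redundant_html]
    # "text + redundant_html": the HTML alternative gets tokenized as well
    # (it shows up twice here, but for a gross test that hardly matters).
    plus = text + redundant_html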

    >> I have a ton of unread CEDU mail I dumped into the training set,

    Tim> If you haven't read it, how do you know whether it's ham or spam?

It's in my CEDU mailbox, put there by procmail based upon seeing a 

    Sender: cedu-admin@manatee.mojam.com

header.  The CEDU list is a closed (only subscribers can post) list.  Also,
I'm the list moderator, so even though I haven't read every message, I have
at least scanned the headers.
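
(If I wanted to be paranoid about it, a sketch like this -- mailbox path
invented -- would confirm that everything procmail filed there really does
carry that Sender header:)

    import mailbox

    LIST_SENDER = 'cedu-admin@manatee.mojam.com'
    for msg in mailbox.mbox('CEDU'):             # mailbox path hypothetical
        sender = (msg.get('sender') or '').strip()
        if LIST_SENDER not in sender:
            print('odd message:', msg.get('subject'))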

    Tim> Can you set

    Tim>     show_false_positives = False
    Tim>     show_false_negatives = True
    Tim>     show_charlimit = 500000

    Tim> and stick full output from a run on the web somewhere?  

I'll give it a go.

Skip