[spambayes-dev] problems locating messages with bigrams

Skip Montanaro skip at pobox.com
Tue Jan 6 13:21:11 EST 2004


After adding bigram generation (which bloats the reverse map pickle to a
gargantuan size, btw), I used extractmessages.py to locate messages
containing a strongly hammy bigram, "bi:nov 2003" (prob 0.043), which seemed
odd to me and which contributed to a false negative in my test database.
It turns out to be common in mailing list digests, like so:

    ...
    ------------------------------

    Date:    Sun, 23 Nov 2003 16:08:45 -0500
    From:    Alan Rowoth <alanrowoth at MINDSPRING.COM>
    Subject: misdirected postings

    ...

where each new section of an RFC 934 digest begins with a few headers.
Those headers are treated as message body.  Accordingly, if you train on
such digests as ham, you get a flurry of unigrams and bigrams which would be
avoided if those lines were in the actual message headers.  Does the email
Parser do the right thing with MIME digests?  Maybe it needs to be trained to
recognize RFC 934 digests (or I need to remove most digests from my ham
database).
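
For illustration, here is a minimal sketch of how the sections of such a
digest might be split apart so that each section's headers are parsed as
headers rather than body text.  It is not part of spambayes or the email
package: split_digest and the boundary pattern are hypothetical, and the
pattern assumes the digest separates sections with a line of at least ten
dashes, as in the excerpt above.  RFC 934 itself allows looser boundaries.
Only email.message_from_string from the standard library is relied on.

```python
import email
import re

# Assumption: a section boundary is a line made of 10 or more dashes.
# RFC 934 permits other boundary forms; this only covers the common case.
BOUNDARY = re.compile(r"^-{10,}[ \t]*$", re.MULTILINE)


def split_digest(body):
    """Return one Message per digest section that starts with headers.

    Sections without a leading header block (the digest preamble and
    trailer, typically) are skipped.
    """
    parts = []
    for chunk in BOUNDARY.split(body):
        if not chunk.strip():
            continue
        # Drop the blank lines that follow the boundary so the parser
        # sees the section's header lines first.
        msg = email.message_from_string(chunk.lstrip("\n"))
        if msg.keys():
            parts.append(msg)
    return parts
```

Each returned Message could then be tokenized like an ordinary message, so
that a line such as "Date: Sun, 23 Nov 2003 ..." is treated as a header.
Whether that actually reduces the false negatives would need testing.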

Another apparently strongly hammy token (prob 0.092) had me confused for a
bit.  When I ran extractmessages.py to identify the messages containing
'bi:skip:w 20 skip:w 10', only two hams and two spams turned up.  That
should have resulted in a spamprob close to 0.5, not 0.1.  I eventually
figured out that the way I generate bigrams:

    for t in Classifier()._enhance_wordstream(tokenize(msg)):
        ...

uses the current training database to decide which tokens should be
generated: the leading and trailing unigrams, or the bigram of the two.  Not
all possible bigrams are generated.  I can change that easily enough.  Of
course, that will bloat the pickle file even further, and will not really
improve the chances of identifying the actual messages which contribute to
the score of a given message.  I guess generating bigram info to use with
cross-validation results will be an approximation to reality at best.
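
To make the two points above concrete, here is a small sketch under stated
assumptions.  all_bigrams is a hypothetical stand-in for exhaustive
generation: it emits every unigram and every adjacent pair, without
consulting any training database.  estimate_spamprob is a Robinson-style
estimate with a Bayesian adjustment; the parameter values s=0.45 and x=0.5
and the equal ham and spam corpus sizes are assumptions, not values taken
from my database.

```python
def all_bigrams(tokens):
    """Yield every unigram plus every adjacent-pair bigram, unconditionally."""
    last = None
    for tok in tokens:
        yield tok
        if last is not None:
            # Same "bi:" prefix convention as the tokens quoted above.
            yield "bi:%s %s" % (last, tok)
        last = tok


def estimate_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Robinson-style estimate; s and x are assumed parameter values."""
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (spamratio + hamratio)
    n = spamcount + hamcount
    # Bayesian adjustment: pull the estimate toward x when n is small.
    return (s * x + n * prob) / (s + n)


# Two hams and two spams out of equally sized corpora (sizes assumed).
p = estimate_spamprob(spamcount=2, hamcount=2, nspam=1000, nham=1000)
```

With equal corpus sizes, p comes out at about 0.5, which is why a reported
0.092 pointed to a mismatch between the tokens counted during training and
the tokens written to the reverse map.  If the ham and spam corpora differ
much in size, the estimate shifts away from 0.5 even for a 2/2 split.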

Any suggestions for improving this situation?

Skip


