[Spambayes] RE Spam

skip at pobox.com
Wed May 24 12:48:41 CEST 2006


    Amedee> However, English is not my mother language and most of my
    Amedee> correspondence is in Dutch.  As a consequence, most common
    Amedee> English words are quite uncommon for me. The result is that
    Amedee> common English words will score a bit above 0.5. Perhaps not
    Amedee> much, but enough to be significant after a while.

Thanks, I didn't realize that.  Do you have an example in your training
database you can share with us (both message and word scores) where you
think the English disclaimer text has tipped the scales and caused a ham
message to later be scored as spam?  If you simply train on one or two of
those misclassified hams does the problem go away?  How skewed is your
training database (number of spams vs number of hams)?  Have you considered
throwing out your current training database and starting fresh?  

One thing that might help is to further break messages which score as spam
into "low" and "high" spam.  Based on my current settings that gives me
these four categories:

    ham         0.00-0.14
    unsure      0.15-0.59
    low spam    0.60-0.74
    high spam   0.75-1.00

High spam is tossed without further consideration.  Ham is sorted in the
appropriate mailbox by procmail.  Unsure and low spam messages each wind up
in their own mailboxes for further consideration.  I train on most unsure
messages but only train on low spam messages which are actually ham.

My suspicion is that if you have ham messages which are erroneously winding
up as spam they are at the very low end of the spam scale.  It might be
sufficient to move your spam threshold up a bit so they are more likely to
land in the unsure category.
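
As a rough sketch of the four-way split described above, and of what happens
when the spam threshold is nudged up, something like the following would do.
The function name and the parameter names are made up for illustration; they
are not SpamBayes options, and the default cutoffs are simply the ones from my
table.

```python
def bucket(score, ham_cutoff=0.15, spam_cutoff=0.60, high_spam_cutoff=0.75):
    """Map a spam probability in [0.0, 1.0] to one of four categories.

    Hypothetical helper: the names and default cutoffs are assumptions
    taken from the table in this message, not part of any real API.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < ham_cutoff:
        return "ham"
    if score < spam_cutoff:
        return "unsure"
    if score < high_spam_cutoff:
        return "low spam"
    return "high spam"


# A ham that scores just over the spam threshold lands in "low spam" ...
print(bucket(0.62))
# ... but moving the threshold up a bit puts it back in "unsure".
print(bucket(0.62, spam_cutoff=0.65))
```

Keeping the cutoffs as parameters makes it easy to try a slightly higher spam
threshold against a handful of known misclassified hams before changing the
real configuration.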

Skip

