[Spambayes] RE Spam

skip at pobox.com
Tue May 23 19:04:16 CEST 2006


    Amedee> I have noticed that a lot of spam contains disclaimer-ish text.
    Amedee> If I train spambayes with "disclaimed" ham, I fear this will
    Amedee> "pollute" the sb database.  The result might be that any email
    Amedee> with a disclaimer-ish text will get a relatively high ham score.
    Amedee> At the moment, I don't see a solution for this possible problem.
    Amedee> I *could* not train on disclaimed ham, but if most of my
    Amedee> correspondents have such boilerplates, training spambayes won't
    Amedee> be very efficient.

That depends.  Most common English words (and most of the words in
disclaimers are probably pretty common) should score around 0.5 and thus not
be used in ranking messages, e.g.:

    spamcounts the only which that disclaimer property
    token,nspam,nham,spam prob
    the,3591,844,0.5
    only,782,267,0.5
    which,893,232,0.5
    that,2111,424,0.5
    disclaimer,2,1,0.352062362221
    property,184,50,0.5
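
To make the "not used in ranking" part concrete, here is a rough sketch (not
SpamBayes's actual code) that reads the table above and keeps only the tokens
whose spam probability is far enough from 0.5 to count as evidence.  The 0.1
cutoff and the names MIN_STRENGTH and significant_tokens are assumptions made
up for illustration.

```python
import csv
import io

# The spamcounts output shown above, copied verbatim.
SPAMCOUNTS_OUTPUT = """\
token,nspam,nham,spam prob
the,3591,844,0.5
only,782,267,0.5
which,893,232,0.5
that,2111,424,0.5
disclaimer,2,1,0.352062362221
property,184,50,0.5
"""

# Assumed cutoff: a token must be at least this far from 0.5 to be used.
MIN_STRENGTH = 0.1


def significant_tokens(text, min_strength=MIN_STRENGTH):
    """Return (token, spam_prob) pairs that are not close to neutral."""
    kept = []
    for row in csv.DictReader(io.StringIO(text)):
        prob = float(row["spam prob"])
        if abs(prob - 0.5) >= min_strength:
            kept.append((row["token"], prob))
    return kept


if __name__ == "__main__":
    print(significant_tokens(SPAMCOUNTS_OUTPUT))
```

With this cutoff only "disclaimer" survives from the six tokens above, and it
leans mildly toward ham; the five common words drop out entirely.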

After you subtract all the common words, it depends on what's left worth
using.  The approach SpamBayes uses is purely probabilistic (is
"statistical" more accurate?).  The score of any given message is based on
the "preponderance of evidence" contained in the non-trivial tokens the
message contains (or which SB synthesizes).
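
One way to picture "preponderance of evidence" is a chi-squared (Fisher-style)
combination of the per-token spam probabilities.  The sketch below is a
simplified illustration of that idea, not the SpamBayes classifier itself; the
function names chi2q and combined_score are hypothetical, and it assumes the
neutral tokens have already been dropped and every probability lies strictly
between 0 and 1.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-squared variable with an even number of
    degrees of freedom `dof` is at least `x2`."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combined_score(probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Near 1.0 means the evidence points to spam, near 0.0 to ham, and
    around 0.5 means no evidence or evidence that cancels out.
    """
    if not probs:
        return 0.5  # no usable tokens: stay neutral
    n = len(probs)
    # Spam indicator: close to 1 when many probabilities are near 1.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Ham indicator: close to 1 when many probabilities are near 0.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


if __name__ == "__main__":
    print(combined_score([0.99, 0.98, 0.95]))   # strongly spammy tokens
    print(combined_score([0.352062362221]))     # "disclaimer" on its own
    print(combined_score([0.01, 0.02, 0.05]))   # strongly hammy tokens
```

The point of combining this way is that one mildly hammy token such as
"disclaimer" moves the score only a little, while several strong tokens on
the same side push it close to 0 or 1.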

Skip

