[Spambayes] RE Spam
tameyer at ihug.co.nz
Wed May 24 12:26:35 CEST 2006
>>> I have noticed that a lot of spam contains disclaimer-ish text.
>>> If I train spambayes with "disclaimed" ham, I fear this will
>>> "pollute" the sb database. The result might be that any email
>>> with a disclaimer-ish text will get a relatively high ham score.
>> That depends. Most common English words (most of the words in
>> disclaimers are probably pretty common) should probably score around
>> 0.5 and thus not be used in ranking messages, e.g.:
> However, English is not my mother language and most of my mail
> is in Dutch.
> As a consequence, most common English words are quite uncommon for
> me. The result is that common English words will score a bit above
> 0.5. Not much, but enough to be significant after a while.
Note that they have to be above 0.6 (or below 0.4) before they are
used, and even then only the 150 strongest tokens are used, so in a
longer message tokens have to be fairly strong to count.
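
To make that selection rule concrete, here is a rough Python sketch. It
is not the SpamBayes source: the function name select_clues and its
parameter names are made up for illustration, and it assumes the cutoff
is symmetric (a token counts once its score is at least 0.1 away from
0.5 in either direction), with the 150-token cap taken from the
paragraph above.

```python
def select_clues(token_probs, min_strength=0.1, max_clues=150):
    """Pick the tokens that would count toward a message's score.

    token_probs maps each token to its spam probability (0.0 to 1.0).
    min_strength and max_clues are illustrative names for the
    "0.1 away from 0.5" cutoff and the 150-token cap.
    """
    # Drop tokens whose score is too close to the neutral 0.5.
    strong = [(tok, p) for tok, p in token_probs.items()
              if abs(p - 0.5) >= min_strength]
    # Strongest clues first, i.e. furthest from 0.5 in either direction.
    strong.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    # Keep only the strongest max_clues tokens.
    return strong[:max_clues]
```

With scores like {"the": 0.52, "disclaimer": 0.55, "viagra": 0.99}, only
"viagra" survives the cutoff, so a word sitting a bit above 0.5 never
reaches the scoring step.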
In any case, if you train on both ham and spam with disclaimers, then
the disclaimer tokens' scores will remain around 0.5, and so they will
have no effect. If the only messages you received with disclaimers were
spam, then they would be spammy clues, but that would be good (and
vice-versa).
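
A small worked example of why balanced training keeps a token near 0.5.
This is a sketch of a Robinson-style smoothed estimate, not the
SpamBayes code itself; the function name token_spam_prob is made up,
and the prior (0.5) and prior strength (0.45) are assumed values chosen
for illustration.

```python
def token_spam_prob(spam_count, ham_count, n_spam, n_ham,
                    prior=0.5, prior_strength=0.45):
    """Estimate a token's spam probability from training counts.

    spam_count / ham_count: trained messages of each kind containing
    the token. n_spam / n_ham: total trained messages of each kind.
    prior and prior_strength are assumed smoothing values.
    """
    # Fraction of spam and of ham messages that contain the token.
    spam_ratio = spam_count / n_spam if n_spam else 0.0
    ham_ratio = ham_count / n_ham if n_ham else 0.0
    if spam_ratio + ham_ratio == 0:
        return prior  # never-seen token: fall back to the neutral prior
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward the prior when evidence is thin.
    n = spam_count + ham_count
    return (prior_strength * prior + n * raw) / (prior_strength + n)


# Disclaimer token seen equally often in ham and spam: stays at 0.5.
balanced = token_spam_prob(20, 20, 100, 100)
# Disclaimer token seen only in spam: becomes a strong spam clue.
spam_only = token_spam_prob(20, 0, 100, 100)
```

Under these assumptions the balanced case comes out at 0.5 and is
ignored, while the spam-only case lands close to 1.0.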
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.