[Spambayes] Back to language issue (long)

Tim Peters tim_one at email.msn.com
Sat Mar 29 20:55:36 EST 2003

[Tim Stone]
> How interesting.  I wonder if a weakness of spambayes is to
> include a bunch of gibberish tokens that would almost surely not
> be in someone's database, which would tend to drive the spamprob
> strongly towards unknown prob, which is .5 by
> default...  (not that French is gibberish <wink>)  - TimS

That won't work:  an unknown word has, as you say, spamprob 0.5 by default,
and all words with spamprob in (.4, .6) are simply ignored by default.  They
don't affect the score at all.  In Francois's case, it seems clear that he
simply hasn't gotten (trained on) many French renditions of the Nigerian
scam, but has gotten (trained on) significant numbers of French ham.  So
even vanilla French words (like quelque) have strong ham scores for him.  So
long as it remains true that he gets very few French Nigerian scams, they'll
continue to score as ham -- but then, by supposition, they are in fact rare,
so nothing to get excited about.  If French renditions of this spam become
common, the very low spamprobs of common French words will rise toward 0.5
(and so common French words will be ignored), and the spamprobs of telltale
French words will get much spammier, and the system will nail French spam.
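
A rough sketch of the two points above, in Python.  Every name here
(UNKNOWN_WORD_PROB, IGNORE_LOW, IGNORE_HIGH, raw_spamprob,
significant_clues) is made up for illustration and is not taken from the
spambayes source, and all counts and probabilities are invented.  The
count ratio is a deliberately simplified estimate; the real classifier
applies further adjustments before scoring.

```python
# Illustrative sketch only: names and numbers are assumptions, not spambayes code.
UNKNOWN_WORD_PROB = 0.5              # spamprob given to a never-seen token
IGNORE_LOW, IGNORE_HIGH = 0.4, 0.6   # spamprobs strictly inside this range are ignored


def raw_spamprob(spam_count, ham_count, nspam, nham):
    """Simplified spamprob from per-corpus frequencies (no further adjustment)."""
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    if spam_ratio + ham_ratio == 0:
        return UNKNOWN_WORD_PROB     # token never seen in training
    return spam_ratio / (spam_ratio + ham_ratio)


def significant_clues(tokens, spamprobs):
    """Return the sorted (token, spamprob) pairs that would affect the score."""
    clues = []
    for tok in set(tokens):
        p = spamprobs.get(tok, UNKNOWN_WORD_PROB)
        if not (IGNORE_LOW < p < IGNORE_HIGH):
            clues.append((tok, p))
    return sorted(clues)


# Invented training state: lots of French ham, no French spam yet.
db = {
    "quelque": raw_spamprob(spam_count=0, ham_count=40, nspam=1000, nham=1000),
    "million": 0.97,
}

plain = significant_clues(["quelque", "million"], db)
padded = significant_clues(["quelque", "million", "xqzt", "blorf"], db)

# Once French spam is as common as French ham, "quelque" drifts toward 0.5.
later = raw_spamprob(spam_count=35, ham_count=40, nspam=1000, nham=1000)
```

With these invented numbers, padding a message with unknown tokens leaves
the clue list unchanged, since each unknown token lands at 0.5 and is
dropped.  The same common French word that starts as a strong ham clue
falls inside the ignored range once it shows up about equally often in spam.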

