[Spambayes] My first non-personal personal false positive

Tim Peters tim.one@comcast.net
Sun Nov 10 08:36:10 2002

[Tim, asks for help on a Spanish Unsure]

[François Granger]
> Here are the most probable English equivalents of the Spanish words:
> 'using', 'page', 'have', 'click', 'much', 'but', 'know', 'with',
> 'good', 'this', 'Hi', 'that', 'here', 'the', 'for'
> This illustrates the need for properly balanced training sets and
> re-raises the question of language discrimination.

It really doesn't raise it for me:  this was in my personal email, and since
I couldn't read the msg anyway, it may as well have been spam.  I get way
too much email to bother more than 2 seconds with something I can't read.  I
only looked at this one because I'm paying heavy attention to everything the
automatic classifier calls spam.  If I weren't using this system, I would
have thrown out that msg at once.

If I were someone who got any quantity of Spanish ham, the system would have
scored it as ham.  As is, the only Spanish I get is in Spanish spam, so the
system correctly judged it for my personal email mix.

> At least prior language discrimination would allow for a different
> database for each language

Whether that would improve results is a testable hypothesis; I've already
said I doubt it would be helpful, and have no motivation to try such an
experiment myself.

> or for a systematic "unsure" flag for untrained languages.

But I *do* train on Spanish -- and Russian, and Turkish, and Chinese, and
Japanese, and German, and French, and Polish (at least):  in my email mix,
they're all used in spam, aren't used in my ham, and are spam to me because
they're unreadable by me.

> If you put my messages in a Ham training set, you will flag French
> as ham because of my French sig ;-)

Nope, the system isn't that stupid (or, rather, it is <wink>).  What it will
do is knock down the spamprobs of those words.  Despite that I've got French
spam in my training data, your msg here -- including the French sig -- got a
solid ham score, with H=1 (to six significant digits) and S=1.1e-11.  The
strongest spam word in fact came from your sig, spamprob('est')=0.84.  It
didn't matter, because I could actually read most of what you wrote, and it
wasn't trying to sell me Viagra <wink>.
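
A minimal sketch of how one strong spam clue gets outvoted, assuming the H
and S figures come from chi-squared combining of the per-word spamprobs.
The function names are made up for illustration, and every clue value other
than 0.84 is hypothetical; this is not the Spambayes source.

```python
import math

def chi2q(x2, dof):
    """Upper tail of the chi-squared distribution; dof must be even."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)

def combine(probs):
    """Return (H, S, score) for spamprobs strictly between 0 and 1."""
    n = len(probs)
    # S is large only when many clues are close to 1.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # H is large only when many clues are close to 0.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return h, s, (s - h + 1.0) / 2.0

# One spammy clue (0.84, as for 'est') among twenty hypothetical hammy ones.
h, s, score = combine([0.84] + [0.05] * 20)
print(h, s, score)
```

With a clue mix like that, H comes out very close to 1 and S very close to
0, so the lone 0.84 barely moves the final score.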

> All these words should rate around 0.5 since they are among the
> most common ones in this language.

If I got any French ham, they would rate around 0.5, but for my personal
email it's Just Fine that they're considered spam words.  It wouldn't be OK
for python.org use, but python.org gets a non-trivial amount of non-English
ham, so it trains there accordingly.
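
A rough sketch of why the training mix decides this, assuming a per-word
spamprob computed from spam and ham counts and then pulled toward 0.5 when
evidence is thin. The function signature, the default strength s=0.45 and
prior x=0.5, and all the counts below are assumptions for illustration.

```python
def spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Estimate a word's spamprob from how often it was seen in each class.

    spam_count, ham_count: trained messages of each class containing the word
    nspam, nham: total trained spam and ham messages
    s, x: strength and value of the prior for rarely seen words (assumed)
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never seen: fall back to the prior
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Blend the raw estimate with the prior, weighted by the evidence count.
    return (s * x + n * raw) / (s + n)

# A common French word seen only in spam (a personal mix with no French ham):
print(spamprob(5, 0, 1000, 1000))
# The same word seen equally often in ham (a mix that includes French ham):
print(spamprob(5, 5, 1000, 1000))
```

In the first case the word is a strong spam clue; in the second it lands at
0.5 and stops carrying any information, which is the behavior François
asks for, but only for a mailbox that actually receives French ham.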

> Le courrier est un moyen de communication. Les gens devraient
> se poser des questions sur les implications politiques des choix (ou non
> choix) de leurs outils et technologies. Pour des courriers propres
> <http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneL=

Indeed <wink>.

