[Spambayes] multiple languages

Skip Montanaro skip at pobox.com
Thu May 29 10:49:53 EDT 2003


    Alex> 99% of my ham is in Swedish.
    Alex> 99% of my spam is in English.

    Alex> Because of this I get quite a number of false negatives written in
    Alex> Swedish and false positives written in english.

I believe it's been discussed a bit here, though not recently.  In theory,
if you train on enough Swedish spam and English ham, the classifier
should gather enough information to significantly reduce the FP/FN problem.
Since the preponderance of your ham is Swedish it's going to take a fair bit
of Swedish spam to offset that.  For instance, in reality most of the words
in your Swedish emails (all the common stuff) shouldn't be considered hammy
or spammy, but since you get very little Swedish spam essentially every
Swedish word is considered hammy.  As you train on more and more Swedish
spam, the common Swedish words will become much weaker ham indicators,
leaving the uncommon words used in Swedish ham and Swedish spam as
classifiers.  The inverse will be true for English.
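
Here is a toy sketch of that effect.  It is not Spambayes code: the
ToyClassifier class, the made-up messages, and the smoothing constants
(s=0.45, x=0.5) are all assumptions chosen for illustration.  It trains on
Swedish ham and English spam only, scores the common word "och", then adds
some Swedish spam and scores it again.

```python
from collections import Counter

class ToyClassifier:
    """Hypothetical stand-in for a word-count classifier (not Spambayes)."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.spam = Counter()   # word -> number of spams containing it
        self.ham = Counter()    # word -> number of hams containing it

    def train(self, text, is_spam):
        words = set(text.lower().split())   # count each word once per message
        if is_spam:
            self.nspam += 1
            self.spam.update(words)
        else:
            self.nham += 1
            self.ham.update(words)

    def score(self, word, s=0.45, x=0.5):
        """Near 0 is hammy, near 1 is spammy; s and x are assumed constants."""
        sc, hc = self.spam[word], self.ham[word]
        n = sc + hc
        if n == 0:
            return x                        # never seen: fall back to the prior
        spamratio = sc / self.nspam if self.nspam else 0.0
        hamratio = hc / self.nham if self.nham else 0.0
        raw = spamratio / (spamratio + hamratio)
        return (s * x + n * raw) / (s + n)  # pull low-count words toward x

clf = ToyClassifier()
for _ in range(20):
    clf.train("hej och tack", is_spam=False)              # Swedish ham
    clf.train("buy cheap pills and save", is_spam=True)   # English spam
before = clf.score("och")

for _ in range(10):
    clf.train("billiga piller och spara nu", is_spam=True)  # Swedish spam
after = clf.score("och")
rare = clf.score("piller")

print("och before: %.3f  och after: %.3f  piller: %.3f" % (before, after, rare))
```

With these made-up messages "och" starts out almost purely hammy and drifts
toward the middle once Swedish spam shows up, while "piller", which only
ever appears in spam, stays a strong spam clue.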

Let me see if I can demonstrate using words from my own database.
"information" is neither strongly hammy nor spammy:

    >>> db["saved state"]
    (5, 8165, 12315)            # 8165 spam, 12315 ham
    >>> db["information"]
    (1335, 1030)                # appears in 1335 spams, 1030 hams

while "viagra" clearly is spammy:

    >>> db["viagra"]
    (63, 2)

For you, "information" is probably a fairly spammy word (unless it's also a
Swedish word).
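
To make the counts above a bit more concrete, here is a rough sketch of how
a per-word score might be derived from them.  It uses a Robinson-style
smoothed ratio; the function name word_score and the constants S and X are
my assumptions for illustration, not values read out of the database.

```python
# Sketch only: per-word spam score from raw counts, Robinson-style.
S = 0.45   # assumed weight given to the prior
X = 0.5    # assumed score for a word that has never been seen

def word_score(spamcount, hamcount, nspam, nham):
    """Return a score in [0, 1]; near 1 is spammy, near 0 is hammy."""
    n = spamcount + hamcount
    if n == 0:
        return X
    spamratio = spamcount / nspam      # fraction of spams containing the word
    hamratio = hamcount / nham         # fraction of hams containing the word
    raw = spamratio / (spamratio + hamratio)
    return (S * X + n * raw) / (S + n) # low counts get pulled toward X

NSPAM, NHAM = 8165, 12315              # totals from db["saved state"]
counts = {"information": (1335, 1030), "viagra": (63, 2)}
scores = {w: word_score(s, h, NSPAM, NHAM) for w, (s, h) in counts.items()}
for word, p in scores.items():
    print("%-12s %.3f" % (word, p))
```

Under those assumptions "information" lands well short of either extreme,
while "viagra" comes out close to 1.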

Essentially all the Spanish email I receive is spam, so common Spanish words
are relatively strong spam indicators for me:

    >>> db["todos"]
    (35, 0)
    >>> db["nosotros"]
    (5, 0)

Still, because I don't get very much Spanish spam, the raw numbers are
rather small.  The same thing is happening to your English ham: almost all
common English words look spammy.

I'm not sure there's an easy way out of this.  If you've saved all your
training messages you can try deleting a bunch (maybe 75%) of the Swedish
ham and English spam from your database and retraining on the remaining
messages.  Then, starting from that point, only train on the mistakes
(messages which are completely misclassified or wind up marked "unsure").
This probably won't improve things immediately, but it should make it easier
for Swedish spam or English ham to begin to tip the scales.
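
The thinning step could be sketched roughly like this.  Everything here is
hypothetical: Spambayes does not tag messages by language, so the
(text, is_spam, lang) tuples assume you have labelled them yourself, and the
0.20/0.90 cutoffs in needs_training are assumed values, not read from any
configuration.

```python
import random

def thin_training_set(messages, keep_fraction=0.25, seed=0):
    """Keep only keep_fraction of the Swedish ham and of the English spam.

    messages: list of (text, is_spam, lang) tuples with hand-assigned lang.
    All other messages (Swedish spam, English ham) are kept untouched.
    """
    rng = random.Random(seed)
    sv_ham, en_spam, other = [], [], []
    for msg in messages:
        _text, is_spam, lang = msg
        if lang == "sv" and not is_spam:
            sv_ham.append(msg)
        elif lang == "en" and is_spam:
            en_spam.append(msg)
        else:
            other.append(msg)
    kept = list(other)
    for pool in (sv_ham, en_spam):
        kept.extend(rng.sample(pool, round(len(pool) * keep_fraction)))
    return kept

def needs_training(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90):
    """True if the message was misclassified or landed in the unsure band."""
    if is_spam:
        return score < spam_cutoff
    return score >= ham_cutoff

# Made-up corpus: lots of Swedish ham and English spam, little of the rest.
messages = (
    [("sv ham %d" % i, False, "sv") for i in range(100)]
    + [("en spam %d" % i, True, "en") for i in range(100)]
    + [("sv spam %d" % i, True, "sv") for i in range(4)]
    + [("en ham %d" % i, False, "en") for i in range(4)]
)
kept = thin_training_set(messages)
print(len(messages), "->", len(kept))
```

After retraining from scratch on the thinned list, you would feed the
classifier only those new messages for which needs_training() returns true.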

It would be quite helpful if you could try this (or other schemes) out and
let us know what - if anything - works for you.  As you pointed out, as
Spambayes gains more bilingual users this will probably become more of a
problem.  Having some semi-proven (or at least tried) techniques will be
helpful.

(I'm sure Tim Peters can state this all much more eloquently.  Perhaps he'll
step in and clear the air.)

Skip


