[Spambayes] multiple languages

T. Alexander Popiel popiel at wolfskeep.com
Thu May 29 08:56:15 EDT 2003


In message:  <20030529111420.GA584 at matijek>
             Alex Polite <m2 at plusseven.com> writes:
>I've been using spambayes in conjunction with procmail, mutt and
>fetchmail for some time. I'm very happy with this setup.
>
>There's one glitch though.
>
>99% of my ham is in Swedish.
>99% of my spam is in English.
>
>Because of this I get quite a number of false negatives written in
>Swedish and false positives written in English.
>
>Has this problem already been brought to your attention? I assume that
>it affects many European users.

The problem was mentioned briefly in theory, but not really
actively discussed (to the best of my knowledge).  I think the
best way to handle it from the algorithm point of view is to
have three classifiers: one to distinguish Swedish from English,
then one each to distinguish ham from spam in each language.
This may help the false positive and false negative rates... but at the cost
of much more complicated training and maintenance.
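
A rough sketch of the three-classifier idea, in Python.  This is
not the spambayes classifier or its API: TinyBayes is a toy naive
Bayes invented for illustration, and the LanguageCascade name, the
"sv"/"en" labels, and the sample messages are all assumptions.  It
only shows the routing: one classifier picks the language, then
the ham/spam classifier for that language makes the call.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes over word presence (illustration only)."""

    def __init__(self, labels):
        self.labels = tuple(labels)
        self.words = {label: Counter() for label in self.labels}
        self.docs = Counter()

    @staticmethod
    def tokens(text):
        # Each distinct lowercase word counts once per message.
        return set(re.findall(r"\w+", text.lower()))

    def train(self, text, label):
        self.docs[label] += 1
        self.words[label].update(self.tokens(text))

    def classify(self, text):
        total = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label in self.labels:
            # Smoothed log prior plus smoothed per-word log likelihoods.
            score = math.log((self.docs[label] + 1) / (total + len(self.labels)))
            for word in self.tokens(text):
                score += math.log(
                    (self.words[label][word] + 1) / (self.docs[label] + 2)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class LanguageCascade:
    """One language classifier in front of one ham/spam classifier per language."""

    def __init__(self, languages=("sv", "en")):
        self.language = TinyBayes(languages)
        self.per_language = {
            lang: TinyBayes(("ham", "spam")) for lang in languages
        }

    def train(self, text, lang, label):
        # The user has to supply both labels: this is the extra training burden.
        self.language.train(text, lang)
        self.per_language[lang].train(text, label)

    def classify(self, text):
        lang = self.language.classify(text)
        return lang, self.per_language[lang].classify(text)


cascade = LanguageCascade()
cascade.train("hej kan vi ses i morgon", "sv", "ham")
cascade.train("köp billiga piller nu", "sv", "spam")
cascade.train("see you at the meeting tomorrow", "en", "ham")
cascade.train("buy cheap pills now", "en", "spam")

print(cascade.classify("billiga piller"))
print(cascade.classify("see you tomorrow"))
```

Note that train() needs two labels per message (language and
ham/spam), which is exactly where the extra maintenance comes from.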

Actually, the more that I think about it, the less I like the
above suggestion.  It puts too much of a burden on the user,
who has to maintain multiple classification types... and a major
goal of spambayes is to make stuff simple.  Blah.

Heck, if the amount of ham/spam in each language is very out
of balance, the above trick might not help anyway, just due
to that imbalance.
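
To put numbers on that imbalance, here is a small back-of-envelope
calculation in Python.  The 99% figures come from the original
message; the corpus size of 1000 ham and 1000 spam is purely a
made-up assumption, and split_by_language is a hypothetical helper,
not anything in spambayes.

```python
def split_by_language(n_ham, n_spam, ham_swedish_frac, spam_english_frac):
    """Return per-language training counts as {lang: (ham, spam)}."""
    ham_sv = round(n_ham * ham_swedish_frac)
    spam_en = round(n_spam * spam_english_frac)
    return {
        "sv": (ham_sv, n_spam - spam_en),
        "en": (n_ham - ham_sv, spam_en),
    }


# Hypothetical corpus: 1000 ham and 1000 spam, split 99%/1% by language.
counts = split_by_language(1000, 1000, 0.99, 0.99)
for lang, (ham, spam) in counts.items():
    print(f"{lang}: {ham} ham, {spam} spam")
```

Under those assumptions the Swedish ham/spam classifier would see
990 ham but only 10 spam, and the English one the reverse, so each
per-language classifier has very little of the minority class to
learn from.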

Ugly.  Nasty problem.  Yuck.

- (another) Alex


