[Spambayes] multiple languages
Francois Granger
francois.granger at free.fr
Thu May 29 18:34:39 EDT 2003
(Sorry, I mishandled the To: field for this list!)
At 09:49 -0500 on 29/05/2003, in message Re: [Spambayes] multiple
languages, Skip Montanaro wrote:
> Alex> 99% of my ham is in Swedish.
> Alex> 99% of my spam is in English.
>
> Alex> Because of this I get quite a number of false negatives written in
> Alex> Swedish and false positives written in English.
>
>I believe it's been discussed a bit here, though not recently.
I raised the issue here a long time ago and did not get a really good
answer from Tim.
>I'm not sure there's an easy way out of this. If you've saved all your
>training messages you can try deleting a bunch (maybe 75%) of the Swedish
>ham and English spam from your database and retrain on the remaining
>messages. Then starting from that point, only train on the mistakes
>(messages which are completely misclassified or wind up marked "unsure").
>This probably won't improve things immediately, but it should make it easier
>for Swedish spam or English ham to begin to tip the scales.
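To make the pruning step suggested above concrete, here is a rough sketch
in plain Python. It does not use any SpamBayes API. The tuple layout
(label, language, text), the function name thin_training_set and the
language codes are all hypothetical, chosen only for illustration. It
keeps about 25% of the two dominant groups and all of the rare messages,
which would then be the set to retrain on.

```python
import random

def thin_training_set(messages, dominant, keep_fraction=0.25, seed=0):
    """Keep only keep_fraction of the dominant groups; keep everything else.

    messages: list of (label, language, text) tuples (hypothetical layout)
    dominant: set of (label, language) pairs to thin out
    """
    rng = random.Random(seed)
    # Rare combinations (e.g. Swedish spam, English ham) are kept in full.
    kept = [m for m in messages if (m[0], m[1]) not in dominant]
    for group in sorted(dominant):
        members = [m for m in messages if (m[0], m[1]) == group]
        n_keep = round(len(members) * keep_fraction)
        kept.extend(rng.sample(members, n_keep))
    return kept

# Made-up corpus: 100 Swedish ham, 100 English spam, one of each rare kind.
messages = (
    [("ham", "sv", f"ham {i}") for i in range(100)]
    + [("spam", "en", f"spam {i}") for i in range(100)]
    + [("ham", "en", "english ham"), ("spam", "sv", "swedish spam")]
)
thinned = thin_training_set(messages, {("ham", "sv"), ("spam", "en")})
```

With this made-up corpus the thinned set holds 25 Swedish ham, 25 English
spam and both rare messages.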
I am French and have a similar problem to the one described here. In
addition, I get occasional Spanish and Portuguese spam. I am using the
Pop3proxy version.
I have been using various versions of SpamBayes since Sept 2002.
My current database was created on 1 Feb 2003. I trained on some 100
messages to start with, then trained mostly on unsure and
misclassified messages. I kept an eye on the ham/spam balance, and
tried to add some English ham to the training set whenever I trained
on an English unsure message as spam, and likewise for the other
combinations. I have now trained on Spam: 639 Ham: 486.
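The routine described above can be sketched in a few lines of Python.
This is only an illustration, not SpamBayes code: the function names are
hypothetical, and the cutoffs 0.20 and 0.90 are assumed here for the
example, so check them against your own configuration. A message is worth
training on when its classification differs from its true label, which
covers both "unsure" and outright mistakes. The ratio helper is one
simple way to watch the ham/spam balance.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability to 'ham', 'unsure' or 'spam' (assumed cutoffs)."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

def needs_training(score, true_label):
    """True for unsure or misclassified messages, False for correct ones."""
    return classify(score) != true_label

def imbalance(n_spam, n_ham):
    """Ratio of the larger training count to the smaller one (1.0 = balanced)."""
    smaller = min(n_spam, n_ham)
    if smaller == 0:
        return float("inf")
    return max(n_spam, n_ham) / smaller

# Counts taken from the message above: Spam: 639 Ham: 486.
ratio = imbalance(639, 486)
```

For the counts above the ratio comes out at roughly 1.3, so the two
classes are still fairly close to each other.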
The success rate has been astonishing for a long time. I get only a few
unsure messages and no misclassified messages in either language.
--
Hofstadter's Law :
It always takes longer than you expect, even when you take into
account Hofstadter's Law.