[spambayes-bugs] [ spambayes-Feature Requests-1000427 ] non-English spam; localized filters

SourceForge.net noreply at sourceforge.net
Sat May 20 13:45:17 CEST 2006


Feature Requests item #1000427, was opened at 2004-07-29 17:07
Message generated for change (Comment added) made by seier
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1000427&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Michael Engel (mkengel)
Assigned to: Nobody/Anonymous (nobody)
Summary: non-English spam; localized filters

Initial Comment:

How should spam be handled in a mixture of English and
non-English mail*? It seems that such messages pass the
filters easily.
* in my case English/German/French and Japanese

Solution idea: localized filters, one after the other;
should be possible to choose upon installation


----------------------------------------------------------------------

Comment By: Christian Blackburn (seier)
Date: 2006-05-20 04:45

Message:
Logged In: YES 
user_id=561770

Hi Gang,

I think it's very important to be able to detect spam by
the language it is written in.  However, I think that during
installation the user should be asked which language(s) they
speak, and any message that is not in one of their chosen
languages and did not originate from a friend (someone in
their address book) should be deleted.  If it is from a
known sender, it would be awesome if a reply were sent
reminding that person that you only speak Swahili (obviously,
just an example) and that all messages must be sent in
that language.

Thanks,
Christian Blackburn

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-10-05 07:13

Message:
Logged In: YES 
user_id=529503

I'm working on a non-English / multilingual tokenizer.
See patch #824651.

* This isn't compatible with the original SpamBayes.


----------------------------------------------------------------------

Comment By: Michael Engel (mkengel)
Date: 2004-08-09 00:04

Message:
Logged In: YES 
user_id=780774

Thank you for the comments.

I have waited a little bit to see if the training on German
spam had an effect.
It did: after a total of 4 weeks, SpamBayes now detects
these messages as spam (0.44; my cutoff line is 0.35).

Probably there were not enough messages in German and
French at first for SpamBayes to see the difference.



----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2004-08-02 23:42

Message:
Logged In: YES 
user_id=552329

Is your ham also mixed language?  With
English/German/French, SpamBayes doesn't care about the
language and will just learn each word as good/bad, so
should work fine (with appropriate training).  Have you
trained on these sorts of spam?  Attaching the clues for a
misclassified message would give more insight into this.
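
To make the "learn each word as good/bad" point concrete, here is a
minimal sketch of language-agnostic per-token counting. It is a
simplified illustration, not SpamBayes' actual scoring code: the
class name TokenCounts, the method names, and the plain frequency
ratio are all assumptions made for the example.

```python
from collections import Counter


class TokenCounts:
    """Toy per-token spam/ham counter (hypothetical, not the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam messages containing it
        self.ham = Counter()    # token -> number of ham messages containing it
        self.nspam = 0
        self.nham = 0

    def train(self, text, is_spam):
        # Tokens are just whitespace-separated words; no language is assumed.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def spam_ratio(self, token):
        # Fraction of the token's (normalised) occurrences that were in spam.
        s = self.spam[token] / self.nspam if self.nspam else 0.0
        h = self.ham[token] / self.nham if self.nham else 0.0
        if s + h == 0:
            return 0.5          # unseen token: no evidence either way
        return s / (s + h)


counts = TokenCounts()
counts.train("billige pillen jetzt kaufen", is_spam=True)
counts.train("cheap pills buy now", is_spam=True)
counts.train("treffen morgen im büro", is_spam=False)
counts.train("meeting tomorrow at the office", is_spam=False)
```

German and English tokens end up in the same tables, so a German
spam word only looks spammy once some German spam has been trained.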

The Japanese is more difficult, because SpamBayes creates
tokens by (mostly) splitting on whitespace, and this isn't
how Asian languages work (we would get sentence tokens, I
think).  It's unlikely that we will ever handle this well,
and the best solution would be to have someone (willing to
do all the work) create a forked project that has a
different tokeniser, customised for Asian languages.
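
The whitespace problem can be seen in a small sketch. The two
functions below are hypothetical helpers written for this example,
not part of SpamBayes; overlapping character bigrams are shown only
as one common workaround for scripts written without spaces, not as
the approach any particular patch takes.

```python
def whitespace_tokens(text):
    """Split on whitespace, as a simple word-based tokenizer would."""
    return text.split()


def char_bigrams(text):
    """Overlapping two-character tokens, ignoring any whitespace."""
    compact = "".join(text.split())
    if len(compact) < 2:
        return [compact] if compact else []
    return [compact[i:i + 2] for i in range(len(compact) - 1)]


english = "cheap pills online"
japanese = "今すぐ無料で登録"   # sample text with no spaces between words

print(whitespace_tokens(english))    # three separate word tokens
print(whitespace_tokens(japanese))   # the whole phrase as a single token
print(char_bigrams(japanese))        # short overlapping tokens instead
```

With whitespace splitting, the Japanese phrase becomes one long token
that is unlikely to recur in another message, so there is little for
the classifier to learn from.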

----------------------------------------------------------------------
