[Spambayes-checkins] website faq.txt,1.45,1.46
Skip Montanaro
montanaro at users.sourceforge.net
Wed Oct 1 10:07:41 EDT 2003
- Previous message: [Spambayes-checkins] spambayes/spambayes/resources help.gif, NONE,
1.1 help_gif.py, NONE, 1.1 .cvsignore, 1.1, 1.2
- Next message: [Spambayes-checkins] website faq.txt,1.46,1.47 quotes.ht,1.6,1.7
- Messages sorted by:
[ date ]
[ thread ]
[ subject ]
[ author ]
Update of /cvsroot/spambayes/website
In directory sc8-pr-cvs1:/tmp/cvs-serv20957
Modified Files:
faq.txt
Log Message:
add a note about using SB with non-English languages
Index: faq.txt
===================================================================
RCS file: /cvsroot/spambayes/website/faq.txt,v
retrieving revision 1.45
retrieving revision 1.46
diff -C2 -d -r1.45 -r1.46
*** faq.txt 25 Sep 2003 13:13:08 -0000 1.45
--- faq.txt 1 Oct 2003 14:07:37 -0000 1.46
***************
*** 588,591 ****
--- 588,618 ----
+ Does SpamBayes work with non-English languages?
+ -----------------------------------------------
+
+ SpamBayes was developed by English-speaking people and has therefore had
+ very little testing with other languages. There are some anecdotal reports
+ that it doesn't work as well with Western European languages. It might work
+ very well with them if the following default values are changed in the
+ user's ini file:
+
+ [Tokenizer]
+ replace_nonascii_chars: True
+ skip_max_word_size: 12
+
+ The first setting causes all non-ASCII characters to be replaced by a
+ question mark. For non-English languages the setting should probably be
+ False. The second setting causes all words longer than 12 characters to
+ yield a "skip: X NNN" token instead of the word itself, where X is the first
+ letter of the word and NNN is the word length. For languages like German,
+ this can be especially troublesome, because an inordinate number of words
+ will yield tokens like "skip: ? 17" because they are long and start with an
+ accented character.
+
+ Asian languages will be particularly troublesome. The SpamBayes tokenizer
+ splits the message into whitespace-separated tokens. Many Asian languages
+ (Chinese and Japanese, for example) don't separate "words" with whitespace,
+ so the entire body of a message will generate little other than
+ "skip: ? NNN" tokens.
+
How do I train SpamBayes (web method)?
--------------------------------------
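
The behaviour described in the new FAQ entry can be illustrated with a small
Python sketch. This is not the SpamBayes tokenizer: the function name
sketch_tokenize is hypothetical, the token format simply follows the
"skip: X NNN" wording of the FAQ text, and the code works on Python 3 text
strings for convenience. Only the two option names and their defaults come
from the entry above.

```python
import re


def sketch_tokenize(text, replace_nonascii_chars=True, skip_max_word_size=12):
    """Toy model of the two [Tokenizer] options described in the FAQ entry.

    Hypothetical helper for illustration only; the real tokenizer does
    considerably more than this.
    """
    if replace_nonascii_chars:
        # Every character outside the ASCII range becomes a question mark.
        text = re.sub(r"[^\x00-\x7f]", "?", text)

    tokens = []
    # Words are whatever lies between runs of whitespace.
    for word in text.split():
        if len(word) > skip_max_word_size:
            # Long words collapse to first character plus length.
            tokens.append("skip: %s %d" % (word[0], len(word)))
        else:
            tokens.append(word)
    return tokens


# An 18-letter German word starting with an accented character (U+00DC).
german = "\u00dcberraschungsparty heute"

# With the defaults the long word is reduced to a "skip" token.
print(sketch_tokenize(german))

# With the replacement turned off and a larger size limit (20 is an
# arbitrary value chosen for this example), the word survives intact.
print(sketch_tokenize(german, replace_nonascii_chars=False,
                      skip_max_word_size=20))

# A Japanese sentence with no whitespace is seen as one long "word".
japanese = "これはスパムではありません"
print(sketch_tokenize(japanese))
```

In this sketch the German word is kept whole only when both settings are
changed together, and the Japanese sentence collapses into a single skip
token under the defaults, which matches the problem the FAQ entry describes.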