[spambayes-bugs] [ spambayes-Patches-824651 ] Japanese (and/or
other CJK languages) message support
SourceForge.net
noreply at sourceforge.net
Sat Nov 29 03:17:32 EST 2003
Patches item #824651, was opened at 2003-10-16 17:23
Message generated for change (Comment added) made by hatukanezumi
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=824651&group_id=61702
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Hatuka*nezumi (hatukanezumi)
Assigned to: Nobody/Anonymous (nobody)
Summary: Japanese (and/or other CJK languages) message support
Initial Comment:
This may also be applicable to other East Asian languages.
o Unicode'ify text:
For Japanese messages, for example, RFC 1468 recommends
the ISO/IEC 2022 encoding scheme, with ASCII and a
multibyte character set both designated to GL. The
original tokenizer generates only meaningless
text fragments for such messages.
o Concatenate C/J lines.
In Japanese (and maybe Chinese) messages, line folding
often breaks 'words'.
o Bigram of C/J characters.
In Japanese (and often Chinese) messages, 'words' are
not separated by characters such as whitespace.
Tokenizing into grammatical 'words' would require
heuristic algorithms backed by a large corpus.
Instead of an expensive human-language parser, generate
bigrams from each run of kanji (ideographs for C/J/K) or run of
hiragana & katakana (syllabic letters for J).
N.B.:
- I believe the number of database items is roughly O(n^2)
for bigrams, O(n^3) for trigrams, ... and O(n^i) for
i-grams, where n is the size of the character set in use.
For katakana & hiragana, n is approximately 100. For kanji
it is approx. 5000 (KS X 1001), 7000 (JIS X 0208), or more
(Chinese standards). For C/J messages, 3-grams or longer
would generate a very large and sparse database.
- Words of a single kanji should not be discarded by the
tokenizer, since most basic kanji words are 1 or 2
characters long.
Words of a single hiragana/katakana may be discarded.
- As far as I know, in Korean messages, phrases (not 'words',
but similar) are often separated by whitespace, so a
run of hangul (syllabic characters for K) need not be split
into n-grams.
o Punctuation --- what is 'punctuation'? Many
punctuation marks, spaces, signs and symbols registered in
the Unicode Standard are added to punctuation_run_re (for
compatibility, some of them overlap with
subject_words_re), since many of them are also
registered as punctuation or symbols in C/J/K
character set standards.
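The decoding, line-joining and bigram steps described above can be sketched roughly as follows. This is not the code from the patch: the function names (decode_body, join_folded_lines, cj_bigrams), the exact Unicode ranges, and the choice to treat hiragana and katakana as one class are assumptions made for illustration, and it is written for Python 3 rather than the Python 2 of the time.

```python
import re

# Assumed BMP-only ranges (surrogates are not handled, as noted in the patch).
KANJI = "\u4e00-\u9fff"               # CJK Unified Ideographs
KANA = "\u3040-\u309f\u30a0-\u30ff"   # hiragana + katakana blocks
CJ = KANJI + KANA

# A line break with a C/J character on both sides is treated as folding.
FOLD_RE = re.compile("(?<=[%s])\r?\n(?=[%s])" % (CJ, CJ))
# A run is either all kanji or all kana (assumption: kana form one class).
RUN_RE = re.compile("[%s]+|[%s]+" % (KANJI, KANA))


def decode_body(raw, charset="iso-2022-jp"):
    """Decode raw message bytes to a Unicode string."""
    return raw.decode(charset, errors="replace")


def join_folded_lines(text):
    """Remove line breaks that split a run of C/J characters."""
    return FOLD_RE.sub("", text)


def cj_bigrams(text):
    """Return overlapping bigrams for each kanji run and each kana run."""
    tokens = []
    for match in RUN_RE.finditer(text):
        run = match.group()
        if len(run) == 1:
            # Keep single-kanji words; drop single hiragana/katakana.
            if "\u4e00" <= run <= "\u9fff":
                tokens.append(run)
            continue
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens
```

With these definitions, cj_bigrams(join_folded_lines(decode_body(raw))) would turn a folded ISO-2022-JP body into bigram tokens; hangul runs are deliberately left out of RUN_RE, in line with the note on Korean above.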
Problems:
o sb_dbexpimp.py becomes incompatible.
o Only the BMP range is supported. Surrogates are not
recognized.
o Tested with Japanese messages only, not with other
East Asian messages.
o No batch tests. This only aims at Japanese support.
Configuration:
o To enable Unicode support, set the following in .spambayesrc:
[Tokenizer]
replace_nonascii_chars: False
----------------------------------------------------------------------
>Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-29 17:17
Message:
Logged In: YES
user_id=529503
o hammie.py / sb_filter.py / sb_xmlrpcserver.py:
- clues in the X-Spambayes-Evidence: header are now
MIME header encoded.
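As a rough illustration of what MIME header encoding of the clues means, the standard library's email.header module can do it. The helper name encode_evidence and the "word: probability" layout of the header value are assumptions for this sketch, not the format the patch actually produces.

```python
from email.header import Header, decode_header, make_header


def encode_evidence(clues, charset="utf-8"):
    """Build an RFC 2047 encoded header value from (word, prob) pairs.

    The "word: prob; word: prob" layout is a hypothetical format.
    """
    value = "; ".join("%s: %.2f" % (word, prob) for word, prob in clues)
    return Header(value, charset).encode()


def decode_evidence(encoded):
    """Decode the encoded header value back to a Unicode string."""
    return str(make_header(decode_header(encoded)))
```

The encoded value contains only ASCII characters, so non-ASCII tokens such as kanji bigrams can be carried in the header without breaking mail transport.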
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-26 17:31
Message:
Logged In: YES
user_id=529503
server patch 1.0a7-0.6
o Dibbler performs HTTP charset conversion
(to/from internal UTF-8).
o New configuration option: [html_ui] http_charset
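The charset conversion between the internal UTF-8 form and the charset served over HTTP might look roughly like this. The function names to_http and from_http are hypothetical, and the error-handling choices are assumptions; the patch's Dibbler changes may work differently.

```python
def to_http(data_utf8, http_charset):
    """Convert internal UTF-8 bytes to the charset sent to the browser.

    Characters the target charset cannot represent are emitted as
    numeric character references (an assumed policy for HTML output).
    """
    text = data_utf8.decode("utf-8")
    return text.encode(http_charset, errors="xmlcharrefreplace")


def from_http(data, http_charset):
    """Convert bytes received from the browser back to internal UTF-8."""
    return data.decode(http_charset, errors="replace").encode("utf-8")
```

Here http_charset stands for the value of the [html_ui] http_charset option, for example "euc-jp" or "utf-8".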
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2003-11-26 09:13
Message:
Logged In: YES
user_id=552329
Added the sb_dbexpimp.py patch (v1.3). Will look at the
rest, shortly - thanks for your patience!
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-11 21:01
Message:
Logged In: YES
user_id=529503
o db_expimp.py is incompatible again. It exports / imports data
as UTF-8.
o Unicode'ified sb_server.py.
- HTTP charset is UTF-8.
- clues in X-Spambayes-Evidence will be MIME header
encoded.
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-29 19:29
Message:
Logged In: YES
user_id=529503
OK. I'll keep testing the code until it is added.
Minor fix: the 'replace_nonascii_chars' option now works correctly, etc.
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2003-10-21 12:52
Message:
Logged In: YES
user_id=552329
Just a wee note to say thanks for this, and that someone will
get to looking at adding this in, but everyone's pretty busy
with other stuff at the moment!
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-19 19:02
Message:
Logged In: YES
user_id=529503
Fix for Korean messages.
Hangul phrases/words can be 1 or 2 chars long.
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-17 18:19
Message:
Logged In: YES
user_id=529503
minor fix.
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-16 21:52
Message:
Logged In: YES
user_id=529503
> ISO/IEC 2022 encoding scheme, with ASCII and
> multibyte character set both designated to GL,
Not 'designate'. 'Invoke' is correct.
----------------------------------------------------------------------