[spambayes-bugs] [ spambayes-Patches-824651 ] Japanese (and so on) message support
SourceForge.net
noreply at sourceforge.net
Fri Oct 17 05:19:20 EDT 2003
Patches item #824651, was opened at 2003-10-16 17:23
Message generated for change (Comment added) made by hatukanezumi
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=824651&group_id=61702
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Hatuka*nezumi (hatukanezumi)
Assigned to: Nobody/Anonymous (nobody)
Summary: Japanese (and so on) message support
Initial Comment:
Maybe this is also applicable to other East-Asian languages.
o Unicode-ify text:
For Japanese messages, for example, RFC 1468 recommends
the ISO/IEC 2022 encoding scheme, with ASCII and a
multibyte character set both designated to GL. The original
tokenizer generates only bogus, meaningless text fragments
for Japanese messages.
o Concatenate C/J lines.
In Japanese (and maybe Chinese) messages, line folding
often breaks 'words', so folded lines are joined before
tokenizing.
o Bigram of C/J characters.
In Japanese (and often Chinese) messages, 'words' are
not separated by a character such as whitespace.
Tokenizing into grammatical 'words' would require
heuristic algorithms using a large corpus.
Instead of an expensive human-language parser, generate
bigrams from each run of kanji (ideographs for C/J/K) or
run of hiragana & katakana (syllabic letters for J).
N.B.:
- I believe the number of database items is roughly O(n^2)
for bigrams, O(n^3) for trigrams, ... and O(n^i) for
i-grams, where n is the size of the character set used.
For katakana & hiragana n is approximately 100. For kanji
it is approx. 5000 (KS X 1001), 7000 (JIS X 0208), or more
(Chinese standards). For C/J messages, 3-or-more-grams
will generate a very sparse and large database.
- Words of a single kanji should not be discarded by the
tokenizer, since most basic kanji words are 1 or 2
characters long.
Words of a single hiragana/katakana may be discarded.
- As far as I know, in Korean messages, phrases (not
'words' but similar) are often separated by whitespace,
so runs of hangul (syllabic characters for K) are not
split into n-grams.
o Punctuation: what is 'punctuation'? Many punctuation
marks, spaces, signs and symbols registered in the
Unicode Standard are added to punctuation_run_re (for
compatibility, some of them overlap with
subject_words_re), since many of them are also
registered as punctuation or symbols in the C/J/K
character set standards.
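
A minimal sketch of the first three steps above (decode to
Unicode, join folded C/J lines, bigram the runs), written for
current Python. It is not the patch itself: the function names,
the regular expressions and the code point ranges are
assumptions chosen for illustration, and hiragana and katakana
are treated as one class.

```python
import re

# Assumed BMP ranges, for illustration only (not the patch's tables).
KANJI = "\u4e00-\u9fff"                # CJK Unified Ideographs
KANA = "\u3040-\u309f\u30a0-\u30ff"    # hiragana and katakana blocks

# A line break with a C/J character on both sides is treated as folding.
FOLD_RE = re.compile("(?<=[%s%s])\r?\n(?=[%s%s])" % (KANJI, KANA, KANJI, KANA))
# A run is either all kanji or all kana.
RUN_RE = re.compile("[%s]+|[%s]+" % (KANJI, KANA))


def decode_body(raw, charset="iso-2022-jp"):
    """Decode raw message bytes to a Unicode string."""
    return raw.decode(charset, "replace")


def unfold_cj_lines(text):
    """Remove line breaks that fall between two C/J characters."""
    return FOLD_RE.sub("", text)


def cj_bigrams(text):
    """Return overlapping bigrams for each kanji run and each kana run.

    A run of one kanji is kept as a token; a run of one kana is dropped.
    """
    tokens = []
    for match in RUN_RE.finditer(text):
        run = match.group()
        if len(run) == 1:
            if "\u4e00" <= run <= "\u9fff":
                tokens.append(run)
            continue
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


raw = "日本語の\nメッセージ".encode("iso2022_jp")
text = unfold_cj_lines(decode_body(raw))
print(cj_bigrams(text))
# ['日本', '本語', 'のメ', 'メッ', 'ッセ', 'セー', 'ージ']
```

On the choice of bigrams: taking the upper bound n^i literally,
n = 7000 kanji gives at most 4.9 x 10^7 distinct bigrams but
3.43 x 10^11 distinct trigrams, which is why longer n-grams
lead to a sparse database.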
Problems:
o sb_dbexpimp.py becomes incompatible.
o Only the BMP range is supported. Surrogates are not
recognized.
o Tested with Japanese messages only, not with other
East-Asian messages.
o No batch tests. This only aims at Japanese support.
Configuration:
o To support Unicode, the following must be set in .spambayesrc:
[Tokenizer]
replace_nonascii_chars: False
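
A quick way to check that the fragment above is well-formed
INI. This sketch uses Python's standard configparser only to
illustrate the syntax; SpamBayes reads its options through its
own machinery, and the variable names here are hypothetical.

```python
import configparser

# The fragment from above, as it would appear in .spambayesrc.
RC_TEXT = """\
[Tokenizer]
replace_nonascii_chars: False
"""

parser = configparser.ConfigParser()
parser.read_string(RC_TEXT)

# "False" is parsed as a boolean, so non-ASCII characters are kept.
replace_nonascii = parser.getboolean("Tokenizer", "replace_nonascii_chars")
print(replace_nonascii)
```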
----------------------------------------------------------------------
>Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-17 18:19
Message:
Logged In: YES
user_id=529503
minor fix.
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-16 21:52
Message:
Logged In: YES
user_id=529503
> ISO/IEC 2022 encoding scheme, with ASCII and
> multibyte character set both designated to GL,
Not 'designate'. 'Invoke' is correct.
----------------------------------------------------------------------