[spambayes-bugs] [ spambayes-Patches-824651 ] Japanese (and/or other CJK languages) message support

Sat Nov 29 03:17:32 EST 2003

Patches item #824651, was opened at 2003-10-16 17:23
Message generated for change (Comment added) made by hatukanezumi
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=824651&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Hatuka*nezumi (hatukanezumi)
Assigned to: Nobody/Anonymous (nobody)
Summary: Japanese (and/or other CJK languages) message support

Initial Comment:
Maybe this is also applicable to other East-Asian languages.

o Unicode'ify text:
  For example, for Japanese messages, RFC 1468 recommends
  the ISO/IEC 2022 encoding scheme, with ASCII and a
  multibyte character set both designated to GL.  The
  original tokenizer generates only bogus, meaningless
  text fragments for Japanese messages.

o Concatenate C/J lines.
  In Japanese (and maybe Chinese) messages, line folding
  often breaks 'words'.

o Bigrams of C/J characters.
  In Japanese (and often Chinese) messages, 'words' are
  not separated by characters such as whitespace.
  Tokenization into grammatical 'words' would require
  heuristic algorithms using a large corpus.
  Instead of an expensive human-language parser, generate
  bigrams from each run of kanji (ideographs for C/J/K) or
  run of hiragana & katakana (syllabic letters for J).

  N.B.:
  - I believe the number of database items is roughly O(n^2)
    for bigrams, O(n^3) for trigrams, ... and O(n^i) for
    i-grams, where n is the size of the character set in
    use.  For katakana & hiragana, n is approximately 100.
    For kanji it is approx. 5000 (KS X 1001), 7000
    (JIS X 0208), or more (Chinese standards).  For C/J
    messages, 3-or-more-grams will generate a very sparse
    and large database.

  - Words of a single kanji should not be discarded by
    the tokenizer, since most basic kanji words are 1 or
    2 characters long.
    Words of a single hiragana/katakana may be discarded.

  - As far as I know, in Korean messages, phrases (not
    'words' but similar) are often separated by whitespace,
    so a run of hangul (syllabic characters for K) need
    not be split into n-grams.

o Punctuation --- what is 'punctuation'?  Many
  punctuation marks, spaces, signs and symbols registered
  in the Unicode Standard are added to punctuation_run_re
  (for compatibility, some of them overlap with
  subject_words_re), since many of them are also
  registered as punctuation or symbols in the C/J/K
  character set standards.
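
The first three items above (decode to Unicode, rejoin folded C/J lines, emit bigrams per run) can be sketched roughly as follows. This is a minimal illustration and not the code in the patch: the character ranges, the regular expressions and the function names (unfold_cj_lines, bigrams, cj_tokens) are assumptions made for the example, and only BMP characters are considered.

```python
import re

# Assumed character classes (BMP only); the ranges the patch really uses
# are not shown in this thread.
KANJI = "\u4e00-\u9fff"
KANA = "\u3040-\u309f\u30a0-\u30ff"   # hiragana + katakana
CJ_CHAR = "[" + KANJI + KANA + "]"

# A line break with a C/J character on both sides is treated as folding.
FOLD_RE = re.compile("(?<=" + CJ_CHAR + ")\r?\n[ \t]*(?=" + CJ_CHAR + ")")
# Group 1: a run of kanji.  Group 2: a run of hiragana/katakana.
RUN_RE = re.compile("([" + KANJI + "]+)|([" + KANA + "]+)")


def unfold_cj_lines(text):
    """Remove line breaks that fall between two C/J characters."""
    return FOLD_RE.sub("", text)


def bigrams(run):
    """Return the overlapping two-character slices of a run."""
    return [run[i:i + 2] for i in range(len(run) - 1)]


def cj_tokens(raw, charset="iso2022_jp"):
    """Decode raw bytes, unfold C/J lines and emit bigram tokens."""
    text = unfold_cj_lines(raw.decode(charset, "replace"))
    tokens = []
    for match in RUN_RE.finditer(text):
        kanji_run, kana_run = match.groups()
        if kanji_run:
            # A single kanji is kept as a token of its own.
            tokens.extend(bigrams(kanji_run) if len(kanji_run) > 1
                          else [kanji_run])
        elif len(kana_run) > 1:
            # A single kana is discarded; longer runs give bigrams.
            tokens.extend(bigrams(kana_run))
    return tokens
```

For a message body whose first line ends in the middle of a kanji word, the break is removed before the runs are split, so the bigram spanning the fold is still produced.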

Problems:

o sb_dbexpimp.py becomes incompatible.

o Only the BMP range is supported.  Surrogates are not
  recognized.

o Tested with Japanese messages only, not with other
  East-Asian messages.

o No batch tests.  This aims only at Japanese support.

Configuration:

o To support Unicode, the following must be set in .spambayesrc:
    [Tokenizer]
    replace_nonascii_chars: False
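
As a quick way to check that a given .spambayesrc carries this setting, the fragment can be parsed with Python's standard configparser. This is only a sketch: SpamBayes reads its options through its own code, and the helper name unicode_tokenizing_enabled is made up for the example.

```python
import configparser

# The configuration fragment from the patch description.
SAMPLE_RC = """\
[Tokenizer]
replace_nonascii_chars: False
"""


def unicode_tokenizing_enabled(rc_text):
    """Return True when non-ASCII characters are NOT replaced."""
    parser = configparser.ConfigParser()
    parser.read_string(rc_text)
    # getboolean() accepts True/False, yes/no, on/off and 1/0.
    return not parser.getboolean("Tokenizer", "replace_nonascii_chars")
```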

----------------------------------------------------------------------

>Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-29 17:17

Message:
Logged In: YES 
user_id=529503

o hammie.py / sb_filter.py / sb_xmlrpcserver.py:
  - clues in the X-Spambayes-Evidence: header will be
    MIME header encoded.
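
To illustrate what MIME header encoding of the clues means, here is a small sketch using the standard email.header module. The "'token': probability" layout of the header value and the function names are assumptions for the example, not taken from the patch.

```python
from email.header import Header, decode_header, make_header


def encode_evidence(clues, charset="utf-8"):
    """Build an RFC 2047 encoded value from (token, probability) pairs."""
    text = "; ".join("'%s': %.2f" % (token, prob) for token, prob in clues)
    # Header.encode() returns a pure-ASCII string of encoded-words.
    return Header(text, charset).encode()


def decode_evidence(value):
    """Turn the encoded header value back into Unicode text."""
    return str(make_header(decode_header(value)))
```

The encoded value is safe to put in a mail header as is; a mail client or a script can recover the Unicode tokens with decode_evidence().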

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-26 17:31

Message:
Logged In: YES 
user_id=529503

server patch 1.0a7-0.6

o Dibbler performs HTTP charset conversion
  (to/from internal UTF-8).
o New configuration option: [html_ui] http_charset
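
The conversion between the browser-side charset and the internal UTF-8 can be pictured as below. This is a hedged sketch and not Dibbler's code: from_http and to_http are hypothetical names, and the http_charset argument simply stands for the value of the new [html_ui] http_charset option.

```python
def from_http(body, http_charset):
    """Bytes received from the browser -> internal UTF-8 bytes."""
    return body.decode(http_charset, "replace").encode("utf-8")


def to_http(data, http_charset):
    """Internal UTF-8 bytes -> bytes sent to the browser."""
    return data.decode("utf-8").encode(http_charset, "replace")


def content_type(http_charset):
    """Content-Type value announcing the charset used on the wire."""
    return "text/html; charset=%s" % http_charset
```

Characters that cannot be represented in the target charset are replaced here rather than raising an error; whether the patch does the same is not stated in this thread.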

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-11-26 09:13

Message:
Logged In: YES 
user_id=552329

Added the sb_dbexpimp.py patch (v1.3).  Will look at the 
rest, shortly - thanks for your patience!

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-11 21:01

Message:
Logged In: YES 
user_id=529503

o db_expimp.py is incompatible again.  It exports / imports data
  as UTF-8.

o Unicode'ified sb_server.py.
  - HTTP charset is UTF-8.
  - clues in X-Spambayes-Evidence will be MIME header
    encoded.
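
Exporting and importing token data as UTF-8 could look roughly like this. The flat-file layout (token, spam count, ham count per row) and the function names are assumptions for the sake of the example; the real export format is defined by the script itself.

```python
import csv


def export_counts(path, counts):
    """Write {token: (spam, ham)} to a UTF-8 encoded CSV file."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for token, (spam, ham) in sorted(counts.items()):
            writer.writerow([token, spam, ham])


def import_counts(path):
    """Read the file written by export_counts() back into a dict."""
    counts = {}
    with open(path, encoding="utf-8", newline="") as f:
        for token, spam, ham in csv.reader(f):
            counts[token] = (int(spam), int(ham))
    return counts
```

A file written this way cannot be read back correctly by a version that assumes a different encoding, which is one way such an export/import script becomes incompatible between versions.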

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-29 19:29

Message:
Logged In: YES 
user_id=529503

OK. I'll keep testing the code until it is added.

minor fix: 'replace_nonascii_chars' option works correctly, etc.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-10-21 12:52

Message:
Logged In: YES 
user_id=552329

Just a wee note to say thanks for this, and that someone will 
get to looking at adding this in, but everyone's pretty busy 
with other stuff at the moment!

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-19 19:02

Message:
Logged In: YES 
user_id=529503

Fix for Korean messages.
Hangul phrases/words can be 1 or 2 chars long.
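
One way to picture this fix: whitespace-separated hangul words are kept whatever their length, while other short words are still dropped. The sketch below assumes a minimum length of 3 for non-hangul words; that constant, the hangul range and the function name are illustrative assumptions, not taken from the patch.

```python
import re

# Precomposed hangul syllables (BMP); an assumed range for the example.
HANGUL_RE = re.compile("[\uac00-\ud7a3]+")
MIN_WORD_LEN = 3   # assumed minimum length for non-hangul words


def korean_aware_tokens(text):
    """Split on whitespace; keep hangul words of any length."""
    tokens = []
    for word in text.split():
        if HANGUL_RE.fullmatch(word) or len(word) >= MIN_WORD_LEN:
            tokens.append(word)
    return tokens
```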

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-17 18:19

Message:
Logged In: YES 
user_id=529503

minor fix.

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-16 21:52

Message:
Logged In: YES 
user_id=529503

> ISO/IEC 2022 encoding scheme, with ASCII and
> multibyte character set both designated to GL,

Not 'designate'.  'Invoke' is correct.
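
The distinction can be seen in the bytes of an ISO-2022-JP message: escape sequences designate a character set to G0, and G0 is the set invoked in GL, so every byte stays in the 7-bit range. The sketch below uses Python's iso2022_jp codec; the function names are made up for the example, and only the four escape sequences of RFC 1468 are matched.

```python
import re

# RFC 1468 escape sequences: ESC ( B (ASCII), ESC ( J (JIS X 0201 Roman),
# ESC $ @ and ESC $ B (JIS X 0208).  Each one designates a set to G0.
ESCAPE_RE = re.compile(rb"\x1b(?:\$[@B]|\([BJ])")


def designations(data):
    """Return the designation escape sequences found in the bytes."""
    return ESCAPE_RE.findall(data)


def is_seven_bit(data):
    """True when every byte is below 0x80, i.e. only GL is used."""
    return all(byte < 0x80 for byte in data)
```

Encoding a string that mixes ASCII and kanji with "iso2022_jp" and passing the result to designations() shows one switch to JIS X 0208 and one switch back to ASCII, with no 8-bit bytes in between.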

----------------------------------------------------------------------
