[spambayes-bugs] [ spambayes-Patches-824651 ] Japanese (and so on) message support

Fri Oct 17 05:19:20 EDT 2003

Patches item #824651, was opened at 2003-10-16 17:23
Message generated for change (Comment added) made by hatukanezumi
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=824651&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Hatuka*nezumi (hatukanezumi)
Assigned to: Nobody/Anonymous (nobody)
Summary: Japanese (and so on) message support

Initial Comment:
This may also be applicable to other East Asian languages.

o Unicode'ify text:
  For Japanese messages, for example, RFC 1468 recommends
  the ISO/IEC 2022 encoding scheme, with ASCII and a
  multibyte character set both designated to GL.  The
  original tokenizer generates only bogus, meaningless
  text fragments for Japanese messages.

o Concatenate C/J lines.
  In Japanese (and maybe Chinese) messages, line folding
  often breaks 'words'.

o Bigram of C/J characters.
  In Japanese (and often Chinese) messages, 'words' are
  not separated by characters such as whitespace.
  Tokenizing into grammatical 'words' would require
  heuristic algorithms backed by a large corpus.
  Instead of an expensive human-language parser, generate
  bigrams from each run of kanji (ideographs for C/J/K) or
  run of hiragana & katakana (syllabic letters for J).

  N.B.:
  - I believe the number of database items is roughly O(n^2)
    for bigrams, O(n^3) for trigrams, ... and O(n^i) for
    i-grams, where n is the size of the character set used.
    For katakana & hiragana, n is approximately 100.  For
    kanji it is approx. 5000 (KS X 1001), 7000 (JIS X 0208),
    or more (Chinese standards).  For C/J messages, 3-grams
    and longer will generate a very sparse and large
    database.

  - Single-kanji words should not be discarded by the
    tokenizer, since most basic kanji words are 1 or 2
    characters long.
    Single hiragana/katakana words may be discarded.

  - As far as I know, in Korean messages, phrases (not 'words'
    but similar) are often separated by whitespace, so a run
    of hangul (syllabic characters for K) need not be split
    into n-grams.

o Punctuation --- what is 'punctuation'?  Many punctuation
  marks, spaces, signs and symbols registered in the
  Unicode Standard are added to punctuation_run_re (for
  compatibility, some of them overlap with
  subject_words_re), since many of them are also
  registered as punctuation or symbols in the C/J/K
  character set standards.
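
A minimal sketch of the first three items (decode to Unicode, rejoin
folded C/J lines, emit bigrams), assuming Python 3.  This is not the
code from the patch: the names decode_body, unfold_cj and cj_bigrams
are hypothetical, the character ranges are a BMP-only approximation,
and hiragana and katakana are treated as one class of run.

```python
import re

# Assumed character classes (BMP only); the patch may use different ranges.
KANJI = "\u4e00-\u9fff"               # CJK Unified Ideographs
KANA = "\u3040-\u309f\u30a0-\u30ff"   # hiragana block + katakana block
CJ = KANJI + KANA

# A run is either all kanji or all kana.
CJ_RUN_RE = re.compile(f"[{KANJI}]+|[{KANA}]+")
# A line break (with optional blanks) that sits between two C/J characters.
CJ_FOLD_RE = re.compile(rf"(?<=[{CJ}])[ \t]*\r?\n[ \t]*(?=[{CJ}])")


def decode_body(raw, charset="iso-2022-jp"):
    """Decode raw message bytes to str; undecodable bytes become U+FFFD."""
    return raw.decode(charset, "replace")


def unfold_cj(text):
    """Remove line breaks that fall between two C/J characters."""
    return CJ_FOLD_RE.sub("", text)


def cj_bigrams(text):
    """Yield bigrams from each kanji or kana run.

    A single kanji is kept as a token; a single kana is dropped.
    """
    for match in CJ_RUN_RE.finditer(text):
        run = match.group()
        if len(run) == 1:
            if "\u4e00" <= run <= "\u9fff":
                yield run
            continue
        for i in range(len(run) - 1):
            yield run[i:i + 2]


if __name__ == "__main__":
    raw = "日本\n語です".encode("iso-2022-jp")
    print(list(cj_bigrams(unfold_cj(decode_body(raw)))))
```

Lines of non-C/J text are left folded as they are, because the
lookbehind and lookahead both require a C/J character.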

Problems:

o sb_dbexpimp.py becomes incompatible.

o Only the BMP range is supported.  Surrogates are not recognized.

o Tested with Japanese messages only, not with other
  East Asian messages.

o No batch tests.  This only aims at Japanese support.
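
To illustrate the BMP limitation above: a character whose code point
is above U+FFFF is stored in UTF-16 as a pair of surrogate code
units, so code that only matches BMP ranges will not see it as a
single ideograph.  A small sketch, assuming Python 3 (where str is a
sequence of code points); outside_bmp and utf16_units are
hypothetical helper names.

```python
def outside_bmp(text):
    """Return the characters of text whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def utf16_units(ch):
    """Return the UTF-16 code units used to encode the character ch."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    sample = "日\U00020000"   # U+65E5 (BMP) and U+20000 (CJK Extension B)
    for ch in sample:
        print(hex(ord(ch)), [hex(u) for u in utf16_units(ch)])
```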

Configuration:

o To support Unicode, .spambayesrc must contain:
    [Tokenizer]
    replace_nonascii_chars: False
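
As a quick sanity check of that fragment, the following sketch parses
it with Python's standard configparser module.  SpamBayes loads its
options through its own code, so this only shows that the fragment is
well-formed INI with a boolean value; replace_nonascii_enabled is a
hypothetical helper name.

```python
import configparser

SAMPLE = """\
[Tokenizer]
replace_nonascii_chars: False
"""


def replace_nonascii_enabled(ini_text):
    """Return the boolean value of the option, or None if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getboolean("Tokenizer", "replace_nonascii_chars", fallback=None)


if __name__ == "__main__":
    print(replace_nonascii_enabled(SAMPLE))
```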

----------------------------------------------------------------------

>Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-17 18:19

Message:
Logged In: YES 
user_id=529503

minor fix.

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-16 21:52

Message:
Logged In: YES 
user_id=529503

> ISO/IEC 2022 encoding scheme, with ASCII and
> multibyte character set both designated to GL,

Not 'designate'.  'Invoke' is correct.

----------------------------------------------------------------------
