[Bob Coe]
> I think I detect an a priori confidence that the same version of
> the Spambayes classifier, if properly trained, can work effectively
> on both European and Asian languages. I wonder if that confidence
> isn't unduly optimistic. For example, ...

I expect SpamBayes in its current form to be useless for users whose email
is primarily in Asian languages, unless all their ham is Asian and all their
spam is non-Asian.

> I presume that the Spambayes classifier tokenizes the incoming
> character stream according to an algorithm that depends heavily on
> clearly defined word markers (spaces and punctuation marks)

In the message body, we split on whitespace, and that's all.  Punctuation is
a non-SpamBayes concept in the body.  Header tokenization makes many more
assumptions, but the legal characters in email headers are constrained by
standards in Anglo-centric ways.

> that are largely absent, or at least less prominent, in Chinese. But
> if you try to tokenize the individual characters of written Chinese,
> you'll find that they're much more context sensitive than the words
> of an English sentence are. [etc]

The tokenizer pays no attention to any of that.  The most common output when
tokenizing an Asian language is a long string of synthesized "skip" tokens
(splitting on whitespace yields long strings then, and strings of length
greater than 12 get replaced by a synthesized skip token).  In addition, the
Outlook client has

replace_nonascii_chars: True

enabled, which replaces "control" and "high bit" characters each with a
question mark before tokenization.  So in the Outlook client, the primary
output from parsing the body of an Asian language message is a bunch of
synthesized "skip: ? N" tokens.
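
To make that concrete, here's a rough sketch of the body behavior described
above.  It is not the actual SpamBayes tokenizer code: the function name, the
regular expression, and the rounding of the length down to a multiple of 10
are my assumptions for illustration, and the real tokenizer does more (and
may bucket lengths differently).  Only the whitespace split, the length-12
cutoff, and the question-mark replacement come from the description above.

```python
import re

# Cutoff taken from the description above: longer strings become skip tokens.
MAX_WORD_LEN = 12

# Hypothetical stand-in for the replace_nonascii_chars option: anything that
# is not printable ASCII or tab/newline/carriage return becomes "?".
_NON_ASCII_OR_CONTROL = re.compile(r"[^\x20-\x7e\t\n\r]")


def sketch_tokenize_body(text, replace_nonascii_chars=False):
    """Yield body tokens: split on whitespace, nothing else."""
    if replace_nonascii_chars:
        text = _NON_ASCII_OR_CONTROL.sub("?", text)
    for word in text.split():
        if len(word) <= MAX_WORD_LEN:
            yield word
        else:
            # Assumed token shape: first character plus a coarse length.
            yield "skip:%s %d" % (word[0], len(word) // 10 * 10)


# A run of 20 CJK characters with no whitespace collapses to one skip token.
sample = "你好" * 10
print(list(sketch_tokenize_body(sample)))
print(list(sketch_tokenize_body(sample, replace_nonascii_chars=True)))
```

With the replacement switched on, every such message body produces the same
handful of "skip:? N" tokens no matter what it says, which is why the
classifier can tell "Asian" from "not Asian" but can't tell Asian ham from
Asian spam.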

That's great for weeding out Asian spam for European and American users, and
sucks for everyone else.  If everyone else wants something better, everyone
else can volunteer to do the extensive research it takes to do something
better <wink>.

