[Spambayes] dealing with non-English data.

Tim Peters tim.one@comcast.net
Mon, 23 Sep 2002 00:27:08 -0400


[Anthony Baxter]
> Hm. So I've just about finished going through the new messages I've
> dumped into my corpus, and I'm trying to narrow down the fp's and fn's.
> There's a _lot_ of stuff in these mailboxes that is non-English and
> non-ASCII.  At the moment, the tokenizer doesn't do a fabulous job on
> this stuff.

I gave up trying.  Charset identifiers generate tokens, but everything else
goes through the same path as English.  Note that every "long word" containing
at least one high-bit character generates an "8bit%:nn" token, where nn is the
percentage of characters within the word with the sign bit lit.
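A minimal sketch of that idea (not the actual Spambayes tokenizer -- the
function name, the length cutoff of 13, and the rounding are all my own
assumptions here) might look like:

```python
def high_bit_token(word, min_length=13):
    """Sketch of an '8bit%:nn' token for a "long word", or None.

    Hypothetical illustration only: the real tokenizer's length
    threshold and exact arithmetic may differ.
    """
    if len(word) < min_length:
        return None  # not a "long word"
    # Count characters with the high (sign) bit set, i.e. >= 0x80.
    high = sum(1 for ch in word if ord(ch) & 0x80)
    if high == 0:
        return None  # pure-ASCII word, no token
    # nn is the percentage of high-bit characters in the word.
    return "8bit%%:%d" % round(100 * high / len(word))
```

For example, a 13-character word made entirely of high-bit characters
would yield "8bit%:100", while an all-ASCII word yields nothing.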

That much was enough so that I have no Asian-language false negatives (and
therefore there's nothing I can do to test possible improvements -- it's
just not an issue in my test data anymore).

Beyond that, I'm clueless.  Greg (Ward) says python.org gets tons of Asian
spam, so when I move to that corpus I'll doubtless be inspired to do better
on it.