[Spambayes] dealing with non-english data.
Tim Peters
tim.one@comcast.net
Mon, 23 Sep 2002 00:27:08 -0400
[Anthony Baxter]
> Hm. So I've just about finished going through the new messages I've
> dumped into my corpus, and I'm trying to narrow down the fp's and fn's.
> There's a _lot_ of stuff in these mailboxes that are non-english and
> non-ascii. At the moment, the tokenizer doesn't do a fabulous job on
> this stuff.
I gave up trying. Charset identifiers generate tokens, but everything else
goes through the same path as English. Note that every "long word" containing
at least one high-bit character generates an "8bit%:nn" token, where nn is the
percentage of characters within the word with the sign bit lit.
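A minimal sketch of what such a token generator might look like -- the name
high_bit_token and the exact rounding are my own assumptions, not the real
tokenizer's code:

```python
def high_bit_token(word):
    # Hypothetical sketch of the "8bit%:nn" token described above.
    # Counts characters with the high (sign) bit set and emits a token
    # carrying their percentage; the real tokenizer's bucketing of nn
    # may differ.
    n = sum(1 for ch in word if ord(ch) > 127)
    if n == 0:
        return None
    pct = 100 * n // len(word)
    return "8bit%%:%d" % pct
```

For example, a six-character word with two high-bit characters would yield
"8bit%:33", while a pure-ASCII word yields no token at all.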
That much was enough so that I have no Asian-language false negatives (and
therefore there's nothing I can do to test possible improvements -- it's
just not an issue in my test data anymore).
Beyond that, I'm clueless. Greg (Ward) says python.org gets tons of Asian
spam, so when I move to that corpus I'll doubtless be inspired to do better
on it.