[Spambayes] dealing with non-english data.
Mon, 23 Sep 2002 10:58:05 +1000
Hm. So I've just about finished going through the new messages I've dumped
into my corpus, and I'm trying to narrow down the fp's and fn's. There's a
_lot_ of stuff in these mailboxes that is non-English and non-ASCII. At the
moment, the tokenizer doesn't do a fabulous job on this stuff. I'm wondering
about decoding each message using its declared character set, or else tagging
the words with the character set (if it's non-English).
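
To make the tagging idea concrete, here is a rough sketch of what I have in
mind. It is not the current Spambayes tokenizer: the function name
tokenize_with_charset and the "charset:word" token format are made up for
illustration, and it assumes a modern Python 3 with the standard email
package. It decodes each text part using its declared charset and prefixes
every word from a non-ASCII part with that charset name.

```python
from email import message_from_bytes

# Charset labels treated as "plain English" here (an assumption of this sketch).
ASCII_NAMES = {"us-ascii", "ascii"}

def tokenize_with_charset(raw_bytes):
    """Yield whitespace-split, lowercased tokens from the text parts of a message.

    Words from parts that declare a non-ASCII charset are prefixed with the
    charset name, e.g. "koi8-r:<word>" (hypothetical token format).
    """
    msg = message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # A missing charset parameter is treated as us-ascii.
        charset = (part.get_content_charset() or "us-ascii").lower()
        payload = part.get_payload(decode=True) or b""
        try:
            text = payload.decode(charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown or mislabeled charset: latin-1 maps every byte, so it never fails.
            text = payload.decode("latin-1")
        for word in text.split():
            word = word.lower()
            if charset in ASCII_NAMES:
                yield word
            else:
                yield "%s:%s" % (charset, word)
```

Splitting on whitespace is almost certainly too naive for languages that
don't put spaces between words, so this only covers the tagging half of the
question, not how to tokenize such text sensibly.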
Unfortunately my knowledge of character set issues is up there with my
knowledge of high-altitude yak milking, but I'd love to know if we've got
anyone on this list who knows more about this - for instance, tokenizing
KOI8-R, or EUC-KR...
Anthony Baxter <email@example.com>
It's never too late to have a happy childhood.