[Spambayes] dealing with non-english data.

Anthony Baxter anthony@interlink.com.au
Mon, 23 Sep 2002 10:58:05 +1000


Hm. So I've just about finished going through the new messages I've dumped
into my corpus, and I'm trying to narrow down the fp's and fn's. There's a
_lot_ of stuff in these mailboxes that is non-English and non-ASCII. At the
moment, the tokenizer doesn't do a fabulous job on this stuff. I'm wondering
about decoding the text using its declared character set, or else tagging the
words with the character set (if it's non-English).
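
A rough sketch of what the two ideas could look like together, in Python.
This is only an illustration, not the Spambayes tokenizer: the function name
tokenize_with_charset, the "charset:" token prefix, and the latin-1 fallback
are all assumptions made up for the example, and splitting on whitespace is
a stand-in for real tokenizing (it will not split languages that don't put
spaces between words).

```python
def _is_ascii(word):
    # True when every character is plain 7-bit ASCII.
    return all(ord(ch) < 128 for ch in word)


def tokenize_with_charset(payload, charset):
    """Decode raw bytes with the declared charset and yield tokens.

    Hypothetical helper: ASCII words are lowercased and yielded as-is;
    any other word is tagged with the charset it was decoded from.
    """
    try:
        text = payload.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or bytes that don't fit the declared charset:
        # fall back to latin-1 (which accepts any byte) and mark it unknown.
        text = payload.decode("latin-1")
        charset = "unknown"
    for word in text.split():
        if _is_ascii(word):
            yield word.lower()
        else:
            yield "charset:%s:%s" % (charset.lower(), word)
```

Called on the raw body bytes plus whatever charset the Content-Type header
declares (for example "koi8-r" or "euc-kr"), this gives the classifier
tokens that carry the charset as a clue while leaving ASCII words untouched.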

Unfortunately my knowledge of character set issues is up there with my
knowledge of high-altitude yak milking, but I'd love to know if we've got
anyone on this list who knows more about this - for instance, tokenizing
koi8-r, or euc-kr...

Anthony
--
Anthony Baxter     <anthony@interlink.com.au>
It's never too late to have a happy childhood.