[Spambayes] dealing with non-English data.
Tim Peters
tim.one@comcast.net
Mon, 23 Sep 2002 00:29:56 -0400
[Guido]
> Me neither. But here's something any schmuck with a recent Python
> version can try: use the regular expression \w+ compiled with the re.U
> flag to find maximal strings of word characters according to the
> Unicode locale. This should return strings of characters u for each
> of which u.isalnum() or u == '_' is true. Then all we need to assume in
> addition is that the Unicode standard defines letter-ness in a useful
> way for Korean and Chinese...
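
For concreteness, a minimal sketch of the tokenizer Guido describes,
assuming the message body has already been decoded to a unicode string
(the tokenize name is just for illustration):

    import re

    # \w+ compiled with re.UNICODE matches maximal runs of Unicode
    # word characters, i.e. characters u where u.isalnum() or u == '_'.
    word_re = re.compile(r'\w+', re.UNICODE)

    def tokenize(text):
        """Return the maximal runs of word characters in a unicode string."""
        return word_re.findall(text)

    tokenize(u'caf\xe9 au lait, 42 widgets')
    # -> [u'caf\xe9', u'au', u'lait', u'42', u'widgets']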
Does part.get_payload(decode=True) produce Unicode when appropriate? I've
no idea. Note that some Asian languages don't put whitespace between
words, so \w+ will hand back maximal runs spanning whole phrases, and a
large pile of 60-character "words" isn't going to do much good.
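
For what it's worth, as I understand the email package the answer is no:
get_payload(decode=True) only undoes the Content-Transfer-Encoding
(base64, quoted-printable) and hands back raw bytes in the part's
declared charset, so decoding to Unicode is a separate step. A sketch,
assuming a recent enough email package to have get_content_charset();
the latin-1 fallback is just one defensive choice:

    def payload_as_unicode(part):
        """Decode a non-multipart message part to a unicode string."""
        raw = part.get_payload(decode=True)   # bytes, transfer encoding undone
        charset = part.get_content_charset() or 'ascii'
        try:
            return raw.decode(charset)
        except (LookupError, UnicodeError):   # unknown or lying charset
            return raw.decode('latin-1')      # lossless byte-for-byte fallback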