[Spambayes] dealing with non-English data.

Tim Peters tim.one@comcast.net
Mon, 23 Sep 2002 00:29:56 -0400


[Guido]
> Me neither.  But here's something any schmuck with a recent Python
> version can try: use the regular expression \w+ compiled with the re.U
> flag to find maximal strings of word characters according to the
> Unicode character database.  This should return strings of characters
> u for each of which u.isalnum() or u == '_' holds.  Then all we need
> to assume in
> addition is that the Unicode standard defines letter-ness in a useful
> way for Korean and Chinese...
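
For concreteness, a minimal sketch of that suggestion (it assumes the
text has already been decoded to a Unicode string; the Korean sample
is just an illustration):

    import re

    # \w+ compiled with re.UNICODE matches maximal runs of characters
    # the Unicode database classifies as alphanumeric, plus '_'.
    word_re = re.compile(r"\w+", re.UNICODE)

    def tokenize(text):
        """Return the maximal word-character runs in a Unicode string."""
        return word_re.findall(text)

    print(tokenize(u"spam, ham: eggs_42 \uc548\ub155"))
    # -> [u'spam', u'ham', u'eggs_42', u'\uc548\ub155']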

Does part.get_payload(decode=True) produce Unicode when appropriate?  I've
no idea.  Note that some Asian languages don't put whitespace between
words, so \w+ would hand back whole runs of text as single tokens; a
large pile of 60-character "words" isn't going to do much good.
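
For what it's worth, in the email package get_payload(decode=True) only
undoes the Content-Transfer-Encoding (base64, quoted-printable) and
hands back raw bytes, not Unicode; getting Unicode takes a second step
using the part's declared charset.  A hedged sketch, assuming a
non-multipart part and the get_content_charset() helper from today's
email API (an assumption for the version at hand):

    def payload_as_unicode(part):
        """Decode one message part to Unicode via its declared charset."""
        raw = part.get_payload(decode=True)    # bytes after CTE decoding
        charset = part.get_content_charset() or "us-ascii"
        try:
            return raw.decode(charset, "replace")
        except LookupError:                    # charset name we don't know
            return raw.decode("latin-1", "replace")

    # usage: text = payload_as_unicode(part)
    #        where part is an email.message.Message (non-multipart)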