Character encodings and codecs
nobody at nowhere.com
Sat Feb 1 19:41:41 CET 2003
On Sat, 1 Feb 2003 16:17:09 +0100, "vincent wehren" <v.wehren at home.nl> wrote:
>Well, that depends on the original encoding, doesn't it? If it is, let's
>say, a DBCS character set, you could check whether the last byte of the
>chunk you read is within the lead-byte range of the input character set. If
>the last one is a lead byte, you know you need to read at least one
>more byte to have the entire DBCS character. What encodings
>do you want to process?
So I would have to read it in byte by byte and manually check where I
can make a break. Is there no Python module that would make this
easier? I thought that's what the codecs module does, but I can't really
understand it.
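
For reference, a minimal sketch of how the standard codecs module can take care of the chunk boundaries, assuming a current Python 3 interpreter (not the Python available when this was posted). The helper name decode_chunks is made up for illustration; codecs.getincrementaldecoder is the real standard-library call. The decoder holds back the bytes of an unfinished character until the next chunk arrives, so no manual lead-byte check is needed.

```python
import codecs

def decode_chunks(chunks, encoding="euc_jp"):
    """Yield decoded text for each byte chunk (hypothetical helper).

    The incremental decoder buffers the trailing bytes of a character
    that was split across two chunks and emits it once it is complete.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True makes the decoder raise if an incomplete character is left over.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# Three kanji, two bytes each in EUC-JP; the split falls inside the second one.
original = "\u65e5\u672c\u8a9e"
data = original.encode("euc_jp")
pieces = [data[:3], data[3:]]
result = "".join(decode_chunks(pieces))
```

Splitting the input at an odd byte offset is deliberate here: it is exactly the case where a plain bytes.decode() on each chunk would fail.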
The specific project I'm working on now would require reading EUC-JP,
storing characters internally as Unicode, and writing UTF-8.
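
A rough sketch of that EUC-JP to Unicode to UTF-8 pipeline, again assuming current Python 3, where the built-in open() accepts an encoding argument and does the decoding and encoding itself. The function name eucjp_to_utf8, the chunk_chars parameter, and the file names are hypothetical and only there for illustration.

```python
import os
import tempfile

def eucjp_to_utf8(src_path, dst_path, chunk_chars=4096):
    """Copy a text file, decoding EUC-JP on input and encoding UTF-8 on output."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding="euc_jp", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            text = src.read(chunk_chars)  # a str of Unicode characters, not bytes
            if not text:
                break
            dst.write(text)

# Small round trip through temporary files to show the intended use.
sample = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8\n"
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.txt")
    dst_path = os.path.join(tmp, "out.txt")
    with open(src_path, "wb") as f:
        f.write(sample.encode("euc_jp"))
    eucjp_to_utf8(src_path, dst_path, chunk_chars=2)
    with open(dst_path, "rb") as f:
        converted = f.read()
```

In text mode, read() counts characters rather than bytes, so a chunk never ends in the middle of a multi-byte character.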