[Python-Dev] Adding Japanese Codecs to the distro

"Martin v. Löwis" martin@v.loewis.de
Wed, 22 Jan 2003 23:07:10 +0100


M.-A. Lemburg wrote:
> Perhaps Hye-Shik Chang could join you in the effort, since he's
> the author of the KoreanCodecs package which has somewhat
> similar problem scope (that of stateful encodings with a huge
> number of mappings) ?

I believe (without checking in detail) that the "statefulness" is also 
an issue in these codecs.

Many of the CJK encodings aren't stateful beyond being multi-byte 
(except for the iso-2022 ones). IOW, there is a non-trivial state only 
if you process the input byte-for-byte: you have to know whether you are 
at the first or second byte (and what the first byte was if you are at 
the second byte). AFAICT, both Japanese codecs assume that you can 
always look at the second byte when you get the first byte.

Of course, this assumption is wrong if you operate in a stream mode, and 
read the data in, say, chunks of 1024 bytes: such a chunk may split 
exactly between a first and second byte (*).

In these cases, I believe, both codecs would give incorrect results. 
Please correct me if I'm wrong.
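
A minimal sketch of the chunk-boundary problem described above. It does not use either of the Japanese codec packages under discussion. As an assumption for illustration, it relies on the euc_jp codec and codecs.getincrementaldecoder() from the standard library of a modern Python 3, both of which postdate this message. The sample string and the split position are arbitrary choices.

```python
import codecs

text = "日本語"
data = text.encode("euc_jp")   # each of these characters takes two bytes

# Simulate a stream read that ends between a first and a second byte:
# the first chunk holds one full character plus a dangling first byte.
chunks = [data[:3], data[3:]]

# Decoding each chunk on its own assumes the second byte is always
# available, so the dangling first byte is reported as an error.
naive_error = None
try:
    naive = "".join(chunk.decode("euc_jp") for chunk in chunks)
except UnicodeDecodeError as exc:
    naive_error = exc

# An incremental decoder remembers the dangling first byte and
# completes the character when the next chunk arrives.
decoder = codecs.getincrementaldecoder("euc_jp")()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))  # flush; fails if bytes remain
streamed = "".join(pieces)

print("per-chunk decode failed:", naive_error is not None)
print("incremental decode round-trips:", streamed == text)
```

The only state the incremental decoder carries between calls is the pending first byte, which matches the "non-trivial state" mentioned above for the non-iso-2022 encodings.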

Regards,
Martin

(*) The situation is worse for GB 18030, which also has 4-byte encodings.
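
To illustrate the footnote: with a 4-byte sequence, a chunk boundary can fall in three different places inside one character. The sketch below assumes the gb18030 codec and the incremental decoder API of a modern Python 3 standard library (not anything available in the packages discussed here), and uses U+10000 only as a convenient example of a character that needs four bytes.

```python
import codecs

char = "\U00010000"            # a character outside the BMP
encoded = char.encode("gb18030")

# Try every split position inside the 4-byte sequence.
results = []
for cut in range(1, len(encoded)):
    decoder = codecs.getincrementaldecoder("gb18030")()
    out = decoder.decode(encoded[:cut])          # incomplete: nothing emitted yet
    out += decoder.decode(encoded[cut:], final=True)
    results.append(out == char)

print("bytes per character:", len(encoded))
print("all split positions round-trip:", all(results))
```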