[Python-Dev] Adding Japanese Codecs to the distro

Tamito KAJIYAMA kajiyama@grad.sccs.chukyo-u.ac.jp
Thu, 23 Jan 2003 07:50:24 +0900


"Martin v. Löwis" <martin@v.loewis.de> writes:
|
| M.-A. Lemburg wrote:
| > Perhaps Hye-Shik Chang could join you in the effort, since he's
| > the author of the KoreanCodecs package which has somewhat
| > similar problem scope (that of stateful encodings with a huge
| > number of mappings) ?
| 
| I believe (without checking in detail) that the "statefulness" is also 
| an issue in these codecs.
| 
| Many of the CJK encodings aren't stateful beyond being multi-byte 
| (except for the iso-2022 ones). IOW, there is a non-trivial state only 
| if you process the input byte-for-byte: you have to know whether you are 
| at the first or second byte (and what the first byte was if you are at 
| the second byte). AFAICT, both Japanese codecs assume that you can 
| always look at the second byte when you get the first byte.

Right, as far as my codecs are concerned.  All decoders in the
JapaneseCodecs package assume that the input byte sequence does
not end in the middle of a multi-byte character.  The iso-2022
decoders even assume that the input sequence is a "valid" text
as defined in RFC1468 (i.e. the text must end in the US ASCII
mode).  However, AFAIK, these assumptions in the decoders seem
well accepted in real-world applications.
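
A minimal sketch of what this assumption means in practice.  It uses
the euc_jp codec bundled with current Python 3 as a stand-in (an
assumption; it is not the JapaneseCodecs package itself), and the
helper name try_decode is hypothetical:

```python
# Illustration only: the stdlib euc_jp codec stands in for the
# JapaneseCodecs decoders discussed in this thread.
text = "日本語"
data = text.encode("euc_jp")  # each of these characters encodes to two bytes


def try_decode(chunk):
    """Return the decoded text, or None if the one-shot decoder rejects it."""
    try:
        return chunk.decode("euc_jp")
    except UnicodeDecodeError:
        return None


whole = try_decode(data)          # input ends on a character boundary
truncated = try_decode(data[:3])  # input ends after a lead byte
```

A one-shot decoder of this kind has no place to keep a dangling lead
byte, so the truncated input is rejected rather than partially decoded.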

The StreamReader/Writer classes in JapaneseCodecs can cope with
the statefulness, BTW.
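
As a rough illustration of how a stream-oriented decoder can carry
state between calls, here is a sketch that feeds input one byte at a
time.  It uses the incremental decoders from current Python 3's codecs
module as a stand-in (an assumption; this is not the StreamReader code
in JapaneseCodecs), and decode_bytewise is a hypothetical helper:

```python
import codecs


def decode_bytewise(data, encoding):
    """Decode `data` one byte at a time with an incremental decoder."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for byte in data:
        # A lone lead byte (or a partial escape sequence) yields "" here;
        # the decoder keeps it until the rest of the sequence arrives.
        pieces.append(decoder.decode(bytes([byte])))
    # final=True tells the decoder that no more input will follow.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


text = "日本語"
from_euc = decode_bytewise(text.encode("euc_jp"), "euc_jp")
from_iso = decode_bytewise(text.encode("iso2022_jp"), "iso2022_jp")
```

The decoder object is what holds the state, so the caller is free to
split the input at arbitrary byte positions.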

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>