[Python-Dev] bytes type discussion
Stephen J. Turnbull
stephen at xemacs.org
Wed Feb 15 11:06:21 CET 2006
>>>>> "Fred" == Fred L Drake, <fdrake at acm.org> writes:
Fred> On Tuesday 14 February 2006 22:34, Greg Ewing wrote:
>> Seems to me this is a case where you want to be able to change
>> encodings in the middle of reading the stream. You start off
>> reading the data as ascii, and once you've figured out the
>> encoding, you switch to that and carry on reading.
Fred> Not quite. The proper response in this case is often to
Fred> re-start decoding with the correct encoding, since some of
Fred> the data extracted so far may have been decoded incorrectly.
Fred> A very carefully constructed application may be able to go
Fred> back and re-decode any data saved from the stream with the
Fred> previous encoding, but that seems like it would be pretty
Fred> fragile in practice.
I believe GNU Emacs is currently doing this. AIUI, they save
annotations where the codec is known to be non-invertible (e.g., two
charset-changing escape sequences in a row). I do think this is
fragile, and a robust application really should buffer everything it's
not sure of decoding correctly.
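
A minimal sketch of the "buffer it, then restart" approach, in Python.
The declaration format is an assumption (a PEP 263 style "coding:"
cookie), and DECL and decode_with_restart are hypothetical names used
only for illustration. The point is that the raw bytes are kept, and
nothing is decoded for real until the encoding is known:

```python
import re

# Assumed declaration format: a PEP 263 style cookie such as
# "# -*- coding: euc_jp -*-". Other formats would need another pattern.
DECL = re.compile(rb"coding[:=]\s*([-\w.]+)")


def decode_with_restart(chunks, default="ascii"):
    """Buffer the raw bytes, find the declared encoding, then decode once."""
    raw = b"".join(chunks)            # keep everything undecoded
    match = DECL.search(raw[:200])    # the declaration is assumed to be near the start
    encoding = match.group(1).decode("ascii") if match else default
    # Decoding starts over from byte 0, so nothing read before the
    # declaration was found is left decoded under the wrong codec.
    return encoding, raw.decode(encoding)
```

For example, decode_with_restart([b"# -*- coding: euc_jp -*-\n",
"日本語".encode("euc_jp")]) decodes the whole buffer as EUC-JP rather
than patching up text that was first read as ASCII.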
Fred> There may be cases where switching encoding on the fly makes
Fred> sense, but I'm not aware of any actual examples of where
Fred> that approach would be required.
This is exactly what ISO 2022 formalizes: switching encodings on the
fly.
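
To make that concrete, here is a small sketch using Python's standard
"iso2022_jp" codec. The encoded bytes are all 7-bit, and the escape
sequences ESC $ B and ESC ( B switch between JIS X 0208 and ASCII in
mid-stream. The helper decode_incrementally is a hypothetical name; it
only shows that a stateful incremental decoder can follow the switches
however the stream is chunked:

```python
import codecs

ESC_JISX0208 = b"\x1b$B"  # switch to JIS X 0208 (two bytes per character)
ESC_ASCII = b"\x1b(B"     # switch back to ASCII

text = "ASCII, then 日本語, then ASCII again"
data = text.encode("iso2022_jp")


def decode_incrementally(data, chunk_size=3):
    """Feed the bytes in small chunks; the decoder carries the charset state."""
    decoder = codecs.getincrementaldecoder("iso2022_jp")()
    parts = [
        decoder.decode(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]
    parts.append(decoder.decode(b"", final=True))  # flush any pending state
    return "".join(parts)
```

Both escape sequences appear in data, and decode_incrementally(data)
gives back the original text even when a chunk boundary falls inside an
escape sequence or a two-byte character.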
mboxes of Japanese mail often contain random and unsignaled encoding
changes.
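
When the change is unsignaled, about all an application can do is guess
per message. The following is only a heuristic sketch, not a robust
detector: the candidate list, its order, and the name guess_decode are
my assumptions, and strict decoding can still accept the wrong codec
for some inputs.

```python
# Assumed candidate list. Order matters: ISO-2022-JP is pure 7-bit, so
# it would also pass as UTF-8 and has to be tried first.
CANDIDATES = ("iso2022_jp", "utf-8", "euc_jp", "shift_jis")


def guess_decode(raw, candidates=CANDIDATES):
    """Return (codec_name, text) for the first codec that decodes raw strictly."""
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort (an assumption): latin-1 never fails, and encoding the
    # result back to latin-1 recovers the original bytes.
    return "latin-1", raw.decode("latin-1")
```

Calling guess_decode once per message body, rather than once per mbox,
is what lets the encoding change between messages.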
A terminal emulator may need to switch when logging in to a remote
system.
--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.