[Python-Dev] bytes type discussion

Stephen J. Turnbull stephen at xemacs.org
Wed Feb 15 11:06:21 CET 2006


>>>>> "Fred" == Fred L Drake, <fdrake at acm.org> writes:

    Fred> On Tuesday 14 February 2006 22:34, Greg Ewing wrote:

    >> Seems to me this is a case where you want to be able to change
    >> encodings in the middle of reading the stream.  You start off
    >> reading the data as ascii, and once you've figured out the
    >> encoding, you switch to that and carry on reading.

    Fred> Not quite.  The proper response in this case is often to
    Fred> re-start decoding with the correct encoding, since some of
    Fred> the data extracted so far may have been decoded incorrectly.
    Fred> A very carefully constructed application may be able to go
    Fred> back and re-decode any data saved from the stream with the
    Fred> previous encoding, but that seems like it would be pretty
    Fred> fragile in practice.

I believe GNU Emacs currently does this kind of re-decoding.  AIUI,
it saves annotations where the codec is known to be non-invertible
(e.g., two charset-changing escape sequences in a row).  I do think
this is fragile, and a robust application really should buffer
everything it's not sure of decoding correctly.
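
To make the buffering idea concrete, here is a minimal Python sketch.
The class and method names (RedecodableReader, read_more,
switch_encoding) are hypothetical and not from any existing library;
the point is only that the raw bytes are kept, so nothing decoded under
the provisional encoding has to be trusted once the real one is known.

```python
import io


class RedecodableReader:
    """Hypothetical helper: keep every raw byte read so far, so the whole
    stream can be re-decoded once the real encoding is discovered."""

    def __init__(self, raw, provisional="ascii"):
        self._raw = raw              # any binary file-like object
        self._buffer = bytearray()   # all bytes read so far, undecoded
        self.encoding = provisional

    def read_more(self, n):
        """Read up to n more bytes into the buffer; return how many arrived."""
        chunk = self._raw.read(n)
        self._buffer.extend(chunk)
        return len(chunk)

    def text(self):
        """Decode everything buffered so far with the current encoding.
        Undecodable bytes become U+FFFD rather than raising."""
        return bytes(self._buffer).decode(self.encoding, errors="replace")

    def switch_encoding(self, encoding):
        """Change the encoding; the next text() call re-decodes from byte 0."""
        self.encoding = encoding


original = "encoding: utf-8\nna\u00efve caf\u00e9\n"
reader = RedecodableReader(io.BytesIO(original.encode("utf-8")))
while reader.read_more(8):
    pass
provisional_text = reader.text()   # decoded as ASCII, contains U+FFFD
reader.switch_encoding("utf-8")
final_text = reader.text()         # re-decoded from the saved bytes
```

The cost is memory proportional to the undecided part of the stream;
a real application would presumably drop the buffer once the encoding
is settled.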

    Fred> There may be cases where switching encoding on the fly makes
    Fred> sense, but I'm not aware of any actual examples of where
    Fred> that approach would be required.

This is exactly what ISO 2022 formalizes: switching encodings on the
fly.
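
As an illustration, Python's own iso2022_jp codec shows the switching
in a single byte stream.  This is a sketch using only the standard
codecs module; the helper name decode_in_chunks is made up for the
example.  The decoder has to carry state across chunks, because the
escape sequence that selects a charset may arrive in a different chunk
from the bytes it governs.

```python
import codecs

# "abc" + three kanji + "def"; the kanji are written as escapes so the
# example does not depend on the source file's own encoding.
text = "abc\u65e5\u672c\u8a9edef"
data = text.encode("iso2022_jp")


def decode_in_chunks(data, chunk_size=1, encoding="iso2022_jp"):
    """Feed the stream to a stateful decoder a few bytes at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for i in range(0, len(data), chunk_size):
        pieces.append(decoder.decode(data[i:i + chunk_size]))
    # final=True makes the decoder complain about a truncated sequence.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


# ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII.
has_switch_to_kanji = b"\x1b$B" in data
has_switch_to_ascii = b"\x1b(B" in data
round_trip = decode_in_chunks(data, chunk_size=1)
```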

mboxes of Japanese mail often contain random and unsignaled encoding
changes.
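
One crude way to cope, sketched below in Python, is to guess per
message by trying strict decoders in a fixed order.  The candidate
list, its order, and the function name guess_and_decode are my
assumptions for illustration; this is a heuristic that can guess
wrong on short or ambiguous input, not a reliable detector.

```python
# Assumed candidate order.  iso2022_jp goes first because it is 7-bit
# and rejects any byte >= 0x80, so it cannot swallow EUC-JP or Shift_JIS.
CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis")


def guess_and_decode(raw, candidates=CANDIDATES):
    """Return (encoding, text) for the first candidate that decodes the
    bytes without error, or (None, lossy_text) if none of them does."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing fit: keep what is readable and mark the rest with U+FFFD.
    return None, raw.decode("ascii", errors="replace")


sample = "\u65e5\u672c\u8a9e"
guesses = [guess_and_decode(sample.encode(enc)) for enc in CANDIDATES]
```

Calling this once per message, rather than once per mbox, is what
lets the guess change when the encoding changes without warning.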

A terminal emulator may need to switch when logging in to a remote
system.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

