[Python-Dev] PEP 263 - default encoding

M.-A. Lemburg mal@lemburg.com
Mon, 18 Mar 2002 09:59:30 +0100


Guido van Rossum wrote:
> ...
> I think this will actually work.  Suppose someone uses KOI8-R.
> Presumably they have an editor that reads, writes and displays
> KOI8-R, and their default interpretation of Python's stdout will also
> assume KOI8-R.
> 
> Thus, if their program contains
> 
>     k = "...some KOI8-R string..."
>     print k
> 
> it will print what they want.  If they write this:
> 
>     u = unicode(k, "koi8-r")
> 
> it will also do what they want.  Currently, if they write
> 
>     u = u"...some KOI8-R string..."
> 
> it won't work, but with the PEP, in phase 1, it will do the right
> thing as long as they add a KOI8-R cookie to the file.  The treatment
> of the 8-bit string assigned to k will not change in phase 1.
> 
> But the treatment of k under phase 2 will be, um, interesting, and I'm
> not sure what it should do!!!  Since in phase 2 the entire file will
> be decoded from KOI8-R to Unicode before it's parsed, maybe the best
> thing would be to encode 8-bit string literals back using KOI8-R (in
> general, the encoding given in the encoding cookie).
> 
>     *** MAL, can you think about this? ***

In phase 2, all 8-bit string literals will be re-encoded using the
source code encoding specified in the cookie. See part 3 of the PEP's
Concepts section:

"""
    3. Python's tokenizer/compiler combo will need to be updated to
       work as follows:

       1. read the file

       2. decode it into Unicode assuming a fixed per-file encoding

       3. tokenize the Unicode content

       4. compile it, creating Unicode objects from the given Unicode data
          and creating string objects from the Unicode literal data
          by first reencoding the Unicode data into 8-bit string data
          using the given file encoding

       5. variable names and other identifiers will be reencoded into
          8-bit strings using the file encoding to assure backward
          compatibility with the existing implementation
"""

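As a rough illustration of steps 2 and 4 for a single literal, here is
a minimal sketch. It is not the actual tokenizer/compiler code: it is
written for Python 3, where bytes objects stand in for the 8-bit
strings discussed here, and the names compile_literal and
SOURCE_ENCODING are hypothetical.

```python
# Illustration only: Python 3 bytes stand in for 8-bit string objects.
SOURCE_ENCODING = "koi8-r"  # assumed value of the file's encoding cookie


def compile_literal(raw_literal, is_unicode_literal, encoding=SOURCE_ENCODING):
    """Hypothetical helper mimicking steps 2 and 4 for one literal."""
    # Step 2: decode the raw file bytes into Unicode using the
    # fixed per-file encoding.
    text = raw_literal.decode(encoding)
    # Step 4: a Unicode literal keeps the decoded text as-is; an 8-bit
    # string literal is re-encoded with the same file encoding.
    if is_unicode_literal:
        return text
    return text.encode(encoding)


# Bytes as a KOI8-R editor would have written them to disk.
raw = "привет".encode("koi8-r")

k = compile_literal(raw, is_unicode_literal=False)  # like k = "..."
u = compile_literal(raw, is_unicode_literal=True)   # like u = u"..."
```

Under these assumptions k ends up holding the same bytes that were in
the file, so printing it to a KOI8-R terminal keeps working, while u
holds the decoded Unicode text.
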
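For this to work, the source code encoding has to be round-trip
safe: decoding the file's bytes to Unicode and encoding the result
back must reproduce the original bytes exactly, i.e.
encoding->Unicode->encoding must be 1-1.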
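
One way to check this property for a given piece of data is sketched
below. This is only an illustration in Python 3 (bytes standing in for
8-bit strings); the helper name is_round_trip_safe is made up for the
example and is not part of the PEP.

```python
def is_round_trip_safe(data, encoding):
    """Return True if bytes -> Unicode -> bytes reproduces data exactly."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # Bytes that cannot be decoded (or text that cannot be encoded
        # back) cannot survive the round trip.
        return False


all_bytes = bytes(range(256))

# Latin-1 maps every byte value to exactly one code point and back.
latin1_ok = is_round_trip_safe(all_bytes, "latin-1")

# A lone 0xFF byte is not valid UTF-8, so decoding already fails.
utf8_ok = is_round_trip_safe(b"\xff", "utf-8")
```

Here latin1_ok is True and utf8_ok is False. A file whose literals
contain bytes that fail such a check could not be re-encoded back to
the original 8-bit data.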

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/