[Python-Dev] Python source code encoding

Sun, 16 Apr 2000 17:52:20 +0200

[Fredrik]:
> [MAL]:
> > > To reinforce Fredrik's point here, note that XML only supports
> > > encodings at the level of an entire file (or external entity). You
> > > can't tell an XML parser that a file is in UTF-8, except for this one
> > > element whose contents are in Latin1.
> >
> > Hmm, this would mean that someone who writes:
> >
> > """
> > #pragma script-encoding utf-8
> >
> > u = u"\u1234"
> > print u
> > """
> >
> > would suddenly see "\u1234" as output.
> 
> not necessarily.  consider this XML snippet:
> 
>     <?xml version='1.0' encoding='utf-8'?>
>     <body>&#x1234;</body>
> 
> if I run this through an XML parser and write it
> out as UTF-8, I get:
> 
> <body>á^´</body>
> 
> in other words, the parser processes "&#x" after
> decoding to unicode, not before.
> 
> I see no reason why Python cannot do the same.

Sure, and this is what I meant when I said that the compiler
has to deal with several different encodings. Unicode escape
sequences are currently handled by a special codec, the
unicode-escape codec, which reads all characters with ordinal
< 256 as-is (meaning Latin-1, since the first 256 Unicode
ordinals map to Latin-1 characters (*)). The exceptions are a few
escape sequences, which it processes much like the Python parser
does for 8-bit strings, and the new \uXXXX escape.
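
As a quick illustration of that behaviour, here is a minimal sketch. It
assumes the unicode_escape codec as shipped in current Python 3 (where
bytes plays the role of the 8-bit string), not the 1.6-era build under
discussion, so treat it as an approximation:

```python
# Bytes below 256 pass through as Latin-1; the \uXXXX escape is expanded.
raw = b"caf\xe9 \\u1234"
text = raw.decode("unicode_escape")

# Show the resulting code points one by one.
print([hex(ord(ch)) for ch in text])
```

The 0xE9 byte comes out as ordinal 0xE9 and the six-character escape
collapses into the single code point U+1234.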

Perhaps we should split this processing into two levels: first
decode the input to Unicode, then process the escapes. The escape
codecs would need some rewriting to process Unicode->Unicode
instead of 8-bit->Unicode as they do now.
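
A rough sketch of what such a two-level scheme might look like, in
current Python 3. The helper name expand_u_escapes is hypothetical, and
it only handles \uXXXX (no other escapes, no doubled backslashes), so it
is far from a real codec:

```python
import re

_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def expand_u_escapes(text):
    """Expand \\uXXXX escapes in an already-decoded Unicode string.

    Hypothetical helper; deliberately ignores all other escapes.
    """
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Level 1: decode the source bytes using the declared encoding (UTF-8 here).
decoded = b'u = u"\\u1234 \xe1\x88\xb4"'.decode("utf-8")

# Level 2: process escapes on the Unicode result (Unicode -> Unicode).
expanded = expand_u_escapes(decoded)
```

Both the escape and the raw UTF-8 sequence end up as the same code point,
which is the behaviour Fredrik describes for the XML parser.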

--

To move in the direction Fredrik is proposing, I would suggest
(for Python 1.7) to introduce a preprocessor step which gets executed
even before the tokenizer. The preprocessor step would then
translate char* input into Py_UNICODE* (using an encoding hint which
would have to appear in the first few lines of input using some special
format). The tokenizer could then work on a Py_UNICODE* buffer and
the parser would then take care of the conversion from Py_UNICODE*
back to char* for Python's 8-bit strings. The parser should shout out
loud if it sees input data outside Unicode range(256) in what is
supposed to be an 8-bit string.
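
To make the idea a bit more concrete, here is a sketch of such a
preprocessor in current Python 3 rather than C. Everything in it is an
assumption for illustration: the names preprocess and to_8bit, the
"#pragma script-encoding" hint format (borrowed from the example quoted
above), the two-line search window and the Latin-1 default:

```python
import re

# Hypothetical hint format; the real syntax is still undecided.
_HINT = re.compile(rb"#pragma\s+script-encoding\s+([-\w.]+)")

def preprocess(raw, default="latin-1"):
    """Decode raw source bytes to Unicode using a hint in the first two lines."""
    encoding = default
    for line in raw.splitlines()[:2]:
        match = _HINT.match(line.strip())
        if match:
            encoding = match.group(1).decode("ascii")
            break
    return raw.decode(encoding)

def to_8bit(literal):
    """Narrow a Unicode literal to an 8-bit string, failing loudly on data loss."""
    bad = [ch for ch in literal if ord(ch) > 255]
    if bad:
        raise ValueError("character outside range(256) in 8-bit string: %r" % bad[0])
    return literal.encode("latin-1")

# The whole file is decoded first; the tokenizer would then see Unicode only.
source = preprocess(b'#pragma script-encoding utf-8\ns = "caf\xc3\xa9"\n')
```

Calling to_8bit on a literal containing U+1234 raises ValueError, which
stands in for the "shout out loud" step.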

To make this fully functional we would have to change the 8-bit
string to Unicode coercion mechanism, though. It would have to 
make a Latin-1 assumption instead of the current UTF-8 assumption.
In contrast to the current scheme, this assumption would be correct
for all constant strings appearing in source code given the above
preprocessor logic. For strings constructed from file or user input
the programmer would have to ensure proper encoding or do the
Unicode conversion himself.
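
The difference between the two assumptions can be shown with a small
sketch in current Python 3, where bytes stands in for the 8-bit string
and the decode calls stand in for the implicit coercion:

```python
data = b"caf\xe9"   # an 8-bit string holding Latin-1 data

# Latin-1 assumption: every byte maps to the code point with the same ordinal.
as_latin1 = data.decode("latin-1")
assert [ord(ch) for ch in as_latin1] == list(data)

# UTF-8 assumption: a lone 0xE9 byte is not valid UTF-8, so coercion fails.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

With the Latin-1 assumption the coercion can never fail, since all 256
byte values have a matching Unicode ordinal.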

Sidenote:
The UTF-8->Latin-1 change would probably also have to be propagated
to all other Unicode in/output logic -- perhaps Latin-1 is the better
default encoding after all...

A programmer could then write a Python script completely in UTF-8,
UTF-16 or Shift-JIS and the above logic would convert the input
data to Unicode or Latin-1 (which is 8-bit Unicode) as appropriate
and it would warn about impossible conversions to Latin-1 in the
compile step. The programmer would still have to make sure that file
and user input gets converted using the proper encoding, but this 
can easily be done using the stream wrappers in the standard
codecs module.
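
For example, a minimal sketch using codecs.getreader in current Python 3;
the in-memory stream is just a stand-in for a real file or stdin that
delivers Shift-JIS bytes:

```python
import codecs
import io

# Stand-in for a file or stdin delivering Shift-JIS encoded bytes.
raw_stream = io.BytesIO("\u65e5\u672c\u8a9e\n".encode("shift_jis"))

# Wrap the byte stream so that reads return Unicode.
reader = codecs.getreader("shift_jis")(raw_stream)
line = reader.readline()
```

The encoding used here is chosen by the programmer at run time and is
independent of whatever encoding the script itself was written in.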

Note that in this discussion we need to be very careful not
to mix up the encodings used for source code with the ones used
when reading/writing files or other streams (including
stdin/stdout).

BTW, to experiment with all this you can use the codecs.EncodedFile
stream wrapper. It allows specifying both the data side and the
stream side encoding, e.g. you can wrap a UTF-8 stdin stream in a
file object that returns Latin-1, which can then be used as the
source of data input.
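
A small sketch of that, written against current Python 3; the BytesIO
object is only a stand-in for a real UTF-8 stdin stream:

```python
import codecs
import io

# Stand-in for a stdin stream that delivers UTF-8 bytes.
utf8_stream = io.BytesIO("caf\u00e9\n".encode("utf-8"))

# data_encoding is what the caller sees, file_encoding is what the stream holds.
latin1_view = codecs.EncodedFile(utf8_stream,
                                 data_encoding="latin-1",
                                 file_encoding="utf-8")
data = latin1_view.read()
```

Reading from latin1_view yields the Latin-1 bytes, so the two-byte UTF-8
sequence for U+00E9 arrives as the single byte 0xE9.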

(*) The conversion from Unicode to Latin-1 is similar to converting
    a 2-byte unsigned short to an unsigned byte with some extra logic
    to catch data loss. Latin-1 is comparable to 8-bit Unicode... 
    this is where all this talk about Latin-1 originates from :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/