[Fredrik]:
[MAL]:
To reinforce Fredrik's point here, note that XML only supports encodings at the level of an entire file (or external entity). You can't tell an XML parser that a file is in UTF-8, except for this one element whose contents are in Latin1.
Hmm, this would mean that someone who writes:
""" #pragma script-encoding utf-8
u = u"\u1234" print u """
would suddenly see "\u1234" as output.
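The two levels can be pulled apart with today's codecs machinery (a Python 3 sketch of the ordering, not the 1.6-era implementation; the byte strings and codec names are only illustrative):

```python
# Step 1: decode the raw source bytes using the declared script encoding.
# The six characters of the escape sequence survive this step untouched.
src = b'\\u1234'                  # the literal characters \ u 1 2 3 4
text = src.decode('utf-8')
assert text == '\\u1234'          # still an escape, not the character

# Step 2: only the string-literal pass resolves the escape.
value = text.encode('latin-1').decode('unicode-escape')
assert value == '\u1234'          # now the single character U+1234
```

So as long as escape processing happens in the second step, decoding the source file first does not change what "\u1234" means.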
not necessarily. consider this XML snippet:
<?xml version='1.0' encoding='utf-8'?>
<body>&#4660;</body>
if I run this through an XML parser and write it out as UTF-8, I get:
<body>á^´</body>
in other words, the parser processes the character reference after decoding to unicode, not before.
I see no reason why Python cannot do the same.
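The round trip above can be reproduced with the stdlib XML parser (a Python 3 sketch; the parser only sees the character after the UTF-8 decode):

```python
import xml.etree.ElementTree as ET

# A UTF-8 encoded document whose body is the single character U+1234.
doc = b"<?xml version='1.0' encoding='utf-8'?><body>\xe1\x88\xb4</body>"

root = ET.fromstring(doc)
assert root.text == '\u1234'      # processed after decoding to unicode

# Serializing back to UTF-8 reproduces the same three bytes --
# which is exactly what looks like mojibake when viewed as Latin-1.
out = ET.tostring(root, encoding='utf-8')
assert b'\xe1\x88\xb4' in out
```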
Sure, and this is what I meant when I said that the compiler has to deal with several different encodings. Unicode escape sequences are currently handled by a special codec, the unicode-escape codec, which reads all characters with ordinal < 256 as-is (meaning Latin-1, since the first 256 Unicode ordinals map to Latin-1 characters (*)), except for a few escape sequences which it processes much like the Python parser does for 8-bit strings, plus the new \uXXXX escape.

Perhaps we should make this processing use two levels... the escape codecs would need some rewriting to process Unicode -> Unicode instead of 8-bit -> Unicode as they do now.

--

To move along the method Fredrik is proposing, I would suggest (for Python 1.7) introducing a preprocessor step which gets executed even before the tokenizer. The preprocessor step would translate the char* input into Py_UNICODE* (using an encoding hint which would have to appear in the first few lines of input, in some special format). The tokenizer could then work on the Py_UNICODE* buffer, and the parser would then take care of the conversion from Py_UNICODE* back to char* for Python's 8-bit strings. It should shout out loud in case it sees input data outside Unicode range(256) in what is supposed to be an 8-bit string.

To make this fully functional we would have to change the 8-bit string to Unicode coercion mechanism, though. It would have to make a Latin-1 assumption instead of the current UTF-8 assumption. In contrast to the current scheme, this assumption would be correct for all constant strings appearing in source code, given the above preprocessor logic. For strings constructed from file or user input, the programmer would have to assure proper encoding or do the Unicode conversion himself.

Sidenote: the UTF-8 -> Latin-1 change would probably also have to be propagated to all other Unicode in/output logic -- perhaps Latin-1 is the better default encoding after all...
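The Latin-1 assumption of the unicode-escape codec is easy to observe in a modern interpreter (a Python 3 sketch of the behaviour described above; the byte string is only an example):

```python
# Bytes with ordinal < 256 pass through as-is, i.e. as Latin-1;
# \uXXXX escapes are resolved to the corresponding code point.
raw = b'caf\xe9 \\u1234'
s = raw.decode('unicode-escape')
assert s == 'caf\xe9 \u1234'      # 0xE9 -> 'é' (Latin-1), escape -> U+1234
```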
A programmer could then write a Python script completely in UTF-8, UTF-16 or Shift-JIS, and the above logic would convert the input data to Unicode or Latin-1 (which is 8-bit Unicode) as appropriate, warning about impossible conversions to Latin-1 in the compile step. The programmer would still have to make sure that file and user input gets converted using the proper encoding, but this can easily be done using the stream wrappers in the standard codecs module.

Note that in this discussion we need to be very careful not to mix up the encodings used for source code with the ones used when reading/writing files or other streams (including stdin/stdout).

BTW, to experiment with all this you can use the codecs.EncodedFile stream wrapper. It allows specifying both data and stream side encodings, e.g. you can redirect a UTF-8 stdin stream to a Latin-1 returning file object which can then be used as a source of data input.

(*) The conversion from Unicode to Latin-1 is similar to converting a 2-byte unsigned short to an unsigned byte, with some extra logic to catch data loss. Latin-1 is comparable to 8-bit Unicode... this is where all this talk about Latin-1 originates from :-)

--
Marc-Andre Lemburg
______________________________________________________________________
Business:   http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/
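A minimal sketch of the EncodedFile wrapper as it still exists in today's codecs module, using an in-memory stream in place of stdin:

```python
import codecs
import io

# A stand-in for stdin whose bytes are UTF-8 encoded.
stream = io.BytesIO('h\xe9llo'.encode('utf-8'))

# Present it as a Latin-1 byte stream: on reads, bytes are decoded
# using the file-side encoding (utf-8) and re-encoded using the
# data-side encoding (latin-1).
wrapped = codecs.EncodedFile(stream, data_encoding='latin-1',
                             file_encoding='utf-8')

data = wrapped.read()
assert data == b'h\xe9llo'        # the Latin-1 bytes for 'héllo'
```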