[Python-checkins] r71852 - peps/trunk/pep-0383.txt
martin.v.loewis
python-checkins at python.org
Fri Apr 24 22:25:20 CEST 2009
Author: martin.v.loewis
Date: Fri Apr 24 22:25:20 2009
New Revision: 71852
Log:
Accept Lino Mastrodomenico's proposal of always using
low surrogates to represent non-decodable bytes.
Modified:
peps/trunk/pep-0383.txt
Modified: peps/trunk/pep-0383.txt
==============================================================================
--- peps/trunk/pep-0383.txt (original)
+++ peps/trunk/pep-0383.txt Fri Apr 24 22:25:20 2009
@@ -62,25 +62,23 @@
environmental data to Python str objects.
On POSIX systems, Python currently applies the locale's encoding to
-convert the byte data to Unicode. If the locale's encoding is UTF-8,
-it can represent the full set of Unicode characters, otherwise, only a
-subset is representable. In the latter case, using private-use
-characters to represent these bytes would be an option. For UTF-8,
-doing so would create an ambiguity, as the private-use characters may
-regularly occur in the input also.
+convert the byte data to Unicode. Non-decodable bytes will be
+represented as lone half surrogate codes U+DCxx.
To convert non-decodable bytes, a new error handler "python-escape" is
-introduced, which decodes non-decodable bytes using into a private-use
-character U+F01xx, which is believed to not conflict with private-use
-characters that currently exist in Python codecs.
+introduced, which produces these half surrogates. On encoding, the
+error handler converts the half surrogate back to the corresponding
+byte.
The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again.
If the locale's encoding is UTF-8, the file system encoding is set to
-a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
-(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
+a new encoding "utf-8b", as the regular UTF-8 codec would not
+re-encode half surrogates as single bytes. The UTF-8b codec decodes
+non-decodable bytes (which must be >= 0x80) into half surrogate codes
+U+DC80..U+DCFF.
Discussion
==========
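
The round trip described in the hunk above can be sketched in Python.
This assumes a released Python 3 interpreter, where the handler that
this revision calls "python-escape" is available under the name
"surrogateescape". The sample bytes and the demo_escape handler are
hypothetical; the handler only illustrates an encode error handler
that returns a byte string directly, as the extended interface allows.

```python
import codecs

# Sample input that is not valid UTF-8: 0xff and 0x80 cannot start a sequence.
raw = b"caf\xff\x80.txt"

# "surrogateescape" is the released name of the handler; the PEP text in
# this revision still calls it "python-escape".
text = raw.decode("utf-8", errors="surrogateescape")

# Each non-decodable byte 0xNN is represented as the lone surrogate U+DCNN.
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler converts the surrogates back to bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")


def demo_escape(exc):
    """Hypothetical encode-only handler that returns bytes, not str."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if not 0xDC80 <= code <= 0xDCFF:
            raise exc  # not an escaped byte, so keep the original error
        out.append(code - 0xDC00)
    # Returning bytes lets the codec copy them to the output unchanged.
    return bytes(out), exc.end


codecs.register_error("demo-escape", demo_escape)
via_custom = text.encode("utf-8", errors="demo-escape")
```

Here round_tripped and via_custom both equal raw, while encoding text
with the default strict handler would raise UnicodeEncodeError because
of the lone surrogates.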