[Python-checkins] r71852 - peps/trunk/pep-0383.txt

Fri Apr 24 22:25:20 CEST 2009

Author: martin.v.loewis
Date: Fri Apr 24 22:25:20 2009
New Revision: 71852

Log:
Accept Lino Mastrodomenico's proposal of always using
low surrogates to represent non-decodable bytes.


Modified:
   peps/trunk/pep-0383.txt

Modified: peps/trunk/pep-0383.txt
==============================================================================

--- peps/trunk/pep-0383.txt	(original)
+++ peps/trunk/pep-0383.txt	Fri Apr 24 22:25:20 2009
@@ -62,25 +62,23 @@
 environmental data to Python str objects.
 
 On POSIX systems, Python currently applies the locale's encoding to
-convert the byte data to Unicode. If the locale's encoding is UTF-8,
-it can represent the full set of Unicode characters, otherwise, only a
-subset is representable. In the latter case, using private-use
-characters to represent these bytes would be an option. For UTF-8,
-doing so would create an ambiguity, as the private-use characters may
-regularly occur in the input also.
+convert the byte data to Unicode. Non-decodable bytes will be
+represented as lone half surrogate codes U+DCxx.
 
 To convert non-decodable bytes, a new error handler "python-escape" is
-introduced, which decodes non-decodable bytes using into a private-use
-character U+F01xx, which is believed to not conflict with private-use
-characters that currently exist in Python codecs.
+introduced, which produces these half surrogates. On encoding, the
+error handler converts the half surrogate back to the corresponding
+byte.
 
 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.
 
 If the locale's encoding is UTF-8, the file system encoding is set to
-a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
-(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
+a new encoding "utf-8b", as the regular UTF-8 codec would not
+re-encode half surrogates as single bytes. The UTF-8b codec decodes
+non-decodable bytes (which must be >= 0x80) into half surrogate codes
+U+DC80..U+DCFF.
 
 Discussion
 ==========