[Python-checkins] r71793 - peps/trunk/pep-0383.txt

Wed Apr 22 08:42:07 CEST 2009

Author: martin.v.loewis
Date: Wed Apr 22 08:42:06 2009
New Revision: 71793

Log:
Add PEP 383.


Added:
   peps/trunk/pep-0383.txt   (contents, props changed)

Added: peps/trunk/pep-0383.txt
==============================================================================

--- (empty file)
+++ peps/trunk/pep-0383.txt	Wed Apr 22 08:42:06 2009
@@ -0,0 +1,118 @@
+PEP: 383
+Title: Non-decodable Bytes in System Character Interfaces
+Version: $Revision$
+Last-Modified: $Date$
+Author: Martin v. Löwis <martin at v.loewis.de>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 22-Apr-2009
+Python-Version: 3.1
+Post-History:
+
+Abstract
+========
+
+File names, environment variables, and command line arguments are
+defined as being character data in POSIX; the C APIs, however, allow
+passing arbitrary bytes - whether these conform to a certain encoding
+or not. This PEP proposes a means of dealing with such irregularities
+by embedding the bytes in character strings in such a way that allows
+recreation of the original byte string.
+
+Rationale
+=========
+
+The C char type is a data type that is commonly used to represent both
+character data and bytes. Certain POSIX interfaces are specified and
+widely understood as operating on character data; however, the system
+call interfaces make no assumption about the encoding of these data, and
+pass them on as-is. With Python 3, character strings use a
+Unicode-based internal representation, making it difficult to ignore
+the encoding of byte strings in the same way that the C interfaces can
+ignore the encoding.
+
+On the other hand, Microsoft Windows NT has corrected the original
+design limitation of Unix, and made it explicit in its system
+interfaces that these data (file names, environment variables, command
+line arguments) are indeed character data, by providing a
+Unicode-based API (keeping a C-char-based one for backwards
+compatibility).
+
+For Python 3, one proposed solution is to provide two sets of APIs: a
+byte-oriented one, and a character-oriented one, where the
+character-oriented one would be limited to not being able to represent
+all data accurately. Unfortunately, for Windows, the situation would
+be exactly the opposite: the byte-oriented interface cannot represent
+all data; only the character-oriented API can. As a consequence,
+libraries and applications that want to support all user data in a
+cross-platform manner have to accept a mish-mash of bytes and
+characters exactly in the way that caused endless troubles for
+Python 2.x.
+
+With this PEP, a uniform treatment of these data as characters becomes
+possible. The uniformity is achieved by using specific encoding
+algorithms, meaning that the data can be converted back to bytes on
+POSIX systems only if the same encoding is used.
+
+Specification
+=============
+
+On Windows, Python uses the wide character APIs to access
+character-oriented APIs, allowing direct conversion of the
+environmental data to Python str objects.
+
+On POSIX systems, Python currently applies the locale's encoding to
+convert the byte data to Unicode. If the locale's encoding is UTF-8,
+it can represent the full set of Unicode characters; otherwise, only a
+subset is representable. In the latter case, using private-use
+characters to represent these bytes would be an option. For UTF-8,
+doing so would create an ambiguity, as the private-use characters may
+regularly occur in the input also.
+
+To convert non-decodable bytes, a new error handler "python-escape" is
+introduced, which decodes each non-decodable byte into a private-use
+character U+F01xx, which is believed to not conflict with private-use
+characters that currently exist in Python codecs.
+
+The error handler interface is extended to allow the encode error
+handler to return byte strings immediately, in addition to returning
+Unicode strings which then get encoded again.
+
+If the locale's encoding is UTF-8, the file system encoding is set to
+a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
+(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
+
+Discussion
+==========
+
+While providing a uniform API to non-decodable bytes, this interface
+has the limitation that the chosen representation only "works" if the
+data get converted back to bytes with the python-escape error handler
+also. Encoding the data with the locale's encoding and the (default)
+strict error handler will raise an exception; encoding them with UTF-8
+will produce nonsensical data.
+
+For most applications, we assume that they eventually pass data
+received from a system interface back into the same system
+interfaces. For example, an application invoking os.listdir() will
+likely pass the result strings back into APIs like os.stat() or
+open(), which then encode them back into their original byte
+representation. Applications that need to process the original byte
+strings can obtain them by encoding the character strings with the
+file system encoding, passing "python-escape" as the error handler
+name.
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End:
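
The round trip described in the Specification and Discussion sections
can be sketched as follows. This is only an illustration under a stated
assumption: the names "python-escape" and "utf-8b" are the ones used in
this draft and are not used here. The sketch instead relies on the
"surrogateescape" error handler available in released Python 3, which
maps undecodable bytes >= 0x80 to U+DC80..U+DCFF in the way the draft
describes for UTF-8b. The file name is a made-up example.

```python
# Sketch only: uses the "surrogateescape" handler from released Python 3
# as a stand-in for the draft's "python-escape" / "utf-8b" names.

raw = b"caf\xe9.txt"  # hypothetical Latin-1 file name; 0xE9 is not valid UTF-8 here

# Decoding: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# Encoding with the same handler recreates the original byte string.
restored = name.encode("utf-8", errors="surrogateescape")

# Encoding with the default strict handler raises an exception instead.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The point of the sketch is the symmetry: the bytes survive only when
the same codec and error handler are used in both directions.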