[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Wed Apr 22 13:48:04 CEST 2009

Martin v. Löwis wrote:

> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.
> 
> Regards,
> Martin
> 
> PEP: 383
> Title: Non-decodable Bytes in System Character Interfaces
> Version: $Revision: 71793 $
> Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
> Author: Martin v. Löwis <martin at v.loewis.de>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 22-Apr-2009
> Python-Version: 3.1
> Post-History:
> 
> Abstract
> ========
> 
> File names, environment variables, and command line arguments are
> defined as being character data in POSIX; the C APIs however allow
> passing arbitrary bytes - whether these conform to a certain encoding
> or not. This PEP proposes a means of dealing with such irregularities
> by embedding the bytes in character strings in such a way that allows
> recreation of the original byte string.
> 
> Rationale
> =========
> 
> The C char type is a data type that is commonly used to represent both
> character data and bytes. Certain POSIX interfaces are specified and
> widely understood as operating on character data, however, the system
> call interfaces make no assumption on the encoding of these data, and
> pass them on as-is. With Python 3, character strings use a
> Unicode-based internal representation, making it difficult to ignore
> the encoding of byte strings in the same way that the C interfaces can
> ignore the encoding.
> 
> On the other hand, Microsoft Windows NT has correct the original

"correct" -> "corrected"

> design limitation of Unix, and made it explicit in its system
> interfaces that these data (file names, environment variables, command
> line arguments) are indeed character data, by providing a
> Unicode-based API (keeping a C-char-based one for backwards
> compatibility).
> 
> [...]
> 
> Specification
> =============
> 
> On Windows, Python uses the wide character APIs to access
> character-oriented APIs, allowing direct conversion of the
> environmental data to Python str objects.
> 
> On POSIX systems, Python currently applies the locale's encoding to
> convert the byte data to Unicode. If the locale's encoding is UTF-8,
> it can represent the full set of Unicode characters, otherwise, only a
> subset is representable. In the latter case, using private-use
> characters to represent these bytes would be an option. For UTF-8,
> doing so would create an ambiguity, as the private-use characters may
> regularly occur in the input also.
> 
> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.

Would this mean that real private-use characters in the file name would
raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
any error handler.
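
To make the question concrete, here is a minimal Python sketch. The
handler name "demo-recording" and the sample code point U+F0141 are my
own assumptions, not part of the PEP. It registers an error handler that
records every call, then decodes well-formed UTF-8 that contains a
private-use character:

```python
import codecs

calls = []  # every byte sequence the error handler is asked about

def recording_handler(exc):
    # Record the offending bytes, substitute U+FFFD, and resume after them.
    calls.append(exc.object[exc.start:exc.end])
    return ("\ufffd", exc.end)

# "demo-recording" is a made-up name used only for this illustration.
codecs.register_error("demo-recording", recording_handler)

private_use = "\U000f0141"          # sample code point in the U+F01xx range
raw = private_use.encode("utf-8")   # well-formed UTF-8 bytes

# Well-formed input: the codec decodes it itself, the handler is not called.
decoded = raw.decode("utf-8", "demo-recording")
calls_after_valid_input = list(calls)

# Malformed input: only now does the handler see anything.
bad = b"abc\xff".decode("utf-8", "demo-recording")
```

The handler only ever sees the malformed byte. The genuine private-use
character is decoded by the codec itself, so an error handler alone
cannot reject it.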

> The error handler interface is extended to allow the encode error
> handler to return byte strings immediately, in addition to returning
> Unicode strings which then get encoded again.

Then the error callback for encoding would become specific to the target
encoding. Would this mean that the handler checks which encoding is used
and behaves like "strict" if it doesn't recognize the encoding?
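
A rough sketch of what such an encoding-aware handler might look like,
assuming the extended interface lets an encode handler return bytes. The
handler name, the allow-list, and the use of U+DC80..U+DCFF as the
escaped range are assumptions chosen for illustration:

```python
import codecs

# Hypothetical allow-list: target encodings into which a raw byte may be
# copied unchanged. Chosen for illustration only.
BYTE_TRANSPARENT = {"ascii"}

def demo_escape_encode(exc):
    """Map U+DC80..U+DCFF back to bytes 0x80..0xFF, else act like 'strict'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    if exc.encoding.lower() not in BYTE_TRANSPARENT:
        raise exc  # unrecognized target encoding: behave like "strict"
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if not 0xDC80 <= code <= 0xDCFF:
            raise exc  # an ordinary unencodable character, not an escaped byte
        out.append(code - 0xDC00)
    # Return bytes directly instead of a str that would be encoded again.
    return (bytes(out), exc.end)

# "demo-escape-encode" is a made-up name used only for this illustration.
codecs.register_error("demo-escape-encode", demo_escape_encode)

encoded = "a\udcffb".encode("ascii", "demo-escape-encode")
```

With an encoding outside the allow-list, for example
"a\udcffb".encode("latin-1", "demo-escape-encode"), the handler re-raises
the UnicodeEncodeError, which is the "strict" fallback I am asking about.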

> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Is this done by the codec, or the error handler? If it's done by the
codec I don't see a reason for the "python-escape" error handler.
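
For comparison, the decoding half of that mapping can be written as a
plain error handler on top of the existing UTF-8 codec. This is only a
sketch under my own assumptions (the handler name "demo-half-surrogate"
is made up), not the implementation the PEP has in mind:

```python
import codecs

def demo_half_surrogate_decode(exc):
    """Map each non-decodable byte 0x80..0xFF to U+DC80..U+DCFF."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    if any(b < 0x80 for b in bad):
        raise exc  # ASCII bytes are never escaped
    replacement = "".join(chr(0xDC00 + b) for b in bad)
    return (replacement, exc.end)

# "demo-half-surrogate" is a made-up name used only for this illustration.
codecs.register_error("demo-half-surrogate", demo_half_surrogate_decode)

# 0xFF can never appear in well-formed UTF-8, so it reaches the handler.
text = b"a\xffb".decode("utf-8", "demo-half-surrogate")

# Well-formed UTF-8 is unaffected by the handler.
plain = b"caf\xc3\xa9".decode("utf-8", "demo-half-surrogate")
```

If this is sufficient, the separate "utf-8b" codec and the
"python-escape" handler look like two spellings of the same thing.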

> Discussion
> ==========
> 
> While providing a uniform API to non-decodable bytes, this interface
> has the limitation that the chosen representation only "works" if the data
> get converted back to bytes with the python-escape error handler
> also.

I thought the error handler would be used for decoding.

> Encoding the data with the locale's encoding and the (default)
> strict error handler will raise an exception, encoding them with UTF-8
> will produce non-sensical data.
> 
> For most applications, we assume that they eventually pass data
> received from a system interface back into the same system
> interfaces. For example, and application invoking os.listdir() will

"and" -> "an"

> likely pass the result strings back into APIs like os.stat() or
> open(), which then encodes them back into their original byte
> representation. Applications that need to process the original byte
> strings can obtain them by encoding the character strings with the
> file system encoding, passing "python-escape" as the error handler
> name.
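
The round trip described here can be sketched as follows. Two
assumptions: the file system encoding is UTF-8, and the handler is
spelled "surrogateescape", the name current Python 3 releases use,
rather than the draft's "python-escape":

```python
# A file name as the OS might hand it over: Latin-1 bytes, not valid UTF-8.
raw = b"caf\xe9.txt"

# Decoding embeds the undecodable byte 0xE9 as the half surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler recreates the original byte string.
round_trip = name.encode("utf-8", "surrogateescape")

# Encoding with the default strict handler fails instead.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

So the same handler name has to be passed in both directions, for
decoding and for encoding, for the original bytes to survive.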

Servus,
   Walter