[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
agbauer at gmail.com
Mon Apr 27 02:59:54 CEST 2009
How about another str-like type, a sequence of char-or-bytes? Could be
called strbytes or stringwithinvalidcharacters. It would support
whatever subset of str functionality makes sense / is easy to
implement plus a to_escaped_str() method (that does the escaping the
PEP talks about) for people who want to use regexes or other str-only
Here is a description by example:
os.listdir('.') -> [strbytes('normal_file'), strbytes('bad', 128, 'file')]
strbytes('a') -> strbytes('a')
strbytes('bad', 128, 'file') -> strbytes(128)
strbytes('bad', 128, 'file').to_escaped_str() -> 'bad?128file'
Having a separate type is cleaner than a "str that isn't exactly what
it represents". And making the escaping an explicit (but
rarely-needed) step would be less surprising for users. Anyway, I
don't know a whole lot about this issue so there may an obvious reason
this is a bad idea.
On Wed, Apr 22, 2009 at 6:50 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.
> PEP: 383
> Title: Non-decodable Bytes in System Character Interfaces
> Version: $Revision: 71793 $
> Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
> Author: Martin v. Löwis <martin at v.loewis.de>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 22-Apr-2009
> Python-Version: 3.1
> File names, environment variables, and command line arguments are
> defined as being character data in POSIX; the C APIs however allow
> passing arbitrary bytes - whether these conform to a certain encoding
> or not. This PEP proposes a means of dealing with such irregularities
> by embedding the bytes in character strings in such a way that allows
> recreation of the original byte string.
> The C char type is a data type that is commonly used to represent both
> character data and bytes. Certain POSIX interfaces are specified and
> widely understood as operating on character data, however, the system
> call interfaces make no assumption on the encoding of these data, and
> pass them on as-is. With Python 3, character strings use a
> Unicode-based internal representation, making it difficult to ignore
> the encoding of byte strings in the same way that the C interfaces can
> ignore the encoding.
> On the other hand, Microsoft Windows NT has correct the original
> design limitation of Unix, and made it explicit in its system
> interfaces that these data (file names, environment variables, command
> line arguments) are indeed character data, by providing a
> Unicode-based API (keeping a C-char-based one for backwards
> For Python 3, one proposed solution is to provide two sets of APIs: a
> byte-oriented one, and a character-oriented one, where the
> character-oriented one would be limited to not being able to represent
> all data accurately. Unfortunately, for Windows, the situation would
> be exactly the opposite: the byte-oriented interface cannot represent
> all data; only the character-oriented API can. As a consequence,
> libraries and applications that want to support all user data in a
> cross-platform manner have to accept mish-mash of bytes and characters
> exactly in the way that caused endless troubles for Python 2.x.
> With this PEP, a uniform treatment of these data as characters becomes
> possible. The uniformity is achieved by using specific encoding
> algorithms, meaning that the data can be converted back to bytes on
> POSIX systems only if the same encoding is used.
> On Windows, Python uses the wide character APIs to access
> character-oriented APIs, allowing direct conversion of the
> environmental data to Python str objects.
> On POSIX systems, Python currently applies the locale's encoding to
> convert the byte data to Unicode. If the locale's encoding is UTF-8,
> it can represent the full set of Unicode characters, otherwise, only a
> subset is representable. In the latter case, using private-use
> characters to represent these bytes would be an option. For UTF-8,
> doing so would create an ambiguity, as the private-use characters may
> regularly occur in the input also.
> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes using into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.
> The error handler interface is extended to allow the encode error
> handler to return byte strings immediately, in addition to returning
> Unicode strings which then get encoded again.
> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
> While providing a uniform API to non-decodable bytes, this interface
> has the limitation that chosen representation only "works" if the data
> get converted back to bytes with the python-escape error handler
> also. Encoding the data with the locale's encoding and the (default)
> strict error handler will raise an exception, encoding them with UTF-8
> will produce non-sensical data.
> For most applications, we assume that they eventually pass data
> received from a system interface back into the same system
> interfaces. For example, and application invoking os.listdir() will
> likely pass the result strings back into APIs like os.stat() or
> open(), which then encodes them back into their original byte
> representation. Applications that need to process the original byte
> strings can obtain them by encoding the character strings with the
> file system encoding, passing "python-escape" as the error handler
> This document has been placed in the public domain.
> Python-Dev mailing list
> Python-Dev at python.org
> Unsubscribe: http://mail.python.org/mailman/options/python-dev/agbauer%40gmail.com
More information about the Python-Dev