[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Fri Apr 24 09:59:03 CEST 2009

On Wed, Apr 22, 2009 at 8:50 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> For Python 3, one proposed solution is to provide two sets of APIs: a
> byte-oriented one, and a character-oriented one, where the
> character-oriented one would be limited to not being able to represent
> all data accurately. Unfortunately, for Windows, the situation would
> be exactly the opposite: the byte-oriented interface cannot represent
> all data; only the character-oriented API can. As a consequence,
> libraries and applications that want to support all user data in a
> cross-platform manner have to accept mish-mash of bytes and characters
> exactly in the way that caused endless troubles for Python 2.x.

Is the second part of this (that on Windows the byte-oriented
interface cannot represent all data) actually true? My understanding
may be flawed, but surely all Unicode data can be converted to and from
bytes using UTF-8? Obviously not all byte sequences are valid UTF-8,
but this doesn't prevent one from creating an arbitrary Unicode string
using b"utf-8 bytes".decode("utf-8").  Given this, can't people who
must have access to all files / environment data just use the bytes
interface?
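
To make the asymmetry I have in mind concrete, here is a minimal sketch,
assuming Python 3 str/bytes semantics. The sample strings and variable
names are purely illustrative and are not taken from the PEP.

```python
# Any ordinary str survives a round trip through UTF-8 unchanged
# (lone surrogates are the exception: strict UTF-8 refuses to encode them).
text = "L\u00f6wis \u2603 \U0001F40D"
round_tripped = text.encode("utf-8").decode("utf-8")
assert round_tripped == text

# The reverse direction is not guaranteed: arbitrary bytes need not be
# valid UTF-8. Here "café" is encoded as Latin-1, as a file name on a
# POSIX system might be.
raw = b"caf\xe9"
try:
    raw.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
```

So the bytes side can carry any Unicode string, while the character side
cannot carry every byte string without some extra convention.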

Disclosure: My gut reaction is that the solution described in the PEP
is a hack, but I'm hardly a character encoding expert.  My feeling is
that the correct solution is to either standardise on the bytes
interface as the lowest common denominator, or to add a Path type (and
I guess an EnvironmentalData type) and use the new type to attempt to
hide the differences.
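
As a rough illustration of the second option, the sketch below shows
what such a type might look like. Everything in it is hypothetical: the
class name Path, its methods, and the choice to keep bytes on POSIX and
str on Windows are my assumptions, not an existing or proposed API.

```python
class Path:
    """Hypothetical wrapper that keeps a file name in its native form."""

    def __init__(self, native):
        if not isinstance(native, (bytes, str)):
            raise TypeError("native must be bytes or str")
        # Assumption: bytes as returned by a POSIX system,
        # str as returned by the Windows wide-character API.
        self._native = native

    def native(self):
        # The exact value to hand back to the operating system.
        return self._native

    def display(self, encoding="utf-8"):
        # A lossy, human-readable form; undecodable bytes become U+FFFD.
        if isinstance(self._native, str):
            return self._native
        return self._native.decode(encoding, errors="replace")


p = Path(b"caf\xe9.txt")
shown = p.display()
```

The point of the design is that the lossy conversion happens only for
display, while native() always returns exactly what the system supplied.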

Schiavo
Simon