[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
"Martin v. Löwis"
martin at v.loewis.de
Wed Apr 22 21:24:54 CEST 2009
> Martin v. Löwis wrote:
>> To convert non-decodable bytes, a new error handler "python-escape" is
>> introduced, which decodes non-decodable bytes into a private-use
>> character U+F01xx, which is believed to not conflict with private-use
>> characters that currently exist in Python codecs.
>> The error handler interface is extended to allow the encode error
>> handler to return byte strings immediately, in addition to returning
>> Unicode strings which then get encoded again.
>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
> If the byte stream happens to include a sequence which decodes to
> U+F01xx, shouldn't that raise an exception?
I apparently have not expressed it clearly, so please help me improve
the text. What I mean is this:
- if the environment encoding (for lack of a better name) is UTF-8,
  Python stops using the utf-8 codec under this PEP, and switches
  to the utf-8b codec.
- otherwise (env encoding is not UTF-8), undecodable bytes get decoded
  with the error handler. In this case, U+F01xx will not occur
  in the byte stream, since no other codec ever produces this PUA
  character (this is not fully true: UTF-16 may also produce PUA
  characters, but such encodings can't appear as env encodings).
So the case you are referring to should not happen.
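To make the two cases concrete, here is a rough sketch, not the PEP's implementation. For the UTF-8 case it uses the "surrogateescape" error handler that released Python 3 provides (standing in for the "utf-8b" codec named in this draft). For the other case it registers a hypothetical handler, "pua-escape-sketch", that approximates the draft's "python-escape" idea. The handler name, the PUA_BASE constant, and the choice of ASCII as the example non-UTF-8 encoding are assumptions made for illustration only.

```python
import codecs

PUA_BASE = 0xF0100  # start of the U+F01xx range mentioned in the draft


def pua_escape(exc):
    """Hypothetical stand-in for the draft's "python-escape" handler."""
    if isinstance(exc, UnicodeDecodeError):
        # Map each undecodable byte 0xNN to the private-use character U+F01NN.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Map U+F01NN back to the raw byte 0xNN. Returning bytes tells the
        # encoder to copy them to the output without encoding them again.
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            offset = ord(ch) - PUA_BASE
            if not 0 <= offset <= 0xFF:
                raise exc  # a genuinely unencodable character
            out.append(offset)
        return bytes(out), exc.end
    raise exc


codecs.register_error("pua-escape-sketch", pua_escape)

# Case 1: env encoding is UTF-8. A valid UTF-8 sequence for U+F01FF decodes
# to that character itself, while a stray byte becomes a lone surrogate,
# so the two cannot be confused.
pua_bytes = "\U000F01FF".encode("utf-8")
decoded_pua = pua_bytes.decode("utf-8", "surrogateescape")
decoded_stray = b"\xff".decode("utf-8", "surrogateescape")

# Case 2: env encoding is not UTF-8 (ASCII here, purely as an example).
# The undecodable byte is escaped into the PUA range and restored on encode.
raw = b"abc\xff"
text = raw.decode("ascii", "pua-escape-sketch")
round_trip = text.encode("ascii", "pua-escape-sketch")
```

In the first case the escaped byte and a real U+F01xx character end up as different code points; in the second, the original bytes come back unchanged after the round trip.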