[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

"Martin v. Löwis" martin at v.loewis.de
Wed Apr 22 15:24:54 EDT 2009


MRAB wrote:
> Martin v. Löwis wrote:
> [snip]
>> To convert non-decodable bytes, a new error handler "python-escape" is
>> introduced, which decodes non-decodable bytes using into a private-use
>> character U+F01xx, which is believed to not conflict with private-use
>> characters that currently exist in Python codecs.
>>
>> The error handler interface is extended to allow the encode error
>> handler to return byte strings immediately, in addition to returning
>> Unicode strings which then get encoded again.
>>
>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>>
> If the byte stream happens to include a sequence which decodes to
> U+F01xx, shouldn't that raise an exception?

I apparently have not expressed it clearly, so please help me improve
the text. What I mean is this:

- if the environment encoding (for lack of better name) is UTF-8,
  Python stops using the utf-8 codec under this PEP, and switches
  to the utf-8b codec.
- otherwise (env encoding is not utf-8), undecodable bytes get decoded
  with the error handler. In this case, U+F01xx will not occur
  in the byte stream, since no other codec ever produces this PUA
  character (this is not fully true - UTF-16 may also produce PUA
  characters, but they can't appear as env encodings).
So the case you are referring to should not happen.

Regards,
Martin



More information about the Python-list mailing list