[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Fri Apr 24 03:38:49 CEST 2009

Martin v. Löwis wrote:
> MRAB wrote:
>> Martin v. Löwis wrote:
>> [snip]
>>> To convert non-decodable bytes, a new error handler "python-escape" is
>>> introduced, which decodes non-decodable bytes into a private-use
>>> character U+F01xx, which is believed to not conflict with private-use
>>> characters that currently exist in Python codecs.
>>>
>>> The error handler interface is extended to allow the encode error
>>> handler to return byte strings immediately, in addition to returning
>>> Unicode strings which then get encoded again.
>>>
>>> If the locale's encoding is UTF-8, the file system encoding is set to
>>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>>>
>> If the byte stream happens to include a sequence which decodes to
>> U+F01xx, shouldn't that raise an exception?
> 
> I apparently have not expressed it clearly, so please help me improve
> the text. What I mean is this:
> 
> - if the environment encoding (for lack of a better name) is UTF-8,
>   Python stops using the utf-8 codec under this PEP, and switches
>   to the utf-8b codec.
> - otherwise (env encoding is not utf-8), undecodable bytes get decoded
>   with the error handler. In this case, U+F01xx will not occur
>   in the byte stream, since no other codec ever produces this PUA
>   character (this is not fully true - UTF-16 may also produce PUA
>   characters, but they can't appear as env encodings).
> So the case you are referring to should not happen.
> 
I think what's confusing me is that you talk about mapping non-decodable
bytes to U+F01xx, but you also talk about decoding to half surrogate
codes U+DC80..U+DCFF.

If each byte is mapped to a single half surrogate code instead of the
normal pair (high+low), then I can see that decoding could never be
ambiguous and encoding could reproduce the original bytes.
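
For illustration, here is a minimal Python sketch of that round trip. It
assumes the "surrogateescape" error handler found in Python 3.1 and later,
not the "python-escape" handler or "utf-8b" codec named in the draft text
quoted above, and the file name is made up for the example.

```python
# Sketch only: "surrogateescape" is assumed here as a stand-in for the
# draft's utf-8b behaviour; the file name below is a hypothetical example.
raw = b"caf\xe9.txt"  # 0xE9 is a Latin-1 byte, not valid UTF-8 in this position

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DC00 + 0xE9.
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "caf\udce9.txt"

# Encoding with the same handler restores the original byte.
assert name.encode("utf-8", errors="surrogateescape") == raw

# A strict UTF-8 decoder never produces a lone surrogate: the byte sequence
# that would spell U+DCE9 (ED B3 A9) is rejected as invalid.
try:
    b"\xed\xb3\xa9".decode("utf-8")
except UnicodeDecodeError:
    strict_rejects = True
else:
    strict_rejects = False

# With the handler, those three bytes are escaped one by one instead, so a
# lone surrogate in the decoded string always stands for exactly one raw byte.
escaped = b"\xed\xb3\xa9".decode("utf-8", errors="surrogateescape")
assert escaped == "\udced\udcb3\udca9"
assert escaped.encode("utf-8", errors="surrogateescape") == b"\xed\xb3\xa9"
```

Because well-formed UTF-8 input can never decode to U+DC80..U+DCFF, any such
code point in the result must have come from an escaped byte.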