[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

"Martin v. Löwis" martin at v.loewis.de
Wed Apr 22 15:07:47 EDT 2009


> "correct" -> "corrected"

Thanks, fixed.

>> To convert non-decodable bytes, a new error handler "python-escape" is
>> introduced, which decodes non-decodable bytes using into a private-use
>> character U+F01xx, which is believed to not conflict with private-use
>> characters that currently exist in Python codecs.
> 
> Would this mean that real private use characters in the file name would
> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
> any error handler.

The python-escape codec is only used/meaningful if the env encoding
is not UTF-8. For any other encoding, it is assumed that no character
actually maps to the private-use characters.

>> The error handler interface is extended to allow the encode error
>> handler to return byte strings immediately, in addition to returning
>> Unicode strings which then get encoded again.
> 
> Then the error callback for encoding would become specific to the target
> encoding.

Why would it become specific? It can work the same way for any encoding:
take U+F01xx, and generate the byte xx.

>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
> 
> Is this done by the codec, or the error handler? If it's done by the
> codec I don't see a reason for the "python-escape" error handler.

utf-8b is a new codec. However, the utf-8b codec is only used if the
env encoding would otherwise be utf-8. For utf-8b, the error handler
is indeed unnecessary.

>> While providing a uniform API to non-decodable bytes, this interface
>> has the limitation that chosen representation only "works" if the data
>> get converted back to bytes with the python-escape error handler
>> also.
> 
> I thought the error handler would be used for decoding.

It's used in both directions: for decoding, it converts \xXX to
U+F01XX. For encoding, U+F01XX will trigger an error, which is then
handled by the handler to produce \xXX.

> "and" -> "an"

Thanks, fixed.

Regards,
Martin




More information about the Python-list mailing list