[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
"Martin v. Löwis"
martin at v.loewis.de
Wed Apr 22 21:07:47 CEST 2009
> "correct" -> "corrected"
Thanks, fixed.
>> To convert non-decodable bytes, a new error handler "python-escape" is
>> introduced, which decodes non-decodable bytes into a private-use
>> character U+F01xx, which is believed to not conflict with private-use
>> characters that currently exist in Python codecs.
>
> Would this mean that real private use characters in the file name would
> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
> any error handler.
The python-escape error handler is only used/meaningful if the env encoding
is not UTF-8. For any other encoding, it is assumed that no character
actually maps to the private-use characters.
>> The error handler interface is extended to allow the encode error
>> handler to return byte strings immediately, in addition to returning
>> Unicode strings which then get encoded again.
>
> Then the error callback for encoding would become specific to the target
> encoding.
Why would it become specific? It can work the same way for any encoding:
take U+F01xx, and generate the byte xx.
>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>
> Is this done by the codec, or the error handler? If it's done by the
> codec I don't see a reason for the "python-escape" error handler.
utf-8b is a new codec. However, the utf-8b codec is only used if the
env encoding would otherwise be utf-8. For utf-8b, the error handler
is indeed unnecessary.
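
As a rough illustration of the mapping described above (undecodable bytes >= 0x80 becoming U+DC80..U+DCFF): no codec named "utf-8b" exists in the standard library, so this sketch assumes a released Python 3, where the same mapping is exposed through the built-in "surrogateescape" error handler on the ordinary utf-8 codec.

```python
# Sketch only: "surrogateescape" stands in for the draft's "utf-8b" codec.
raw = b"abc\xff\x80"  # 0xFF and a lone 0x80 are not valid UTF-8

# Each undecodable byte 0xXX becomes the lone surrogate U+DCXX.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
back = text.encode("utf-8", "surrogateescape")

print(ascii(text))
print(back == raw)
```

The round trip only works if the same handler is used on the way back; a plain utf-8 encode of the decoded string raises an error on the lone surrogates.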
>> While providing a uniform API to non-decodable bytes, this interface
>> has the limitation that the chosen representation only "works" if the data
>> get converted back to bytes with the python-escape error handler
>> also.
>
> I thought the error handler would be used for decoding.
It's used in both directions: for decoding, it converts \xXX to
U+F01XX. For encoding, U+F01XX will trigger an error, which is then
handled by the handler to produce \xXX.
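
A minimal sketch of that two-way behaviour, assuming a Python 3 in which encode error handlers may return bytes. The handler name "python-escape-sketch", the function name, and PUA_BASE are made up for this illustration; this is not the actual implementation.

```python
import codecs

PUA_BASE = 0xF0100  # byte 0xXX <-> U+F01XX, as in the draft text


def python_escape_sketch(exc):
    """Hypothetical handler: escape bytes on decode, unescape on encode."""
    if isinstance(exc, UnicodeDecodeError):
        # Decoding: map each undecodable byte 0xXX to U+F01XX.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encoding: the codec cannot encode U+F01XX, so it calls the handler,
        # which returns the raw byte 0xXX directly instead of a string.
        chars = exc.object[exc.start:exc.end]
        if all(PUA_BASE <= ord(c) <= PUA_BASE + 0xFF for c in chars):
            return bytes(ord(c) - PUA_BASE for c in chars), exc.end
    # Anything else is a genuine error and is re-raised unchanged.
    raise exc


codecs.register_error("python-escape-sketch", python_escape_sketch)

raw = b"caf\xe9"  # 0xE9 is not valid ASCII
text = raw.decode("ascii", "python-escape-sketch")
roundtrip = text.encode("ascii", "python-escape-sketch")

print(ascii(text))
print(roundtrip == raw)
```

The handler body never looks at the target encoding, which is why the same function works for any codec that cannot encode U+F01xx itself. With UTF-8 the handler is never invoked on encoding, because U+F01xx is encodable there.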
> "and" -> "an"
Thanks, fixed.
Regards,
Martin