[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Wed Apr 22 22:06:49 CEST 2009

Martin v. Löwis wrote:
>> "correct" -> "corrected"
> 
> Thanks, fixed.
> 
>>> To convert non-decodable bytes, a new error handler "python-escape" is
>>> introduced, which decodes non-decodable bytes using into a private-use
>>> character U+F01xx, which is believed to not conflict with private-use
>>> characters that currently exist in Python codecs.
>> Would this mean that real private use characters in the file name would
>> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
>> any error handler.
> 
> The python-escape codec is only used/meaningful if the env encoding
> is not UTF-8. For any other encoding, it is assumed that no character
> actually maps to the private-use characters.

Which should be true for any encoding from the pre-unicode era, but not
for UTF-16/32 and variants.

>>> The error handler interface is extended to allow the encode error
>>> handler to return byte strings immediately, in addition to returning
>>> Unicode strings which then get encoded again.
>> Then the error callback for encoding would become specific to the target
>> encoding.
> 
> Why would it become specific? It can work the same way for any encoding:
> take U+F01xx, and generate the byte xx.

If any error callback emits bytes these byte sequences must be legal in
the target encoding, which depends on the target encoding itself.

However for the normal use of this error handler this might be
irrelevant, because those filenames that get encoded were constructed in
such a way that reencoding them regenerates the original byte sequence.

>>> If the locale's encoding is UTF-8, the file system encoding is set to
>>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>> Is this done by the codec, or the error handler? If it's done by the
>> codec I don't see a reason for the "python-escape" error handler.
> 
> utf-8b is a new codec. However, the utf-8b codec is only used if the
> env encoding would otherwise be utf-8. For utf-8b, the error handler
> is indeed unnecessary.

Wouldn't it make more sense to be consistent how non-decodable bytes get
decoded? I.e. should the utf-8b codec decode those bytes to PUA
characters too (and refuse to encode then, so the error handler outputs
them)?

>>> While providing a uniform API to non-decodable bytes, this interface
>>> has the limitation that chosen representation only "works" if the data
>>> get converted back to bytes with the python-escape error handler
>>> also.
>> I thought the error handler would be used for decoding.
> 
> It's used in both directions: for decoding, it converts \xXX to
> U+F01XX. For encoding, U+F01XX will trigger an error, which is then
> handled by the handler to produce \xXX.

But only for non-UTF8 encodings?

Servus,
   Walter