[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
MRAB
google at mrabarnett.plus.com
Sat Apr 25 16:21:13 CEST 2009
Martin v. Löwis wrote:
>> If the bytes are mapped to single half surrogate codes instead of the
>> normal pairs (low+high), then I can see that decoding could never be
>> ambiguous and encoding could produce the original bytes.
>
> I was confused by Markus Kuhn's original UTF-8b specification. I have
> now changed the PEP to avoid using PUA characters at all.
>
I find the PEP easier to understand now.
In detail, I'd say that if a sequence of bytes >= 0x80 is found that is
not valid UTF-8, the first byte is mapped to a half surrogate and
decoding continues from the next byte.
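As a rough illustration of that byte-by-byte mapping, here is a minimal
sketch. It assumes Python 3.1 or later, where the PEP's scheme is exposed
as the "surrogateescape" error handler; the sample bytes are made up for
the example.

```python
# Sketch only: relies on the "surrogateescape" error handler (Python 3.1+).
# The input bytes are an invented example, not taken from the thread.
raw = b"abc\xff\xfedef"  # 0xFF and 0xFE can never appear in valid UTF-8

# Each undecodable byte 0xNN becomes the lone low surrogate U+DCNN,
# and decoding then carries on from the following byte.
text = raw.decode("utf-8", errors="surrogateescape")

# Collect the code points that stand in for the undecodable bytes.
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler turns the surrogates back into bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

print(escaped)
print(round_trip == raw)
```

Because every escaped byte lands in U+DC80..U+DCFF, which valid UTF-8
never decodes to, the encoder can tell escaped bytes apart from real
characters.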
The only drawback I can see arises when the UTF-8 bytes actually decode
to a half surrogate. However, half surrogates should really only occur
in UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8
anyway!
As for handling this case, you could either:
1. Raise an exception (which is what you're trying to avoid)
or:
2. Treat it as invalid UTF-8 and map the bytes to half surrogates
(encoding would produce the original bytes).
I'd prefer option 2.
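To make option 2 concrete, here is a small sketch of how it plays out in
Python 3 releases that ship the "surrogateescape" handler. The byte
string is an invented example: it is what you get if U+DCFF is pushed
through the UTF-8 bit layout, which a strict UTF-8 decoder rejects.

```python
# Sketch only: shows Python 3 behaviour; the bytes are an invented example.
encoded_surrogate = b"\xed\xb3\xbf"  # UTF-8-style encoding of U+DCFF

# Option 1: a strict decoder raises, because surrogates are not valid UTF-8.
try:
    encoded_surrogate.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

# Option 2: treat the sequence as invalid and escape each byte separately.
escaped_text = encoded_surrogate.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler reproduces the original bytes.
restored = escaped_text.encode("utf-8", errors="surrogateescape")

print(strict_rejects)
print([hex(ord(ch)) for ch in escaped_text])
print(restored == encoded_surrogate)
```

The three bytes come back as three separate half surrogates rather than
as the single surrogate they appear to encode, so the round trip stays
unambiguous.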
Anyway, +1 from me.