Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

25 Apr 2009


...
The only drawback I can see is if the UTF-8 bytes actually decode to a
half surrogate. However, half surrogates should really only occur in
UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8
anyway!
Right: that's the rationale for UTF-8b. Encoding half surrogates
violates parts of the Unicode spec, so UTF-8b is "safe".
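
To illustrate the point, here is a minimal sketch, assuming a Python 3 interpreter and its built-in "utf-8" codec. The helper names are made up for this example and are not part of the PEP. It checks that the strict codec refuses both to decode the UTF-8-style bytes of a lone surrogate and to encode a lone surrogate.

```python
# Illustration only: assumes Python 3 and its built-in "utf-8" codec.
# b"\xed\xb2\x80" is what U+DC80 (a lone low surrogate) would look like
# if it were encoded with the ordinary 3-byte UTF-8 bit pattern.
surrogate_bytes = b"\xed\xb2\x80"


def decodes_strictly(data):
    """Return True if the strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def encodes_strictly(text):
    """Return True if the strict UTF-8 encoder accepts the string."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


strict_decode_ok = decodes_strictly(surrogate_bytes)
strict_encode_ok = encodes_strictly("\udc80")
print(strict_decode_ok, strict_encode_ok)
```

Since the strict codec never produces lone surrogates on its own, they are free to stand in for undecodable bytes.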
...
As for handling this case, you could either:
1. Raise an exception (which is what you're trying to avoid)
or:
2. Treat it as invalid UTF-8 and map the bytes to half surrogates
(encoding would produce the original bytes).
I'd prefer option 2.
I hadn't thought of this case, but you are right - they *are*
illegal bytes, after all. Raising an exception would be useless
since the whole point of this codec is to never raise Unicode
errors.
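
Option 2 can be sketched as follows, assuming Python 3.1 or later, where this behaviour is available as the "surrogateescape" error handler. The function names are hypothetical and only serve the example. Bytes that look like an encoded surrogate are treated as invalid, each byte is mapped to its own lone surrogate, and encoding restores the original bytes.

```python
# Illustration only: assumes Python 3.1+ and the "surrogateescape"
# error handler; escape_decode/escape_encode are made-up helper names.
def escape_decode(data):
    """Decode UTF-8, mapping each undecodable byte to a lone surrogate."""
    return data.decode("utf-8", errors="surrogateescape")


def escape_encode(text):
    """Encode to UTF-8, turning escaped surrogates back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# UTF-8-style bytes for the lone surrogate U+DC80: invalid as UTF-8.
surrogate_bytes = b"\xed\xb2\x80"

text = escape_decode(surrogate_bytes)

# Every resulting character should be an escape surrogate (U+DC80..U+DCFF),
# one per input byte, rather than a single decoded U+DC80.
all_escaped = all(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)

# Encoding with the same handler gives back the original bytes.
round_trip = escape_encode(text)

print(ascii(text), all_escaped, round_trip == surrogate_bytes)
```

No exception is raised in either direction, and valid UTF-8 input is decoded as usual.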

Regards,
Martin

"Martin v. Löwis"