[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Tue Apr 28 22:37:07 CEST 2009

On approximately 4/28/2009 1:25 PM, came the following characters from 
the keyboard of Martin v. Löwis:
>> The UTF-8b representation suffers from the same potential ambiguities as
>> the PUA characters... 
> Not at all the same ambiguities. Here, again, the two choices:
> A. use PUA characters to represent undecodable bytes, in particular for
>    UTF-8 (the PEP actually never proposed this to happen).
>    This introduces an ambiguity: two different files in the same
>    directory may decode to the same string name, if one has the PUA
>    character, and the other has a non-decodable byte that gets decoded
>    to the same PUA character.
> B. use UTF-8b, representing the byte with ill-formed surrogate codes.
>    The same ambiguity does *NOT* exist. If a file on disk already
>    contains an invalid surrogate code in its file name, then the UTF-8b
>    decoder will recognize this as invalid, and decode it byte-for-byte,
>    into three surrogate codes. Hence, the file names that are different
>    on disk are also different in memory. No ambiguity.
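
Case B can be sketched in Python, assuming the UTF-8b scheme behaves like
the error handler that Python 3.1 and later expose as "surrogateescape".
The byte values are made up for illustration and are not taken from the
thread.

```python
# Sketch only: assumes UTF-8b == the "surrogateescape" error handler.
raw_lone_byte = b"\x80"            # a single undecodable byte
raw_surrogate = b"\xed\xb2\x80"    # ill-formed UTF-8 that spells U+DC80

name_a = raw_lone_byte.decode("utf-8", "surrogateescape")
name_b = raw_surrogate.decode("utf-8", "surrogateescape")

# ascii() avoids printing lone surrogates directly to the terminal.
print(ascii(name_a))     # one escape code for the one bad byte
print(ascii(name_b))     # three escape codes, one per byte on disk
print(name_a != name_b)  # the two names stay distinct in memory
```

Both names also encode back to their original bytes with the same
handler, which is the round trip the scheme relies on.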

C. A file on disk whose name contains the invalid surrogate code, 
accessed via the str interface, where no decoding happens, matches in 
memory a file on disk whose name contains the byte that translates to 
the same surrogate, accessed via the bytes interface.  Ambiguity.
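
A minimal sketch of case C, again assuming the scheme matches Python's
"surrogateescape" handler.  The variable names and the idea that the str
interface hands over a lone U+DC80 untouched are illustrative
assumptions, not something a particular platform is shown to do here.

```python
# Hypothetical: a name delivered by a str-based API with no decoding step,
# already containing the lone surrogate U+DC80.
name_via_str = "\udc80"

# A different file, whose name is the single undecodable byte 0x80,
# decoded from the bytes interface with the assumed handler.
name_via_bytes = b"\x80".decode("utf-8", "surrogateescape")

# Both names are the same string in memory.
print(name_via_str == name_via_bytes)
```

Encoding either string back with the same handler gives the byte 0x80,
so the first file's original name cannot be recovered from the string
alone.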

Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
