[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
l.mastrodomenico at gmail.com
Tue Apr 28 15:01:32 CEST 2009
2009/4/28 Glenn Linderman <v+python at g.nevcal.com>:
> The switch from PUA to half-surrogates does not resolve the issues with the
> encoding not being a 1-to-1 mapping, though. The very fact that you think
> you can get away with use of lone surrogates means that other people might,
> accidentally or intentionally, also use lone surrogates for some other
> purpose. Even in file names.
It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is
not a valid Unicode character (not a character at all, really) and the
only way you can put this in a POSIX filename is if you use a very
lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'.
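
To make the "very lenient encoder" point concrete, here is a minimal sketch. It assumes a Python 3 interpreter in which the 'surrogatepass' error handler is available (it is not part of the PEP text under discussion; it merely stands in for a lenient encoder):

```python
# Illustration only: 'surrogatepass' plays the role of a "very lenient"
# UTF-8 encoder here; it is an assumption of this sketch, not PEP wording.
lone = "\udcff"  # a lone low surrogate, not a valid Unicode character

# A strict UTF-8 encoder refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lenient encoder emits the three-byte form anyway.
lenient = lone.encode("utf-8", "surrogatepass")

print(strict_ok)
print(lenient)
```

So a strict encoder never produces these bytes; they only end up in a filename if something lenient wrote them.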
Since this byte sequence doesn't represent a valid character when
decoded with UTF-8, it should simply be considered an invalid UTF-8
sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
'\udcff').

Martin: maybe the PEP should say this explicitly?

Note that the round-trip works without ambiguities between '\udcff' in
the filename:
b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'
and b'\xff' in the filename, decoded by Python to '\udcff':
b'\xff' -> '\udcff' -> b'\xff'
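
The two round-trips can be checked with a short sketch. It assumes the PEP's error handler under the name 'surrogateescape' (the name it carries in Python 3.1 and later) and a UTF-8 decoder that rejects encoded surrogates; roundtrip() is a hypothetical helper used only for this illustration:

```python
def roundtrip(raw):
    """Decode filename bytes with the PEP 383 handler, then encode them back.

    Hypothetical helper for illustration; 'surrogateescape' is assumed to be
    the name of the error handler described in the PEP.
    """
    text = raw.decode("utf-8", "surrogateescape")
    return text, text.encode("utf-8", "surrogateescape")


# Case 1: a lenient encoder put the UTF-8 form of U+DCFF in the filename.
text_a, back_a = roundtrip(b"\xed\xb3\xbf")

# Case 2: the filename contains the single undecodable byte 0xFF.
text_b, back_b = roundtrip(b"\xff")

# ascii() is used because printing lone surrogates directly may fail
# on a UTF-8 terminal.
print(ascii(text_a), back_a)
print(ascii(text_b), back_b)
```

The two decoded strings differ, and each one encodes back to exactly the bytes it came from, which is why there is no ambiguity.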