[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Tue Apr 28 22:34:21 CEST 2009

On approximately 4/28/2009 6:01 AM, came the following characters from 
the keyboard of Lino Mastrodomenico:
> 2009/4/28 Glenn Linderman <v+python at g.nevcal.com>:
>> The switch from PUA to half-surrogates does not resolve the issues with the
>> encoding not being a 1-to-1 mapping, though.  The very fact that you  think
>> you can get away with use of lone surrogates means that other people might,
>> accidentally or intentionally, also use lone surrogates for some other
>> purpose.  Even in file names.
> It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is
> not a valid Unicode character (not a character at all, really) and the
> only way you can put this in a POSIX filename is if you use a very
> lenient  UTF-8 encoder that gives you b'\xed\xb3\xbf'.


An 8859-1 locale allows any byte sequence to placed into a POSIX filename.

And while U+DCFF is illegal alone in Unicode, it is not illegal in 
Python str values.  And from my testing, Python 3's current UTF-8 
encoder will happily provide exactly the bytes value you mention when 
given U+DCFF.

> Since this byte sequence doesn't represent a valid character when
> decoded with UTF-8, it should simply be considered an invalid UTF-8
> sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
> '\udcff').
> Martin: maybe the PEP should say this explicitly?
> Note that the round-trip works without ambiguities between '\udcff' in
> the filename:
> b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'
> and b'\xff' in the filename, decoded by Python to '\udcff':
> b'\xff' -> '\udcff' -> b'\xff'

Others have made this suggestion, and it is helpful to the PEP, but not 
sufficient.  As implemented as an error handler, I'm not sure that the 
b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 
decoder is happy with it.  Which, in my testing, it is.

Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

More information about the Python-Dev mailing list