[Python-Dev] PEP 383 (again)
Hrvoje Niksic
hrvoje.niksic at avl.com
Tue Apr 28 14:41:19 CEST 2009
Lino Mastrodomenico wrote:
> Let's suppose that I use Python 2.x or something else to create a file
> with name b'\xff'. My (Linux) system has a sane configuration and the
> filesystem encoding is UTF-8, so it's an invalid name but the kernel
> will blindly accept it anyway.
>
> With this PEP, Python 3.1 listdir() will convert b'\xff' to the string '\udcff'.

One question that really bothers me about this proposal is the following:

Assume a UTF-8 locale. A file named b'\xff', being an invalid UTF-8
sequence, will be converted to the half-surrogate '\udcff'. However, a
file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be
converted to '\udcff'. Those are quite different POSIX pathnames; how
will Python know which one it was when I later pass '\udcff' to open()?

A poster hinted at this question, but I haven't seen it answered yet.
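
To make the collision concrete, here is a minimal sketch that can be run against a Python 3 release which ships the 'surrogateescape' error handler (3.1 and later). It assumes the strict UTF-8 decoder in those releases, which rejects encoded surrogates; that is the behaviour of the released interpreter, not necessarily of the draft under discussion here. The variable names are purely illustrative.

```python
# Sketch: decode both byte strings the way a UTF-8 locale would decode
# file names under the 'surrogateescape' error handler.
raw_invalid = b'\xff'            # not a valid UTF-8 sequence at all
raw_surrogate = b'\xed\xb3\xbf'  # UTF-8-style encoding of U+DCFF

name_a = raw_invalid.decode('utf-8', 'surrogateescape')
name_b = raw_surrogate.decode('utf-8', 'surrogateescape')

# Use ascii() because lone surrogates cannot be written to a UTF-8 terminal.
print(ascii(name_a))
print(ascii(name_b))

# If the two names were equal, open() could not tell the files apart.
print(name_a == name_b)

# Encoding with the same handler should give back the original bytes.
print(name_a.encode('utf-8', 'surrogateescape') == raw_invalid)
print(name_b.encode('utf-8', 'surrogateescape') == raw_surrogate)
```

If the decoder treats b'\xed\xb3\xbf' as invalid and escapes each of its bytes separately, the two names differ and both round-trip. If it accepted the sequence as U+DCFF, the names would be identical, which is exactly the ambiguity asked about above.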

[1] I'm assuming that it's valid UTF-8 because it passes through Python 2.5's
'\xed\xb3\xbf'.decode('utf-8'). I don't claim to be a UTF-8 expert.