Re: [Python-Dev] PEP 383 (again)
Lino Mastrodomenico wrote:
Let's suppose that I use Python 2.x or something else to create a file with name b'\xff'. My (Linux) system has a sane configuration and the filesystem encoding is UTF-8, so it's an invalid name but the kernel will blindly accept it anyway.
With this PEP, Python 3.1 listdir() will convert b'\xff' to the string '\udcff'.
One question that really bothers me about this proposal is the following:
Assume a UTF-8 locale. A file named b'\xff', being an invalid UTF-8 sequence, will be converted to the half-surrogate '\udcff'. However, a file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be converted to '\udcff'. Those are quite different POSIX pathnames; how will Python know which one it was when I later pass '\udcff' to open()?
A poster hinted at this question, but I haven't seen it answered yet.
[1] I'm assuming that it's valid UTF-8 because it passes through Python 2.5's '\xed\xb3\xbf'.decode('utf-8'). I don't claim to be a UTF-8 expert.
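The two byte strings in the question can be compared directly. This is a minimal sketch, assuming a later Python 3 release in which the `surrogateescape` error handler exists and the UTF-8 codec rejects encoded surrogates; it is not taken from the thread and may not match the PEP draft under discussion. The variable names are illustrative only.

```python
# Assumption: Python 3 with the "surrogateescape" error handler and a
# strict UTF-8 codec. This may differ from the draft discussed in the thread.

invalid_name = b"\xff"                    # not valid UTF-8 at all
encoded_surrogate_name = b"\xed\xb3\xbf"  # UTF-8-style encoding of U+DCFF

# Decode both names the way a surrogate-escaping listdir() might.
decoded_invalid = invalid_name.decode("utf-8", "surrogateescape")
decoded_surrogate = encoded_surrogate_name.decode("utf-8", "surrogateescape")

# ascii() avoids trying to write lone surrogates to the terminal.
print(ascii(decoded_invalid))    # '\udcff'
print(ascii(decoded_surrogate))  # '\udced\udcb3\udcbf'

# Under these assumptions the two names stay distinct, and each one
# round-trips to its original bytes.
round_trip_invalid = decoded_invalid.encode("utf-8", "surrogateescape")
round_trip_surrogate = decoded_surrogate.encode("utf-8", "surrogateescape")
print(round_trip_invalid == invalid_name)              # True
print(round_trip_surrogate == encoded_surrogate_name)  # True
```

With a strict decoder, the three bytes b'\xed\xb3\xbf' are themselves treated as undecodable, so each byte is escaped separately and the collision described above does not occur. Whether that holds depends on the codec being strict, which is the assumption labeled in the sketch.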
Hrvoje Niksic wrote:
Assume a UTF-8 locale. A file named b'\xff', being an invalid UTF-8 sequence, will be converted to the half-surrogate '\udcff'. However, a file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be converted to '\udcff'. Those are quite different POSIX pathnames; how will Python know which one it was when I later pass '\udcff' to open()?
[1] I'm assuming that it's valid UTF-8 because it passes through Python 2.5's '\xed\xb3\xbf'.decode('utf-8'). I don't claim to be a UTF-8 expert.
I'm not a UTF-8 expert either, but I got bitten by this yesterday. I was uploading a file to a Google Search Appliance and it was rejected as invalid UTF-8 despite having been encoded into UTF-8 by Python. The cause was a byte sequence which decoded to a half surrogate similar to your example above. Python will happily decode and encode such sequences, but as I found to my cost, other systems reject them.
Reading Wikipedia implies that Python is wrong to accept these sequences, and I think (though I'm not a lawyer) that RFC 3629 also implies this:
"The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters."
and
"Implementations of the decoding algorithm above MUST protect against decoding invalid sequences."
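The RFC 3629 rule quoted above can be checked before handing data to a stricter system. This is a rough sketch, assuming a Python 3 release whose UTF-8 codec rejects surrogates by default; `is_strict_utf8` is a hypothetical helper name, not an existing API, and the `surrogatepass` handler is used only to reproduce the lenient bytes described in the message.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python 3's default
    strict rules, which reject encoded surrogates (U+D800..U+DFFF).

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


lone_surrogate = "\udcff"

# "surrogatepass" reproduces the lenient encoding that older codecs emitted.
lenient_bytes = lone_surrogate.encode("utf-8", "surrogatepass")
print(lenient_bytes)                  # b'\xed\xb3\xbf'
print(is_strict_utf8(lenient_bytes))  # False

# The default strict encoder refuses the lone surrogate outright.
try:
    lone_surrogate.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
print(strict_encode_failed)           # True
```

A check like this only reports what one particular codec accepts; a receiving system may apply further rules of its own.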
participants (2): Duncan Booth, Hrvoje Niksic