[Python-Dev] PEP 383 (again)
Duncan Booth
duncan.booth at suttoncourtenay.org.uk
Tue Apr 28 15:22:45 CEST 2009
Hrvoje Niksic <hrvoje.niksic at avl.com> wrote:
> Assume a UTF-8 locale. A file named b'\xff', being an invalid UTF-8
> sequence, will be converted to the half-surrogate '\udcff'. However,
> a file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be
> converted to '\udcff'. Those are quite different POSIX pathnames; how
> will Python know which one it was when I later pass '\udcff' to
> open()?
>
>
> [1]
> I'm assuming that it's valid UTF8 because it passes through Python
> 2.5's '\xed\xb3\xbf'.decode('utf-8'). I don't claim to be a UTF-8
> expert.
I'm not a UTF-8 expert either, but I got bitten by this yesterday. I was
uploading a file to a Google Search Appliance and it was rejected as
invalid UTF-8 despite having been encoded into UTF-8 by Python.
The cause was a byte sequence that decoded to a half surrogate, similar to
your example above. Python will happily decode and encode such sequences,
but, as I found to my cost, other systems reject them.
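For illustration, here is a minimal sketch of the difference, assuming a
current Python 3 interpreter rather than the Python 2.5 discussed in this
thread. The strict 'utf-8' codec there refuses the sequence, while the
'surrogatepass' error handler reproduces the lenient behaviour. The helper
name decodes_strictly is hypothetical.

```python
# The three bytes from the quoted example: a UTF-8-style encoding of U+DCFF.
lone_surrogate_bytes = b"\xed\xb3\xbf"


def decodes_strictly(data):
    """Return True if data is accepted by the strict 'utf-8' codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Strict decoding, as a conforming receiver would do it.
strict_ok = decodes_strictly(lone_surrogate_bytes)

# Lenient decoding: 'surrogatepass' lets surrogate code points through
# in both directions, so the bytes round-trip unchanged.
lenient_text = lone_surrogate_bytes.decode("utf-8", "surrogatepass")
round_trip = lenient_text.encode("utf-8", "surrogatepass")
```

A receiver that validates strictly would fail at the first step, which matches
the rejection described above.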
Wikipedia implies that Python is wrong to accept these sequences,
and I think (though I'm not a lawyer) that RFC 3629 implies the same:
"The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form
(as surrogate pairs) and do not directly represent characters."
and
"Implementations of the decoding algorithm above MUST protect against
decoding invalid sequences."
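One way to guard against this before handing text to a stricter system is to
look for code points in the U+D800 to U+DFFF range before encoding. The
following is only a sketch under that assumption; the function names
surrogate_positions and safe_to_upload and the sample strings are
hypothetical.

```python
def surrogate_positions(text):
    """Return the indexes of code points in the range U+D800..U+DFFF."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def safe_to_upload(text):
    """Return True if text contains no surrogate code points.

    Such text can be encoded to UTF-8 without producing the sequences
    that RFC 3629 prohibits.
    """
    return not surrogate_positions(text)


# Hypothetical sample values, for illustration only.
clean = "caf\u00e9"
tainted = "name\udcff.txt"

clean_hits = surrogate_positions(clean)
tainted_hits = surrogate_positions(tainted)
```

Reporting the positions rather than just a yes/no answer makes it easier to
find which part of the input produced the offending code point.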