[Python-Dev] PEP 383 (again)

Tue Apr 28 15:22:45 CEST 2009

Hrvoje Niksic <hrvoje.niksic at avl.com> wrote:

> Assume a UTF-8 locale.  A file named b'\xff', being an invalid UTF-8 
> sequence, will be converted to the half-surrogate '\udcff'.  However,
> a file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be 
> converted to '\udcff'.  Those are quite different POSIX pathnames; how
> will Python know which one it was when I later pass '\udcff' to
> open()? 
> 
> 
> [1]
> I'm assuming that it's valid UTF8 because it passes through Python
> 2.5's '\xed\xb3\xbf'.decode('utf-8').  I don't claim to be a UTF-8
> expert.

I'm not a UTF-8 expert either, but I got bitten by this yesterday. I was 
uploading a file to a Google Search Appliance and it was rejected as 
invalid UTF-8 despite having been encoded into UTF-8 by Python.

The cause was a byte sequence which decoded to a half surrogate similar to 
your example above. Python will happily decode and encode such sequences, 
but as I found to my cost other systems reject them.

Reading wikipedia implies that Python is wrong to accept these sequences 
and I think (though I'm not a lawyer) that RFC 3629 also implies this:

"The definition of UTF-8 prohibits encoding character numbers between 
U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form 
(as surrogate pairs) and do not directly represent characters."

 and

"Implementations of the decoding algorithm above MUST protect against 
decoding invalid sequences."