On 30.08.2014 01:37, Greg Ewing wrote:
> M.-A. Lemburg wrote:
>> we needed a way to make sure that Python 3 also optionally supports working with lone surrogates in such UTF-8 streams (nowadays called CESU-8: http://en.wikipedia.org/wiki/CESU-8).
>
> I don't think CESU-8 is the same thing. According to the wiki page, CESU-8 *requires* all code points above 0xffff to be split into surrogate pairs before encoding. It also doesn't say that lone surrogates are valid -- it doesn't mention lone surrogates at all, only pairs. Neither does the linked technical report.
>
> The technical report also says that CESU-8 forbids any UTF-8 sequences of more than three bytes, so it's definitely not "UTF-8 plus lone surrogates".
You're right, it's not the same as UTF-8 plus lone surrogates.
CESU-8 does encode each surrogate as an individual code point using the UTF-8 encoding scheme, which is probably why it came up in discussions about having UTF-8 streams do the same for lone surrogates.
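To make the difference concrete, here is a rough sketch of a CESU-8 encoder. Python has no CESU-8 codec in the standard library as far as I know, so `cesu8_encode` is a hypothetical helper written only for illustration. It splits every code point above 0xffff into a surrogate pair and encodes each half as a three-byte sequence, and it rejects lone surrogates because CESU-8 does not define them.

```python
def cesu8_encode(text):
    """Hypothetical helper: encode a str as CESU-8 (not a stdlib codec)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split the code point into a UTF-16 surrogate pair ...
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            # ... and encode each surrogate as its own three-byte sequence.
            pair = chr(high) + chr(low)
            out += pair.encode("utf-8", "surrogatepass")
        else:
            # Strict UTF-8: a lone surrogate in the input raises
            # UnicodeEncodeError here, since CESU-8 only covers pairs.
            out += ch.encode("utf-8")
    return bytes(out)


s = "\U0001F600"             # one code point above 0xffff
utf8 = s.encode("utf-8")     # a single four-byte sequence
cesu8 = cesu8_encode(s)      # two three-byte sequences
```

For this input, regular UTF-8 yields four bytes while the sketch yields six, with no sequence longer than three bytes.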
So let's call the encoding UTF-8-py so that everyone knows what we're talking about :-)
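For reference, a small sketch of what "UTF-8 plus lone surrogates" looks like in Python 3, assuming that the `surrogatepass` error handler of the utf-8 codec is the behaviour meant here. The variable names are only for illustration.

```python
lone = "\ud800"  # a lone high surrogate, not part of a pair

# Strict UTF-8 refuses to encode lone surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With surrogatepass the surrogate is encoded like any other
# code point in that range, as a three-byte sequence.
data = lone.encode("utf-8", "surrogatepass")

# Decoding needs the same error handler to round-trip.
roundtrip = data.decode("utf-8", "surrogatepass")
```

Code points above 0xffff still use regular four-byte sequences in this mode, which is where it differs from CESU-8.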