[Python-Dev] surrogatepass - she's a witch, burn 'er! [was: Cleaning up ...]

M.-A. Lemburg mal at egenix.com
Sat Aug 30 12:03:06 CEST 2014


On 30.08.2014 01:37, Greg Ewing wrote:
> M.-A. Lemburg wrote:
>> we needed
>> a way to make sure that Python 3 also optionally supports working
>> with lone surrogates in such UTF-8 streams (nowadays called CESU-8:
>> http://en.wikipedia.org/wiki/CESU-8).
> 
> I don't think CESU-8 is the same thing. According to the wiki
> page, CESU-8 *requires* all code points above 0xffff to be split
> into surrogate pairs before encoding. It also doesn't say that
> lone surrogates are valid -- it doesn't mention lone surrogates
> at all, only pairs. Neither does the linked technical report.
> 
> The technical report also says that CESU-8 forbids any UTF-8
> sequences of more than three bytes, so it's definitely not
> "UTF-8 plus lone surrogates".

You're right, it's not the same as UTF-8 plus lone surrogates.

CESU-8 does encode surrogates as individual code points using
the UTF-8 encoding, which is what probably caused it to be
mentioned in discussions when talking about having UTF-8 streams
do the same for lone surrogates.

So let's call the encoding UTF-8-py so that everyone knows
what we're talking about :-)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 30 2014)
>>> Python Projects, Consulting and Support ...   http://www.egenix.com/
>>> mxODBC.Zope/Plone.Database.Adapter ...       http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________
2014-08-27: Released eGenix PyRun 2.0.1 ...       http://egenix.com/go62
2014-09-19: PyCon UK 2014, Coventry, UK ...                20 days to go
2014-09-27: PyDDF Sprint 2014 ...                          28 days to go

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/


More information about the Python-Dev mailing list