[Python-Dev] PEP 393 Summer of Code Project

Fri Aug 26 10:54:09 CEST 2011

Stefan Behnel wrote:
> Isaac Morland, 26.08.2011 04:28:
>> On Thu, 25 Aug 2011, Guido van Rossum wrote:
>>> I'm not sure what should happen with UTF-8 when it (in flagrant
>>> violation of the standard, I presume) contains two separately-encoded
>>> surrogates forming a valid surrogate pair; probably whatever the UTF-8
>>> codec does on a wide build today should be good enough. Similarly for
>>> encoding to UTF-8 on a wide build if one managed to create a string
>>> containing a surrogate pair. Basically, I'm for a
>>> garbage-in-garbage-out approach (with separate library functions to
>>> detect garbage if the app is worried about it).
>>
>> If it's called UTF-8, there is no decision to be taken as to decoder
>> behaviour - any byte sequence not permitted by the Unicode standard must
>> result in an error (although, of course, *how* the error is to be
>> reported could legitimately be the subject of endless discussion).
>> There are security implications to violating the standard so this
>> isn't just legalistic purity.
>>
>> Hmmm, doesn't look good:
>>
>> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
>> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> '\xed\xb0\x80'.decode ('utf-8')
>> u'\udc00'
>> >>>
>>
>> Incorrect! Although this is a narrow build - I can't say what the wide
>> build would do.
> 
> Works the same for me in a wide Py2.7 build, but gives me this in Py3:
> 
> Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50)
> [GCC 4.4.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> b'\xed\xb0\x80'.decode ('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2:
> illegal encoding
> 
> Same for current Py3.3 and the PEP393 build (although both have a better
> exception message now: "UnicodeDecodeError: 'utf8' codec can't decode
> bytes in position 0-1: invalid continuation byte").

The reason for this is that the UTF-8 codec in Python 2.x
has never rejected lone surrogates and it was used to
store Unicode literals in pyc files (using marshal)
and also by pickle for transferring Unicode strings,
so we could not simply reject lone surrogates, since this
would have caused compatibility problems.

That change was made in Python 3.x, together with a special
error handler, surrogatepass, which still allows the UTF-8
codec to process lone surrogates where that is needed.
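
To illustrate the difference, here is a minimal sketch, assuming
Python 3.1 or later (where the surrogatepass handler is available).
The variable names are only for illustration:

```python
# Assumes Python 3.1+, where the "surrogatepass" error handler exists.
lone = "\udc00"        # a lone low surrogate, as in the sessions quoted above
raw = b"\xed\xb0\x80"  # the byte sequence from the sessions quoted above

# The default (strict) error handling rejects the surrogate both ways.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# surrogatepass lets the same data through when explicitly requested.
passed_bytes = lone.encode("utf-8", "surrogatepass")
passed_text = raw.decode("utf-8", "surrogatepass")

print(strict_encode_ok, strict_decode_ok)
print(passed_bytes, ascii(passed_text))
```

The handler has to be requested explicitly, so code that just calls
.decode('utf-8') gets the strict behaviour.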

BTW: I'd love to join the discussion about PEP 393, but
unfortunately I'm swamped with work, so these are just
a few comments...

What I'm missing in the discussion is statistics of the
effects of the patch (both memory and performance) and
the effect on 3rd party extensions.

I'm not convinced that the memory/speed tradeoff is worth the
breakage, or that the patch actually saves memory in real-world
applications, and I'm unsure whether the needed code changes to
the binary Python Unicode API can be done in a minor Python
release.

Note that in the worst case, a PEP 393 Unicode object will
store three versions of the same string, e.g. on Windows
with sizeof(wchar_t)==2: a UCS4 version in str,
a UTF-8 version in utf8 (this gets built whenever Python needs
a UTF-8 version of the object) and a wchar_t version in wstr
(which gets built whenever Python codecs or extensions need
a Py_UNICODE or wchar_t representation).
The same holds on all platforms when you store a Latin-1
non-ASCII string: str holds the Latin-1 string, utf8 the
UTF-8 version and wstr the 2- or 4-byte wchar_t version.
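
A rough way to look at the width of the main (str) representation,
assuming a CPython 3.3 or later interpreter with PEP 393 in place.
sys.getsizeof() is implementation-specific, and this sketch does not
show the additional utf8 and wstr copies described above; the helper
name is made up for the example:

```python
import sys

def bytes_per_code_point(ch):
    # Comparing two lengths cancels out the fixed object header.
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000

# One sample character per storage class (assumption: CPython 3.3+).
samples = {
    "ASCII": "a",
    "Latin-1": "\xe9",
    "BMP": "\u20ac",
    "non-BMP": "\U0001F600",
}
widths = {name: bytes_per_code_point(ch) for name, ch in samples.items()}
print(widths)
```

Any cached UTF-8 or wchar_t copy comes on top of these numbers.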

* A note on terminology: Python stores Unicode as code points.

A Unicode "code point" refers to any value in the Unicode code
range which is 0 - 0x10FFFF. Lone surrogates, unassigned
and illegal code points are all still code points - this is
a detail people often forget. Various code points in Unicode
have special meanings and some are not allowed to be
used in encodings, but that does not rule them
out from being stored and processed as code points.
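
A small sketch of this in Python 3 (the variable names are just for
illustration): chr() accepts every value in the code range, including
surrogates and noncharacters, and only refuses values beyond it.

```python
import unicodedata

highest = chr(0x10FFFF)       # top of the code range
lone_surrogate = chr(0xD800)  # a surrogate is still a code point
nonchar = chr(0xFFFF)         # so is a noncharacter

# Values beyond 0x10FFFF are not code points at all.
try:
    chr(0x110000)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False

# All three can be stored and inspected in a Python string.
stored = highest + lone_surrogate + nonchar
categories = [unicodedata.category(c) for c in stored]
print(len(stored), categories, beyond_range_ok)
```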

Code units are only used in encoded versions of Unicode, e.g.
UTF-8, UTF-16 and UTF-32. Mixing code units and code points
can cause much confusion, so it's better to talk only
about code points when referring to Python Unicode objects,
since you only ever meet code units when looking at
the bytes output of the codecs.
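
As a sketch of the difference, assuming Python 3.3 or later (where
len() counts code points on all builds), the same string has one
length in code points and different lengths in code units depending
on the encoding:

```python
text = "a\xe9\u20ac\U0001F600"  # four code points

code_points = len(text)
utf8_units = len(text.encode("utf-8"))            # 1-byte code units
utf16_units = len(text.encode("utf-16-le")) // 2  # 2-byte code units
utf32_units = len(text.encode("utf-32-le")) // 4  # 4-byte code units

print(code_points, utf8_units, utf16_units, utf32_units)
```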

This is important to know, since Python is not only meant
to process Unicode, but also to build Unicode strings, so
a careful distinction has to be made when considering what
is correct and what not: codecs have to follow much more
strict rules than Python itself.

* A note on surrogates: These are just one particular problem
where you run into the situation where splitting a Unicode
string potentially breaks a combination of code points.
There are a few other types of code points that cause similar
problems, e.g. combining code points.
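
A short sketch of the combining case, using only the standard
unicodedata module (the variable names are illustrative):

```python
import unicodedata

text = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Slicing between the two code points separates base and accent.
head, tail = text[:1], text[1:]

# A nonzero combining class marks tail as a combining code point.
combining_class = unicodedata.combining(tail)

# NFC normalization composes the pair into a single code point.
composed = unicodedata.normalize("NFC", text)

print(ascii(head), ascii(tail), combining_class, ascii(composed))
```

No surrogates are involved here, yet the split still breaks what
the user sees as one character.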

Simply going with UCS-4 does not solve the problem, since
even with UCS-4 storage, you can still have surrogates in your
Python Unicode string. As with many things, it is important
to be aware of the potential problem, but there's no
automatic fix to get rid of it. What we can do is make
the best of it, and this has already happened in many areas,
e.g. codecs joining surrogates automatically, chr()
creating surrogates, etc.
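
A minimal sketch of both points, assuming Python 3.4 or later (where
the UTF-16 codecs accept the surrogatepass handler): chr() happily
creates surrogates, a string can hold them as separate code points,
and a round trip through the UTF-16 codec joins the pair.

```python
# chr() creates surrogate code points without complaint.
high, low = chr(0xD83D), chr(0xDE00)

# Stored side by side they remain two code points.
pair = high + low

# Encoding needs surrogatepass; decoding joins the pair automatically.
utf16 = pair.encode("utf-16-le", "surrogatepass")
joined = utf16.decode("utf-16-le")

print(len(pair), utf16, len(joined), ascii(joined))
```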

-- 
Marc-Andre Lemburg
eGenix.com
