[Python-Dev] len(chr(i)) = 2?

Tue Nov 23 20:31:37 CET 2010

Alexander Belopolsky wrote:
> On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger
> <raymond.hettinger at gmail.com> wrote:
> ..
>> Any explanation we give users needs to let them know two things:
>> * that we cover the entire range of unicode not just BMP
>> * that sometimes len(chr(i)) is one and sometimes two
> 
> This discussion motivated me to start looking into how well the Python
> library itself is prepared to deal with len(chr(i)) = 2.  I was not
> surprised to find that textwrap does not handle the issue that well:
> 
> >>> len(wrap(' \U00010140' * 80, 20))
> 12
> >>> len(wrap(' \U00000140' * 80, 20))
> 8
> 
> That module should probably be rewritten to properly implement the
> Unicode line breaking algorithm
> <http://unicode.org/reports/tr14/tr14-22.html>.
> 
> Yet finding a bug in a str object method after a 5 min review was a
> bit discouraging:
> 
> >>> 'xyz'.center(20, '\U00010140')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: The fill character must be exactly one character long
> 
> Given the apparent difficulty of writing even basic text processing
> algorithms in the presence of surrogate pairs, I wonder how wise it is
> to expose Python users to them.

What's the alternative?

Without surrogates, Python users with a UCS-2 build (e.g. the Windows
Python users) would not be allowed to play with non-BMP code points.

IMHO, it's better to fix the stdlib. This is a long process, as you
can see with the Python3 stdlib evolution, but Python will eventually
get there.
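
To make that kind of fix concrete, here is a minimal sketch of the
pairing step a surrogate-aware function needs on a UCS-2 build. The
name iter_code_points is made up for illustration and is not an
existing API. It works on a list of UTF-16 code units (plain ints)
rather than on a str, so it behaves the same on any build:

```python
def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units (ints).

    Hypothetical helper, not part of the stdlib.  A high surrogate
    followed by a low surrogate is combined into one code point;
    anything else, including a lone surrogate, is passed through.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Combine the pair into a single supplementary code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


# 'x' followed by U+10140 stored as the surrogate pair D800 DD40
units = [0x78, 0xD800, 0xDD40]
print([hex(cp) for cp in iter_code_points(units)])  # ['0x78', '0x10140']
```

A check like the one in str.center() could then count the items this
yields instead of the raw code units.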

> As Wikipedia explains, [1]
> 
> """
> Because the most commonly used characters are all in the Basic
> Multilingual Plane, converting between surrogate pairs and the
> original values is often not tested thoroughly. This leads to
> persistent bugs, and potential security holes, even in popular and
> well-reviewed application software.
> """
> 
> Since UCS-2 (the Character Encoding Form (CEF)) is now defined [2] to
> cover only BMP, maybe rather than changing the terms used in the
> reference manual, we should tighten the code to conform to the updated
> standards?

Can we please stop turning this around over and over again :-)
UCS-2 has never supported anything other than the BMP. However,
you can interpret sequences of UCS-2 code units as UTF-16 and
then get access to the full Unicode character set. We've been
doing this in codecs ever since UCS-4 builds were introduced
some 8-9 years ago.
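
As a rough illustration of the codec view (a sketch only, with
utf16_units as a made-up helper name), the UTF-16 codec can be used to
look at the 16-bit code units behind a string and to turn them back
into text:

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints.

    Hypothetical helper for illustration; it goes through the
    standard "utf-16-le" codec and unpacks the 16-bit units.
    """
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


bmp = utf16_units("\u0140")         # one code unit
astral = utf16_units("\U00010140")  # two code units (a surrogate pair)
print(len(bmp), len(astral))        # 1 2

# Reading the two 16-bit units as UTF-16 gives back the one character.
raw = struct.pack("<2H", *astral)
print(raw.decode("utf-16-le") == "\U00010140")  # True
```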

The change to have chr(i) return surrogates on UCS-2 builds
was perhaps done too early, but then, without such changes you'd
never notice that your code doesn't work well with surrogates.
It's just one piece of the puzzle when going from 8-bit strings
to Unicode.
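
For reference, the split that produces those surrogates is simple
arithmetic. The following is a sketch of the standard UTF-16 rule, not
the actual chr() implementation, and narrow_chr_units is a name
invented here:

```python
def narrow_chr_units(i):
    """Return the UTF-16 code units for code point i as a tuple.

    Hypothetical helper: it models the one-or-two code unit result
    discussed for chr(i) on a UCS-2 build.
    """
    if not 0 <= i <= 0x10FFFF:
        raise ValueError("code point not in range(0x110000)")
    if i < 0x10000:
        # BMP code point: a single code unit.
        return (i,)
    # Supplementary code point: split the 20-bit offset into two halves.
    offset = i - 0x10000
    return (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))


print(len(narrow_chr_units(0x140)))                 # 1
print([hex(u) for u in narrow_chr_units(0x10140)])  # ['0xd800', '0xdd40']
```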

> Again, given that the str object itself has at least one non-BMP
> character bug as we are closing on the third major release of py3k,
> how likely are 3rd party developers to get their libraries right as
> they port to 3.x?
> 
> [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
> [2] http://unicode.org/reports/tr17/#CharacterEncodingForm

-- 
Marc-Andre Lemburg
eGenix.com
