[Python-Dev] PEP 393 Summer of Code Project

Victor Stinner victor.stinner at haypocalc.com
Wed Aug 24 19:45:27 CEST 2011


On 24/08/2011 02:46, Terry Reedy wrote:
> On 8/23/2011 9:21 AM, Victor Stinner wrote:
>> On 23/08/2011 15:06, "Martin v. Löwis" wrote:
>>> Well, things have to be done in order:
>>> 1. the PEP needs to be approved
>>> 2. the performance bottlenecks need to be identified
>>> 3. optimizations should be applied.
>>
>> I would not vote for the PEP if it slows down Python, especially if it's
>> much slower. But Torsten says that it speeds up Python, which is
>> surprising. I have to do my own benchmarks :-)
>
> The current UCS2 Unicode string implementation, by design, quickly gives
> WRONG answers for len(), iteration, indexing, and slicing if a string
> contains any non-BMP (surrogate pair) Unicode characters. That may have
> been excusable when there essentially were no such extended chars, and
> the few there were were almost never used. But now there are many more,
> with more being added to each Unicode edition. They include cursive Math
> letters that are used in English documents today. The problem will
> slowly get worse and Python, at least on Windows, will become a language
> to avoid for dependable Unicode document processing. 3.x needs a proper
> Unicode implementation that works for all strings on all builds.

I don't think that using UTF-16 with surrogate pairs is really a big 
problem. A lot of work has been done to hide this. For example, 
repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two 
characters, and Ezio recently fixed the str.is*() methods in Python 
3.2+.
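
For example, on a narrow (UTF-16) build, chr(0x10ffff) is stored as 
two surrogate units, but repr() now hides that detail:

    >>> ch = chr(0x10ffff)
    >>> repr(ch)
    "'\\U0010ffff'"
    >>> len(ch)  # still 2 UTF-16 units on a narrow build
    2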

For len(str): it's a known problem, but if you really care about the 
number of *characters* and not the number of UTF-16 units, it's easy 
to implement your own character_length() function. len(str) returns 
the number of UTF-16 units instead of the number of characters for a 
simple reason: it's faster, O(1), whereas character_length() would be 
O(n).
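
Here is a minimal sketch of such a helper (character_length() is just 
the hypothetical name used above). On a narrow build it subtracts one 
for each surrogate pair; on a wide build the sum is simply zero:

    def character_length(s):
        # On a narrow (UTF-16) build, a high surrogate
        # (U+D800..U+DBFF) is always followed by a low surrogate,
        # so subtract one unit per pair to get the code point count.
        return len(s) - sum(1 for c in s if '\ud800' <= c <= '\udbff')

It is O(n) because it has to scan the whole string once.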

> utf16.py, attached to http://bugs.python.org/issue12729
> prototypes a different solution than the PEP for the above problems for
> the 'mostly BMP' case. I will discuss it in a different post.

Yeah, you can work around the UTF-16 limits using O(n) algorithms.
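
For example, a hypothetical O(n) replacement for str[n] on a narrow 
build can walk the string and treat each surrogate pair as one 
character (a sketch, not the actual utf16.py code):

    def codepoint_at(s, n):
        # Advance unit by unit, stepping over surrogate pairs,
        # until the n-th code point is reached: O(n) by design.
        i = 0
        for _ in range(n):
            i += 2 if '\ud800' <= s[i] <= '\udbff' else 1
        if '\ud800' <= s[i] <= '\udbff':
            return s[i:i+2]  # the full surrogate pair
        return s[i]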

PEP 393 provides support for the full Unicode charset 
(U+0000-U+10FFFF) on all platforms, with a small memory footprint and 
only O(1) functions.
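
The trick is that each string uses a fixed width chosen from its 
widest character, so indexing stays O(1). A rough Python illustration 
of how the width is chosen (the real implementation is in C, and the 
function name here is made up):

    def pep393_char_width(s):
        # PEP 393 stores 1, 2 or 4 bytes per character, depending
        # on the highest code point present in the string.
        m = max(map(ord, s)) if s else 0
        if m < 0x100:
            return 1   # Latin-1 range
        if m < 0x10000:
            return 2   # BMP
        return 4       # full Unicode range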

Note: Java and the Qt library also use UTF-16 strings and have 
exactly the same "limitations" for str[n] and len(str).

Victor

