
On 24/08/2011 02:46, Terry Reedy wrote:
On 8/23/2011 9:21 AM, Victor Stinner wrote:
On 23/08/2011 15:06, "Martin v. Löwis" wrote:
Well, things have to be done in order: 1. the PEP needs to be approved; 2. the performance bottlenecks need to be identified; 3. optimizations should be applied.
I would not vote for the PEP if it slows down Python, especially if it's much slower. But Torsten says that it speeds up Python, which is surprising. I have to do my own benchmarks :-)
The current UCS2 Unicode string implementation, by design, quickly gives WRONG answers for len(), iteration, indexing, and slicing if a string contains any non-BMP (surrogate pair) Unicode characters. That may have been excusable when there were essentially no such extended chars, and the few that existed were almost never used. But now there are many more, with more added in each Unicode edition. They include the cursive math letters that are used in English documents today. The problem will slowly get worse, and Python, at least on Windows, will become a language to avoid for dependable Unicode document processing. 3.x needs a proper Unicode implementation that works for all strings on all builds.
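A minimal illustration of what a narrow (UTF-16) build reports for a single non-BMP character, one of those cursive math letters (the results shown assume a narrow 3.2 build, such as the python.org Windows installer; a wide UCS-4 build gives the expected answers):

    s = '\U0001D49C'     # MATHEMATICAL SCRIPT CAPITAL A, a non-BMP character
    print(len(s))        # narrow build: 2 (UTF-16 code units); wide build: 1
    print(repr(s[0]))    # narrow build: '\ud835', a lone high surrogate
    print(s[:1] == s)    # narrow build: False -- the slice splits the pair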
I don't think that using UTF-16 with surrogate pairs is really a big problem. A lot of work has been done to hide this. For example, repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters. Ezio recently fixed the str.is*() methods in Python 3.2+. For len(str): it's a known problem, but if you really care about the number of *characters* and not the number of UTF-16 units, it's easy to implement your own character_length() function. len(str) gives the number of UTF-16 units instead of the number of characters for a simple reason: it is faster, O(1), whereas character_length() is O(n).
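Something along these lines would do as a sketch of such a character_length() (the name is just the one used above, not an existing API; it assumes surrogates only appear as proper pairs in the string):

    def character_length(s):
        # Count code points rather than UTF-16 code units: on a narrow
        # build each non-BMP character is stored as a high+low surrogate
        # pair, so skipping the low (trailing) surrogates gives the number
        # of characters.  O(n), whereas len(s) is O(1).
        return sum(1 for ch in s if not '\udc00' <= ch <= '\udfff')

With s = '\U0001D49C' from the example above, character_length('a' + s) gives 2 on both build types, while len('a' + s) gives 3 on a narrow build and 2 on a wide one.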
utf16.py, attached to http://bugs.python.org/issue12729, prototypes a different solution than the PEP's to the above problems for the 'mostly BMP' case. I will discuss it in a separate post.
Yeah, you can work around the UTF-16 limits using O(n) algorithms. PEP 393 provides support for the full Unicode character set (U+0000-U+10FFFF) on all platforms, with a small memory footprint and only O(1) operations. Note: Java and the Qt library also use UTF-16 strings and have exactly the same "limitations" for str[n] and len(str).

Victor