
On 8/24/2011 1:45 PM, Victor Stinner wrote:
> On 24/08/2011 02:46, Terry Reedy wrote:
> I don't think that using UTF-16 with surrogate pairs is really a big problem. A lot of work has been done to hide this. For example, repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters. Ezio recently fixed the str.is*() methods in Python 3.2+.
I greatly appreciate that he did. The lower, upper, and title methods apparently are not fixed yet, as the corresponding new tests are currently skipped on narrow builds.
> For len(str): it's a known problem, but if you really care about the number of *characters* and not the number of UTF-16 units, it's easy to implement your own character_length() function. len(str) gives the number of UTF-16 units instead of the number of characters for a simple reason: it's faster: O(1), whereas character_length() is O(n).
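Such a character_length() is easy to write on top of the narrow-build representation. Here is one possible sketch (the function name and the explicit surrogate-range checks are my illustration, not anything from the stdlib):

```python
def character_length(s):
    """Count Unicode characters (code points) in s, even when s is
    stored as UTF-16 code units with surrogate pairs (narrow build)."""
    n = len(s)
    i = 0
    while i < len(s) - 1:
        # A lead surrogate followed by a trail surrogate is a single
        # character stored as two code units, so count it only once.
        if '\ud800' <= s[i] <= '\udbff' and '\udc00' <= s[i + 1] <= '\udfff':
            n -= 1
            i += 2
        else:
            i += 1
    return n
```

This is the O(n) scan Victor describes: one pass over the code units, versus the O(1) len().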
It is O(1) after a one-time O(n) preprocessing pass, which is the same time order as creating the string in the first place. Anyway, I think the most important deficiency is with iteration:
>>> from unicodedata import name
>>> name('\U0001043c')
'DESERET SMALL LETTER DEE'
>>> for c in 'abc\U0001043c': print(name(c))
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
Traceback (most recent call last):
  File "<pyshell#9>", line 2, in <module>
    print(name(c))
ValueError: no such name

This works on wide builds but fails here (win7) because narrow-build iteration produces a naked non-character surrogate code unit that has no entry in the Unicode Character Database. I believe that most new people who read "Strings contain Unicode characters." would expect string iteration to always produce the Unicode characters that they put into the string. The extra time per character needed to produce the surrogate pair that represents the entered character is O(1).
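An O(1)-per-character workaround for the iteration problem is a generator that re-joins surrogate pairs before yielding. A sketch (iter_characters is a hypothetical helper name; the pairing arithmetic follows the standard UTF-16 decoding formula):

```python
def iter_characters(s):
    """Yield full Unicode characters from s, recombining UTF-16
    surrogate pairs produced by narrow-build string iteration."""
    it = iter(s)
    for c in it:
        if '\ud800' <= c <= '\udbff':        # lead surrogate
            low = next(it, None)
            if low is not None and '\udc00' <= low <= '\udfff':
                # Standard UTF-16 decoding: combine lead + trail
                # into one code point above U+FFFF.
                yield chr(0x10000
                          + ((ord(c) - 0xD800) << 10)
                          + (ord(low) - 0xDC00))
                continue
            # Unpaired lead surrogate: pass it through unchanged.
            yield c
            if low is not None:
                yield low
        else:
            yield c
```

With this, for c in iter_characters(s) would hand unicodedata.name() a whole character instead of half of one.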
utf16.py, attached to http://bugs.python.org/issue12729, prototypes a different solution than the PEP's for the above problems in the 'mostly BMP' case. I will discuss it in a separate post.
> Yeah, you can work around the UTF-16 limits using O(n) algorithms.
I presented O(log(number of non-BMP chars)) algorithms for indexing and slicing. For the mostly BMP case, that is hugely better than O(n).
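The indexing idea can be illustrated as follows: record, once per string, the character indices of the non-BMP characters, then translate a character index into a code-unit offset with a binary search. This is my sketch of the approach, not the actual code in utf16.py:

```python
import bisect

class U16Index:
    """Character indexing over UTF-16 code units in O(log k),
    where k is the number of non-BMP characters in the string."""

    def __init__(self, units):
        self.units = units        # the string as UTF-16 code units
        self.nonbmp = []          # char indices of non-BMP characters
        char_i = unit_i = 0
        while unit_i < len(units):
            if '\ud800' <= units[unit_i] <= '\udbff':
                self.nonbmp.append(char_i)   # surrogate pair found
                unit_i += 2
            else:
                unit_i += 1
            char_i += 1
        self.length = char_i      # one-time O(n) preprocessing

    def __len__(self):
        return self.length        # number of characters, not units

    def __getitem__(self, i):
        # O(log k): each non-BMP char before i shifts the unit
        # offset right by one extra code unit.
        k = bisect.bisect_left(self.nonbmp, i)
        u = i + k
        c = self.units[u]
        if '\ud800' <= c <= '\udbff':
            return self.units[u:u + 2]   # the whole surrogate pair
        return c
```

When the string is mostly BMP, k is tiny, so the per-access cost stays close to a plain O(1) subscript.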
> PEP 393 provides support for the full Unicode character set (U+0000-U+10FFFF) on all platforms with a small memory footprint and only O(1) operations.
For Windows users, I believe it will nearly double the memory footprint if there are any non-BMP chars. On my new machine, I should not mind that in exchange for correct behavior. -- Terry Jan Reedy