
On 8/24/2011 1:45 PM, Victor Stinner wrote:
> On 24/08/2011 02:46, Terry Reedy wrote:
> I don't think that using UTF-16 with surrogate pairs is really a big problem. A lot of work has been done to hide this. For example, repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters. Ezio recently fixed the str.is*() methods in Python 3.2+.
I greatly appreciate that he did. The lower, upper, and title methods apparently are not fixed yet, as the corresponding new tests are currently skipped on narrow builds.
> For len(str): it's a known problem, but if you really care about the number of *characters* and not the number of UTF-16 units, it's easy to implement your own character_length() function. len(str) gives the number of UTF-16 units instead of the number of characters for a simple reason: it's faster: O(1), whereas character_length() is O(n).
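Such a character_length() is easy to write on top of the narrow-build representation. Here is one possible sketch (the function name and the explicit surrogate-range checks are my illustration, not anything from the stdlib):

```python
def character_length(s):
    """Count Unicode characters (code points) in s, even when s is
    stored as UTF-16 code units with surrogate pairs (narrow build)."""
    n = len(s)
    i = 0
    while i < len(s) - 1:
        # A lead surrogate followed by a trail surrogate is a single
        # character stored as two code units, so count it only once.
        if '\ud800' <= s[i] <= '\udbff' and '\udc00' <= s[i + 1] <= '\udfff':
            n -= 1
            i += 2
        else:
            i += 1
    return n
```

This is the O(n) scan Victor describes: one pass over the code units, versus the O(1) len().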
It is O(1) after a one-time O(n) preprocessing pass, which is the same time order as creating the string in the first place. Anyway, I think the most important deficiency is with iteration:
>>> from unicodedata import name
>>> name('\U0001043c')
'DESERET SMALL LETTER DEE'
>>> for c in 'abc\U0001043c': print(name(c))
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
Traceback (most recent call last):
  File "<pyshell#9>", line 2, in <module>
    print(name(c))
ValueError: no such name

This works on wide builds but fails here (win7) because narrow-build iteration produces a naked non-character surrogate code unit that has no entry in the Unicode Character Database. I believe that most new people who read "Strings contain Unicode characters." would expect string iteration to always produce the Unicode characters that they put into the string. The extra time per character needed to produce the surrogate pair that represents the entered character is O(1).
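An O(1)-per-character workaround for the iteration problem is a generator that re-joins surrogate pairs before yielding. A sketch (iter_characters is a hypothetical helper name; the pairing arithmetic follows the standard UTF-16 decoding formula):

```python
def iter_characters(s):
    """Yield full Unicode characters from s, recombining UTF-16
    surrogate pairs produced by narrow-build string iteration."""
    it = iter(s)
    for c in it:
        if '\ud800' <= c <= '\udbff':        # lead surrogate
            low = next(it, None)
            if low is not None and '\udc00' <= low <= '\udfff':
                # Standard UTF-16 decoding: combine lead + trail
                # into one code point above U+FFFF.
                yield chr(0x10000
                          + ((ord(c) - 0xD800) << 10)
                          + (ord(low) - 0xDC00))
                continue
            # Unpaired lead surrogate: pass it through unchanged.
            yield c
            if low is not None:
                yield low
        else:
            yield c
```

With this, for c in iter_characters(s) would hand unicodedata.name() a whole character instead of half of one.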
utf16.py, attached to http://bugs.python.org/issue12729, prototypes a different solution than the PEP's for the above problems in the 'mostly BMP' case. I will discuss it in a separate post.
> Yeah, you can work around the UTF-16 limits using O(n) algorithms.
I presented O(log(number of non-BMP chars)) algorithms for indexing and slicing. For the mostly BMP case, that is hugely better than O(n).
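The indexing idea can be illustrated as follows: record, once per string, the character indices of the non-BMP characters, then translate a character index into a code-unit offset with a binary search. This is my sketch of the approach, not the actual code in utf16.py:

```python
import bisect

class U16Index:
    """Character indexing over UTF-16 code units in O(log k),
    where k is the number of non-BMP characters in the string."""

    def __init__(self, units):
        self.units = units        # the string as UTF-16 code units
        self.nonbmp = []          # char indices of non-BMP characters
        char_i = unit_i = 0
        while unit_i < len(units):
            if '\ud800' <= units[unit_i] <= '\udbff':
                self.nonbmp.append(char_i)   # surrogate pair found
                unit_i += 2
            else:
                unit_i += 1
            char_i += 1
        self.length = char_i      # one-time O(n) preprocessing

    def __len__(self):
        return self.length        # number of characters, not units

    def __getitem__(self, i):
        # O(log k): each non-BMP char before i shifts the unit
        # offset right by one extra code unit.
        k = bisect.bisect_left(self.nonbmp, i)
        u = i + k
        c = self.units[u]
        if '\ud800' <= c <= '\udbff':
            return self.units[u:u + 2]   # the whole surrogate pair
        return c
```

When the string is mostly BMP, k is tiny, so the per-access cost stays close to a plain O(1) subscript.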
> PEP 393 provides support for the full Unicode character set (U+0000-U+10FFFF) on all platforms with a small memory footprint and only O(1) operations.
For Windows users, I believe it will nearly double the memory footprint if there are any non-BMP chars. On my new machine, I should not mind that in exchange for correct behavior. -- Terry Jan Reedy