[Python-Dev] PEP 393 Summer of Code Project
Terry Reedy
tjreedy at udel.edu
Wed Aug 24 22:37:21 CEST 2011
On 8/24/2011 1:45 PM, Victor Stinner wrote:
> Le 24/08/2011 02:46, Terry Reedy a écrit :
> I don't think that using UTF-16 with surrogate pairs is really a big
> problem. A lot of work has been done to hide this. For example,
> repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters.
> Ezio recently fixed the str.is*() methods in Python 3.2+.
I greatly appreciate that he did. The * (lower,upper,title) methods
apparently are not fixed yet as the corresponding new tests are
currently skipped for narrow builds.
> For len(str): it's a known problem, but if you really care about the
> number of *characters* and not the number of UTF-16 units, it's easy to
> implement your own character_length() function. len(str) gives the
> number of UTF-16 units instead of the number of characters for a simple
> reason: it's faster: O(1), whereas character_length() is O(n).
It is O(1) after a one-time O(n) preprocessing pass, which is the same
time order as creating the string in the first place.
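A minimal sketch of such a function, using the hypothetical name
character_length() from Victor's message (the utf16_units() helper just
simulates what a narrow build stores, so the example runs on any build):

```python
def utf16_units(s):
    """Return the UTF-16 code units a narrow build would store for s."""
    data = s.encode('utf-16-le')
    return [int.from_bytes(data[i:i + 2], 'little')
            for i in range(0, len(data), 2)]

def character_length(s):
    """O(n) character count over UTF-16 units.

    Low surrogates (0xDC00-0xDFFF) are always the second half of a
    pair, so skipping them counts each non-BMP character exactly once.
    """
    return sum(1 for u in utf16_units(s) if not 0xDC00 <= u <= 0xDFFF)
```

For 'abc\U0001043c' this returns 4, while a narrow build's len() would
report 5 units.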
Anyway, I think the most important deficiency is with iteration:
>>> from unicodedata import name
>>> name('\U0001043c')
'DESERET SMALL LETTER DEE'
>>> for c in 'abc\U0001043c':
	print(name(c))

LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
Traceback (most recent call last):
  File "<pyshell#9>", line 2, in <module>
    print(name(c))
ValueError: no such name
This works on a wide build but fails here (win7, narrow build) because
narrow-build iteration produces a lone surrogate code unit, which has no
name entry in the Unicode Character Database.
I believe that most new people who read "Strings contain Unicode
characters." would expect string iteration to always produce the Unicode
characters that they put in the string. The extra time per character
needed to reassemble a surrogate pair into the character entered is
O(1).
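Character-at-a-time iteration over narrow-build storage can be sketched
as a generator that pairs surrogates back up; iter_characters() is an
illustrative name, and the UTF-16-LE encode just simulates narrow-build
units so this runs on any build:

```python
def iter_characters(s):
    """Yield one character per code point from UTF-16 code units."""
    data = s.encode('utf-16-le')          # simulated narrow-build storage
    units = [int.from_bytes(data[i:i + 2], 'little')
             for i in range(0, len(data), 2)]
    it = iter(units)
    for u in it:
        if 0xD800 <= u <= 0xDBFF:         # high surrogate: pair it up
            low = next(it)
            yield chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
        else:
            yield chr(u)
```

With this, the failing loop above would print DESERET SMALL LETTER DEE
for the fourth character instead of raising ValueError.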
>> utf16.py, attached to http://bugs.python.org/issue12729
>> prototypes a different solution than the PEP for the above problems for
>> the 'mostly BMP' case. I will discuss it in a different post.
>
> Yeah, you can workaround UTF-16 limits using O(n) algorithms.
I presented O(log(number of non-BMP chars)) algorithms for indexing and
slicing. For the mostly BMP case, that is hugely better than O(n).
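The idea can be sketched with bisect: a one-time O(n) scan records the
character positions of the non-BMP characters, after which indexing
costs O(log k) for k such characters. This WideIndex class is a
hypothetical illustration of that approach, not the actual utf16.py
prototype attached to issue 12729:

```python
from bisect import bisect_left

class WideIndex:
    """O(log k) character indexing over UTF-16 units (sketch)."""

    def __init__(self, s):
        data = s.encode('utf-16-le')      # simulated narrow-build storage
        self.units = [int.from_bytes(data[i:i + 2], 'little')
                      for i in range(0, len(data), 2)]
        # One-time O(n) scan: character index of each non-BMP character.
        self.astral = []
        ci = 0
        it = iter(self.units)
        for u in it:
            if 0xD800 <= u <= 0xDBFF:     # high surrogate starts a pair
                self.astral.append(ci)
                next(it)                  # skip its low surrogate
            ci += 1
        self.length = ci

    def __len__(self):
        return self.length                # characters, not units

    def __getitem__(self, i):
        # Character i starts after i BMP-sized slots plus one extra
        # unit per preceding non-BMP character, found by binary search.
        j = i + bisect_left(self.astral, i)
        u = self.units[j]
        if 0xD800 <= u <= 0xDBFF:
            low = self.units[j + 1]
            return chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
        return chr(u)
```

When k is small (the mostly-BMP case), the bisect over self.astral is
far cheaper than rescanning the string.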
> PEP 393 provides support for the full Unicode charset (U+0000-U+10FFFF)
> on all platforms with a small memory footprint and only O(1) functions.
For Windows users, I believe it will nearly double the memory footprint
if there are any non-BMP chars. On my new machine, I should not mind
that in exchange for correct behavior.
--
Terry Jan Reedy