[Python-Dev] UCS2/UCS4 default

Thu Jul 3 18:45:39 CEST 2008

On Jul 3, 2008, at 10:46 AM, Jeroen Ruigrok van der Werven wrote:

> -On [20080703 15:58], Guido van Rossum (guido at python.org) wrote:
>> You seem to be suggesting that len(u"\U00012345") should return 1 on
>> a system that internally uses UTF-16 and hence represents this string
>> as a surrogate pair.
>
> From a Unicode and UTF-16 point of view that makes the most sense.
> So yes, I am suggesting that.

I think this is misguided.

IMO, basically every programming language gets string handling wrong.
(Maybe with the exception of the unreleased Perl 6? It had some
interesting moves in this area, but I haven't really been paying
attention.) Everyone treats strings as arrays, but they are used quite
differently. For a string, there is hardly ever a time when a
programmer needs to index it with an arbitrary offset counted in
codepoints, and the length-in-codepoints is pretty non-useful as well.
Constant-time access to arbitrary codepoints in a string is pretty
much unimportant. What *is* of utmost importance is constant-time
access to previously-returned points in the string.

I'd like to have 3 levels of access available:
1) "byte"-level. In a new implementation I'd probably choose to make  
all my strings stored in UTF-8, but UTF-16 is fine too.
2) codepoint-level.
3) grapheme-level.

You should be able to iterate over the string at any of the levels,  
ask for the nearest codepoint/grapheme boundary to the left or right  
of an index at a different level, etc.
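
To make the three levels concrete, here is a rough sketch of how the "length" of one string differs at each level. It assumes Python 3.3+ (where str is indexed by codepoint), and it approximates graphemes as "a base character plus any following combining marks", which is much cruder than the real segmentation rules.

```python
import unicodedata

# "A" + COMBINING TILDE, followed by one codepoint outside the BMP.
s = "A\N{COMBINING TILDE}\U00012345"

# Level 1: storage units. In UTF-16 every unit is 2 bytes.
utf16_units = len(s.encode("utf-16-le")) // 2
utf8_bytes = len(s.encode("utf-8"))

# Level 2: codepoints (what len() reports on Python 3.3+).
codepoints = len(s)

# Level 3: graphemes, approximated by counting the codepoints that are
# not combining marks. This is an assumption, not a full algorithm.
graphemes = sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(utf16_units, utf8_bytes, codepoints, graphemes)
```

None of the four numbers is "the" length; which one you want depends on the level you are working at.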

Python could probably still be made to work kinda like this. I think a  
language designed as such in the first place could be nicer, with  
opaque index objects into the string rather than integers, and such,  
but...whatever.

Let's assume Python is changed to always store strings in UTF-16.

All it would take is adding a few more functions to the str object to  
operate on the higher levels. Wherever I say "pos" I mean an integer  
index into the string, at the UTF-16 level. That may sometimes be  
unaligned with the boundary of the representation you're asking about,  
and behavior in that case needs to be specified as well.

.nextcodepoint(curpos, how_many=1) -> returns an index into the string  
how_many codepoints to the right (or left if negative) of the index  
curpos.
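
A minimal sketch of what nextcodepoint could look like, written as a free function over a list of UTF-16 code units instead of a str method. The helper names (utf16_units, is_low_surrogate) are made up for illustration. It assumes well-formed UTF-16, and it resolves an unaligned curpos by moving to the nearest boundary in the direction of travel, which is only one of several possible choices.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_low_surrogate(unit):
    # A low (trailing) surrogate is the second half of a surrogate pair.
    return 0xDC00 <= unit <= 0xDFFF

def nextcodepoint(units, curpos, how_many=1):
    """Move how_many codepoints right (or left if negative) of curpos."""
    pos = curpos
    step = 1 if how_many > 0 else -1
    for _ in range(abs(how_many)):
        pos += step
        # Never stop in the middle of a surrogate pair.
        while 0 < pos < len(units) and is_low_surrogate(units[pos]):
            pos += step
        # Clamp to the ends of the string.
        pos = max(0, min(pos, len(units)))
    return pos

units = utf16_units("a\U00012345b")   # 4 code units: a, high, low, b
print(nextcodepoint(units, 1))        # steps over the whole pair
```

The returned integer is exactly the kind of "previously-returned point" that can then be reused for constant-time access.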

.nextgrapheme(curpos, how_many=1) -> returns an index into the string  
how_many graphemes to the right (or left if negative) of the index  
curpos.
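
The grapheme version can be sketched the same way. To keep it short, this one indexes a Python 3 str by codepoint rather than by UTF-16 position, and it assumes a grapheme is a base character followed by combining marks. A real implementation would follow the Unicode text segmentation rules instead; is_grapheme_start is a hypothetical helper.

```python
import unicodedata

def is_grapheme_start(text, pos):
    # Both ends of the string count as boundaries. Otherwise a boundary
    # is any position whose character is not a combining mark.
    if pos <= 0 or pos >= len(text):
        return True
    return unicodedata.combining(text[pos]) == 0

def nextgrapheme(text, curpos, how_many=1):
    """Move how_many (approximate) graphemes right, or left if negative."""
    pos = curpos
    step = 1 if how_many > 0 else -1
    for _ in range(abs(how_many)):
        pos += step
        while not is_grapheme_start(text, pos):
            pos += step
        pos = max(0, min(pos, len(text)))
    return pos

text = "A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}B"
print(nextgrapheme(text, 0))      # skips A plus both combining marks
print(nextgrapheme(text, 3, -1))  # back to the start of that grapheme
```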

.codepoints(from_pos=0, to_pos=None) -> returns an iterator of
codepoints from 'from_pos' to 'to_pos'. I think codepoints could be
represented as strings themselves (so usually one-character strings,
sometimes two-character strings when the codepoint is a surrogate
pair).
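
A sketch of the codepoints iterator, operating on raw UTF-16-LE bytes with positions counted in 16-bit code units. It assumes the data is well-formed and that from_pos and to_pos sit on codepoint boundaries; behavior for unaligned positions is left unspecified here.

```python
def codepoints(data, from_pos=0, to_pos=None):
    """Yield each codepoint of UTF-16-LE bytes as its own string.

    Positions are counted in 16-bit code units, not bytes.
    """
    if to_pos is None:
        to_pos = len(data) // 2
    pos = from_pos
    while pos < to_pos:
        unit = int.from_bytes(data[2 * pos:2 * pos + 2], "little")
        # A high (leading) surrogate means this codepoint takes two units.
        width = 2 if 0xD800 <= unit <= 0xDBFF else 1
        yield data[2 * pos:2 * (pos + width)].decode("utf-16-le")
        pos += width

data = "a\U00012345b".encode("utf-16-le")
print(list(codepoints(data)))
```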

.graphemes(from_pos=0, to_pos=None) -> return an iterator of graphemes  
from 'from_pos' to 'to_pos'. Also could be represented by strings. The  
returned graphemes should probably be normalized.
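
And a sketch of the graphemes iterator, again with the simplifying assumptions that positions are codepoint indices into a Python 3 str and that a grapheme is a base character plus its combining marks. NFC is my choice of normalization form for the example; the proposal above does not pick one.

```python
import unicodedata

def graphemes(text, from_pos=0, to_pos=None):
    """Yield approximate graphemes of text[from_pos:to_pos], NFC-normalized."""
    if to_pos is None:
        to_pos = len(text)
    start = from_pos
    for pos in range(from_pos + 1, to_pos + 1):
        # Close the current grapheme at the end of the range, or when the
        # next character is a new base character (not a combining mark).
        if pos == to_pos or unicodedata.combining(text[pos]) == 0:
            yield unicodedata.normalize("NFC", text[start:pos])
            start = pos

text = "A\N{COMBINING TILDE}e\N{COMBINING ACUTE ACCENT}x"
print(list(graphemes(text)))
```

With normalization, the two-codepoint sequences come back as their precomposed single-codepoint forms where one exists.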

There are a few more desirable operations, to manipulate strings at
the grapheme level. Codepoints encoded in UTF-8 or UTF-16 have the
nice property that no encoded codepoint is a prefix of another, but
graphemes don't: a grapheme can start with a shorter sequence that is
itself a valid grapheme. So, you want a find (and everything else that
implicitly does a find operation, like split, replace, strip, etc.)
which requires that both endpoints of its match are on a grapheme
boundary. [[Probably the easiest way to implement this would be in
the regexp engine.]]

A concrete example of that:
u'A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}'.find(u'A\N{COMBINING TILDE}')
returns 0. But you want a way to ask for only an *actual* "A with
tilde", not an "A with tilde and macron".
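
Here is a rough sketch of such a find in plain Python rather than in the regexp engine. find_grapheme and is_boundary are hypothetical names, and the boundary test is the same crude "not a combining mark" approximation, so treat it as an illustration of the idea only.

```python
import unicodedata

def is_boundary(text, pos):
    # Approximation: a boundary is either end of the string, or any
    # position whose character is not a combining mark.
    if pos == 0 or pos == len(text):
        return True
    return unicodedata.combining(text[pos]) == 0

def find_grapheme(text, sub, start=0):
    """Like str.find, but both ends of the match must be boundaries."""
    pos = text.find(sub, start)
    while pos != -1:
        if is_boundary(text, pos) and is_boundary(text, pos + len(sub)):
            return pos
        pos = text.find(sub, pos + 1)
    return -1

a_tilde = "A\N{COMBINING TILDE}"
both = "A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}"

print(both.find(a_tilde))                          # plain find matches a prefix
print(find_grapheme(both, a_tilde))                # no whole-grapheme match
print(find_grapheme(both + " " + a_tilde + "!", a_tilde))
```

The plain find matches inside the longer grapheme, while the boundary-checked version only accepts the later, standalone "A with tilde".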

Anyhow, I'm not going to tackle this issue or try to push it further,
but if someone does tackle it, Python could grow to have the best
Unicode support available. :)

James