[Python-Dev] UCS2/UCS4 default
James Y Knight
foom at fuhm.net
Thu Jul 3 18:45:39 CEST 2008
On Jul 3, 2008, at 10:46 AM, Jeroen Ruigrok van der Werven wrote:
> -On [20080703 15:58], Guido van Rossum (guido at python.org) wrote:
>> You seem to be suggesting that len(u"\U00012345") should return 1 on
>> a system that internally uses UTF-16 and hence represents this string
>> as a surrogate pair.
>
> From a Unicode and UTF-16 point of view that makes the most sense.
> So yes, I
> am suggesting that.
I think this is misguided.
IMO, basically every programming language gets string handling wrong
(maybe with the exception of the unreleased Perl 6? It had some
interesting moves in this area, but I haven't really been paying
attention). Everyone treats strings as arrays, but they are used quite
differently. For a string, there is hardly ever a time when a
programmer needs to index it with an arbitrary offset in number of
codepoints, and the length-in-codepoints is pretty non-useful as well.
Constant-time access to arbitrary codepoints in a string is pretty
much unimportant. What *is* of utmost importance is constant-time
access to previously-returned points in the string.
I'd like to have 3 levels of access available:
1) "byte"-level. In a new implementation I'd probably choose to make
all my strings stored in UTF-8, but UTF-16 is fine too.
2) codepoint-level.
3) grapheme-level.
You should be able to iterate over the string at any of the levels,
ask for the nearest codepoint/grapheme boundary to the left or right
of an index at a different level, etc.
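To make the three levels concrete, here is a rough sketch in modern
Python 3 (an assumption; this thread predates it). The grapheme count is
only an approximation that treats every non-mark codepoint as the start
of a new grapheme; real segmentation has more rules than that.

```python
import unicodedata

# One "A with tilde" grapheme followed by one non-BMP character.
s = "A\N{COMBINING TILDE}\U00012345"

# Level 1: UTF-16 code units (the non-BMP character needs a surrogate pair).
data = s.encode("utf-16-le")
code_units = [int.from_bytes(data[i:i + 2], "little")
              for i in range(0, len(data), 2)]

# Level 2: codepoints.
codepoints = [ord(ch) for ch in s]

# Level 3: graphemes, approximated as "a non-mark codepoint starts a new one".
grapheme_count = sum(
    1 for ch in s if not unicodedata.category(ch).startswith("M")
)

print(len(code_units), len(codepoints), grapheme_count)  # 4 3 2
```

The same string has three different lengths, which is why a single
len() cannot give everyone the answer they expect.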
Python could probably still be made to work kinda like this. I think a
language designed as such in the first place could be nicer, with
opaque index objects into the string rather than integers, and such,
but...whatever.
Let's assume Python is changed to always store strings in UTF-16.
All it would take is adding a few more functions to the str object to
operate on the higher levels. Wherever I say "pos" I mean an integer
index into the string, at the UTF-16 level. That may sometimes be
unaligned with the boundary of the representation you're asking about,
and behavior in that case needs to be specified as well.
.nextcodepoint(curpos, how_many=1) -> returns an index into the string
how_many codepoints to the right (or left if negative) of the index
curpos.
.nextgrapheme(curpos, how_many=1) -> returns an index into the string
how_many graphemes to the right (or left if negative) of the index
curpos.
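As a rough illustration of these two functions, here is a sketch in
Python 3 that works on a plain list of UTF-16 code units instead of a
real str method. The helper names, the clamping at the string ends, and
the "base character plus following combining marks" notion of a grapheme
are all my assumptions for the example, not a specification. It also
assumes curpos is already aligned.

```python
import struct
import unicodedata

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def _is_high(unit):
    return 0xD800 <= unit <= 0xDBFF

def _is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF

def nextcodepoint(units, curpos, how_many=1):
    """Move how_many codepoints right (left if negative), clamped at the ends."""
    pos = curpos
    step = 1 if how_many > 0 else -1
    for _ in range(abs(how_many)):
        if not 0 <= pos + step <= len(units):
            break
        pos += step
        # Landed between the halves of a surrogate pair: step over it.
        if (0 < pos < len(units) and _is_low(units[pos])
                and _is_high(units[pos - 1])):
            pos += step
    return pos

def _starts_with_mark(units, pos):
    """True if the codepoint starting at pos is a combining mark."""
    unit = units[pos]
    if _is_high(unit) and pos + 1 < len(units) and _is_low(units[pos + 1]):
        cp = 0x10000 + ((unit - 0xD800) << 10) + (units[pos + 1] - 0xDC00)
    else:
        cp = unit
    return unicodedata.category(chr(cp)).startswith("M")

def nextgrapheme(units, curpos, how_many=1):
    """Move how_many (approximate) graphemes right, or left if negative."""
    pos = curpos
    step = 1 if how_many > 0 else -1
    for _ in range(abs(how_many)):
        nxt = nextcodepoint(units, pos, step)
        if nxt == pos:
            break
        pos = nxt
        # A position directly before a combining mark is not a boundary.
        while 0 < pos < len(units) and _starts_with_mark(units, pos):
            pos = nextcodepoint(units, pos, step)
    return pos

units = utf16_units("A\u0303\u0331b\U00012345")
print(nextcodepoint(units, 4))  # 6: skips the whole surrogate pair
print(nextgrapheme(units, 0))   # 3: skips "A" plus both combining marks
```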
.codepoints(from_pos=0, to_pos=None) -> returns an iterator of
codepoints from 'from_pos' to 'to_pos'. I think codepoints could be
represented as strings themselves (so usually one-character, sometimes
two-character strings).
.graphemes(from_pos=0, to_pos=None) -> returns an iterator of graphemes
from 'from_pos' to 'to_pos'. Also could be represented by strings. The
returned graphemes should probably be normalized.
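A minimal sketch of the two iterators, again in Python 3 and as free
functions rather than str methods. The positions are UTF-16 code unit
indexes as described above. Using NFC as the normalization, and grouping
graphemes as "base character plus following combining marks", are my
assumptions for the example; a real implementation would need proper
segmentation rules.

```python
import struct
import unicodedata

def codepoints(text, from_pos=0, to_pos=None):
    """Yield each codepoint between two UTF-16 positions as a string."""
    data = text.encode("utf-16-le", "surrogatepass")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    if to_pos is None:
        to_pos = len(units)
    pos = from_pos
    while pos < to_pos:
        width = 1
        # A high surrogate followed by a low surrogate is one codepoint.
        if (0xD800 <= units[pos] <= 0xDBFF and pos + 1 < to_pos
                and 0xDC00 <= units[pos + 1] <= 0xDFFF):
            width = 2
        chunk = struct.pack("<%dH" % width, *units[pos:pos + width])
        yield chunk.decode("utf-16-le", "surrogatepass")
        pos += width

def graphemes(text, from_pos=0, to_pos=None):
    """Yield approximate graphemes, each normalized to NFC."""
    cluster = ""
    for cp in codepoints(text, from_pos, to_pos):
        # A non-mark codepoint closes the previous cluster and starts a new one.
        if cluster and not unicodedata.category(cp).startswith("M"):
            yield unicodedata.normalize("NFC", cluster)
            cluster = ""
        cluster += cp
    if cluster:
        yield unicodedata.normalize("NFC", cluster)

print(list(codepoints("a\U00012345b", 1, 3)))  # just the non-BMP character
print(len(list(graphemes("A\u0303bx\u0331"))))  # 3
```

On a build that stores UTF-16 internally, the strings yielded by
codepoints() would be one or two code units long; in this sketch each
one is a single Python 3 character.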
There are a few more desirable operations, to manipulate strings at
the grapheme level (because unlike for UTF-8/UTF-16 codepoints,
graphemes don't have the nice property of not containing prefixes
which are themselves valid graphemes). So, you want a find (and
everything else that implicitly does a find operation, like split,
replace, strip, etc) which requires that both endpoints of its match
are on a grapheme-boundary. [[Probably the easiest way to implement
this would be in the regexp engine.]]
A concrete example of that:
u'A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}'.find(u'A\N{COMBINING TILDE}')
returns 0. But you want a way to ask for only an *actual* "A with tilde",
not an "A with tilde and macron".
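Here is one way such a boundary-aware find could look, sketched in
Python 3. The name grapheme_find is hypothetical, and the boundary test
is the crude "is the next codepoint a combining mark?" check rather than
full grapheme segmentation. It indexes by Python 3 codepoints, not by
UTF-16 position.

```python
import unicodedata

def is_mark(ch):
    """True if ch is a combining mark (general category M*)."""
    return unicodedata.category(ch).startswith("M")

def grapheme_find(text, sub, start=0):
    """Like str.find, but both ends of the match must not split a grapheme."""
    pos = text.find(sub, start)
    while pos != -1:
        end = pos + len(sub)
        starts_ok = pos == len(text) or not is_mark(text[pos])
        ends_ok = end == len(text) or not is_mark(text[end])
        if starts_ok and ends_ok:
            return pos
        # This match cut a grapheme in half; look for the next candidate.
        pos = text.find(sub, pos + 1)
    return -1

text = "A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}"
needle = "A\N{COMBINING TILDE}"
print(text.find(needle))            # 0: plain find matches the prefix
print(grapheme_find(text, needle))  # -1: no standalone "A with tilde" here
```

Split, replace, strip and friends could be built on the same check.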
Anyhow, I'm not going to tackle this issue or try to push it further,
but if someone does tackle it, Python could grow to have the best
Unicode support available. :)
James