[Python-3000] How will unicode get used?

Sat Sep 23 14:09:00 CEST 2006

David Hopwood wrote:
>> Assuming my Unicode lingo is right and code point represents a
>> letter/character/digraph/whatever, then it will be a code point.  Doing one
>> of my rare channels of Guido, I *really* doubt he wants to expose the
>> technical details of Unicode to the point of having people need to realize
>> that UTF-8 takes two bytes to represent "ö".
> 
> The argument used here is not valid. People do need to realize that *all*
> Unicode encodings are variable-length, in the sense that abstract characters
> can be represented by multiple code points.

Brett did not make such an argument. He argued that users should not
need to care that "ö" in UTF-8 is two bytes. And I agree: users should
not have to worry about this with respect to the internal representation.
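
To make the point concrete, here is a minimal sketch, assuming a Python
whose str type is indexed by code point (as Python 3 later turned out to
be). The variable names are made up for illustration. The byte count
only shows up once the string is explicitly encoded:

```python
# "ö" written as the single precomposed code point U+00F6.
s = "\u00f6"

# Encoding is an explicit step; only here does the UTF-8 width matter.
encoded = s.encode("utf-8")

length_in_code_points = len(s)    # counts code points, not bytes
length_in_bytes = len(encoded)    # counts UTF-8 bytes
```

Code that only works with s never needs to know the second number.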

> For example, "ö" can be represented either as the precomposed character U+00F6,
> or as "o" followed by a combining diaeresis (U+006F U+0308). Programs must
> avoid splitting sequences of code points that represent a single abstract
> character.

Why is that? Many programs never encounter cases where this would
matter, so why should such programs have to operate correctly if that
case were encountered?
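
For reference, the case under discussion can be sketched as follows,
using the standard unicodedata module. The variable names are
illustrative only, and the sketch assumes strings indexed by code point:

```python
import unicodedata

precomposed = "\u00f6"    # U+00F6, one code point
decomposed = "o\u0308"    # U+006F followed by combining diaeresis U+0308

# The two spellings compare unequal as code point sequences ...
equal_raw = precomposed == decomposed

# ... but NFC normalization composes the pair into U+00F6.
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Slicing between the base letter and the combining mark splits the
# abstract character: only the bare "o" is left.
first_code_point = decomposed[:1]
```

A program that only ever sees precomposed input never takes the
decomposed path at all.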

> It simply is not possible to do correct string processing in Unicode that
> will "work the way [programmers] are used to when compared to working in ASCII".

Brett didn't say that this was a goal.

> Should we nevertheless try to avoid making the use of Unicode strings
> unnecessarily difficult for people who have minimal knowledge of Unicode?
> Absolutely, but not at the expense of making basic operations on strings
> asymptotically less efficient. O(1) indexing and slicing is a basic
> requirement, even if it has to be done using code units.

It's not possible to implement slicing in constant time, unless string
views are introduced. Currently, slicing takes time linear in the
length of the result string, because the result is a copy.
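
The difference between a copying slice and a view can be sketched with
byte buffers, where Python does offer a view type (memoryview). This is
only an analogy: no such view exists for strings, and the names below
are made up for illustration.

```python
buf = bytearray(b"hello world")

# Ordinary slice: allocates a new object and copies 5 bytes.
copied = buf[0:5]

# View slice: shares the underlying buffer, no bytes are copied.
view = memoryview(buf)[0:5]

# Mutating the original shows which of the two shares storage.
buf[0] = ord("J")

copied_bytes = bytes(copied)   # unaffected by the mutation
view_bytes = bytes(view)       # reflects the mutation
```

The view avoids the linear copy, at the price of keeping the whole
underlying buffer alive for as long as the view exists.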

Regards,
Martin