[Python-3000] How will unicode get used?

Thu Sep 21 03:09:24 CEST 2006

Brett Cannon wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
>> On 9/20/06, Guido van Rossum <guido at python.org> wrote:
>> > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
>> > >
>> > > Before we can decide on the internal representation of our unicode
>> > > objects, we need to decide on their external interface.  My thoughts
>> > > so far:
>> >
>> > Let me cut this short. The external string API in Py3k should not
>> > change or only very marginally so (like removing rarely used useless
>> > APIs or adding a few new conveniences). The plan is to keep the 2.x
>> > API that is supported (in 2.x) by both str and unicode, but merge the
>> > two string types into one. Anything else could be done just as easily
>> > before or after Py3k.
>>
>> Thanks, but one thing remains unclear: is the indexing intended to
>> represent bytes, code points, or code units?  Note that C code
>> operating on UTF-16 would use code units for slicing of UTF-16, which
>> splits surrogate pairs.
> 
> Assuming my Unicode lingo is right and code point represents a
> letter/character/digraph/whatever, then it will be a code point.  Doing one
> of my rare channels of Guido, I *really* doubt he wants to expose the
> technical details of Unicode to the point of having people need to realize
> that UTF-8 takes two bytes to represent "ö".

That argument is not valid. People do need to realize that *all* Unicode
encodings are variable-length, in the sense that a single abstract character
can be represented by a sequence of several code points.

For example, "ö" can be represented either as the precomposed character U+00F6,
or as "o" followed by a combining diaeresis (U+006F U+0308). Programs must
avoid splitting sequences of code points that represent a single abstract
character. A program that respects those boundaries will automatically also
avoid splitting within the encoded representation of a code point, whatever
UTF is used.
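
To make the "ö" example concrete, here is a minimal sketch in Python. It
assumes an interpreter whose str type is indexed by code point. The helper
names split_clusters and encoded_pieces are hypothetical. The grouping rule
(attach each combining mark to the preceding code point) is a deliberate
simplification, not the full segmentation algorithm of the Unicode standard.

```python
import unicodedata

precomposed = "\u00f6"   # "ö" as the single code point U+00F6
decomposed = "o\u0308"   # "o" followed by U+0308 COMBINING DIAERESIS

# Same abstract character, two different code point sequences.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed


def split_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified assumption: a code point with a nonzero canonical
    combining class belongs to the preceding cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def encoded_pieces(text, encoding):
    """Encode each cluster separately (use a BOM-less encoding)."""
    return [cluster.encode(encoding) for cluster in split_clusters(text)]
```

Because every cluster boundary is also a code point boundary, the separately
encoded pieces concatenate back to the encoding of the whole string, for
example with "utf-8" or "utf-16-le". Splitting at those boundaries therefore
never cuts through a multi-byte sequence.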

> If you want that kind of
> exposure, use the bytes type.  Otherwise assume the usage will be by people
> ignorant of Unicode and thus want something that will work the way they are
> used to when compared to working in ASCII.

It simply is not possible to do correct string processing in Unicode that
will "work the way [programmers] are used to when compared to working in ASCII".

The Unicode standard is on-line at www.unicode.org, and is quite well written,
with lots of motivation and explanation of how processing international texts
necessarily differs from working with ASCII. There is no excuse for any
programmer doing text processing not to have read it.

Should we nevertheless try to avoid making the use of Unicode strings
unnecessarily difficult for people who have minimal knowledge of Unicode?
Absolutely, but not at the expense of making basic operations on strings
asymptotically less efficient. O(1) indexing and slicing is a basic
requirement, even if it has to be done using code units.
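
As an illustration of what indexing by code units means for UTF-16, here is
a rough sketch in Python. The function names are hypothetical, and the list
of integers merely stands in for an internal UTF-16 buffer; it is not a
description of any actual implementation. Indexing the list is O(1), and
the only extra care needed when choosing a split position is a constant-time
check for a surrogate pair.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def safe_split_index(units, i):
    """Return i, moved back by one if it falls inside a surrogate pair.

    Runs in constant time: it inspects at most two code units.
    """
    if (0 < i < len(units)
            and is_low_surrogate(units[i])
            and is_high_surrogate(units[i - 1])):
        return i - 1
    return i
```

For the string "a\U0001D11Eb" the code units are 0x0061, 0xD834, 0xDD1E and
0x0062, so a split requested at index 2 would be moved back to index 1.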

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>