[Python-3000] How will unicode get used?

Wed Sep 20 20:32:04 CEST 2006

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> >
> > Let me cut this short. The external string API in Py3k should not
> > change or only very marginally so (like removing rarely used useless
> > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > API that is supported (in 2.x) by both str and unicode, but merge the
> > two string types into one. Anything else could be done just as easily
> > before or after Py3k.
>
> Thanks, but one thing remains unclear: is the indexing intended to
> represent bytes, code points, or code units?

I don't see what's unclear -- the existing unicode object does what it does.

> Note that C code
> operating on UTF-16 would use code units for slicing of UTF-16, which
> splits surrogate pairs.

I thought we were discussing the Python API.

C code will likely have the same access to unicode objects as it has in 2.x.

> As far as I can tell, CPython on Windows uses UTF-16 with code units.
> Perhaps not intentionally, but by default (not throwing an error on
> surrogates).

This is intentional, to be compatible with the rest of that platform.
Jython and IronPython do this too, I believe.
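
For readers following the bytes / code points / code units distinction,
here is a minimal sketch of how the three differ for one character
outside the Basic Multilingual Plane. It is written for a current
Python 3 interpreter rather than the 2.x unicode object under
discussion, and the helper name utf16_code_units is made up for this
illustration. On a UTF-16 ("narrow") build, indexing works on the code
units shown here, so a slice can land between the two halves of a
surrogate pair.

```python
# Illustration only: assumes Python 3.3 or later, where str is indexed
# by code point. The helper name below is hypothetical.

def utf16_code_units(text):
    """Return the 16-bit code units of text's UTF-16 encoding."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

s = "\U00010400"                      # one character outside the BMP

code_points = [ord(ch) for ch in s]   # one code point: U+10400
units = utf16_code_units(s)           # two code units: a surrogate pair
utf8_bytes = s.encode("utf-8")        # four bytes in UTF-8

# Slicing the code-unit sequence after the first unit splits the pair
# and leaves a lone high surrogate.
first_unit = units[:1][0]
is_lone_high_surrogate = 0xD800 <= first_unit <= 0xDBFF

print(len(code_points), len(units), len(utf8_bytes))
print([hex(u) for u in units], is_lone_high_surrogate)
```

The same string therefore has length 1, 2, or 4 depending on which unit
the index counts, which is the ambiguity the question above is about.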

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)