[Python-3000] How will unicode get used?

Adam Olsen rhamph at gmail.com
Wed Sep 20 20:43:03 CEST 2006


On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > > Before we can decide on the internal representation of our unicode
> > > > objects, we need to decide on their external interface.  My thoughts
> > > > so far:
> > >
> > > Let me cut this short. The external string API in Py3k should not
> > > change or only very marginally so (like removing rarely used useless
> > > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > > API that is supported (in 2.x) by both str and unicode, but merge the
> > > two string types into one. Anything else could be done just as easily
> > > before or after Py3k.
> >
> > Thanks, but one thing remains unclear: is the indexing intended to
> > represent bytes, code points, or code units?
>
> I don't see what's unclear -- the existing unicode object does what it does.

The existing unicode object doesn't expose the difference between them
except when UTF-16 is used and surrogates exist.
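
To make the three candidate units concrete, here is a minimal sketch.
It assumes a current Python 3 (3.3 or later), where str is indexed by
code point; the sample string is my own choice for illustration and
is not taken from any existing code.

```python
# Illustration only: the sample string is an assumption chosen for this sketch.
# 'a' is inside the BMP; U+1D11E (MUSICAL SYMBOL G CLEF) is outside it.
s = "a\U0001D11E"

# Code points: what len() reports on a Python 3.3+ str.
code_points = len(s)

# UTF-16 code units: each unit is 16 bits, so divide the byte count by 2.
utf16_units = len(s.encode("utf-16-le")) // 2

# Bytes: shown here for the UTF-8 encoding.
utf8_bytes = len(s.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)  # 2 3 5
```

The three counts agree only for pure ASCII text, and code points and
UTF-16 code units agree as long as every character is inside the BMP.
That is why the difference stays hidden until surrogates show up.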


> > Note that C code
> > operating on UTF-16 would use code units for slicing of UTF-16, which
> > splits surrogate pairs.
>
> I thought we were discussing the Python API.
>
> C code will likely have the same access to unicode objects as it has in 2.x.

I only mentioned it because C doesn't mind exposing the internal
details for performance benefits, whereas Python usually does mind.
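
A rough sketch of what slicing by code unit does to a surrogate pair.
This is written in present-day Python 3 rather than C, and the helper
names (utf16_code_units, decodes_cleanly) are hypothetical ones made up
for this example, not part of any existing API.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def decodes_cleanly(units):
    """Return True if the code units form well-formed UTF-16."""
    data = struct.pack("<%dH" % len(units), *units)
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# 'a' plus U+1D11E, which UTF-16 stores as the pair D834 DD1E.
units = utf16_code_units("a\U0001D11E")

# Slicing by code unit keeps 'a' and the high surrogate, dropping the low one.
head = units[:2]

print([hex(u) for u in units])
print(decodes_cleanly(units), decodes_cleanly(head))
```

The full sequence decodes, but the two-unit slice ends in a lone high
surrogate and does not. Code working at the code unit level has to
check for this itself.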


> > As far as I can tell, CPython on Windows uses UTF-16 with code units.
> > Perhaps not intentionally, but by default (not throwing an error on
> > surrogates).
>
> This is intentional, to be compatible with the rest of that platform.
> Jython and IronPython do this too I believe.

So you're saying we should use code units?!  Or are you referring to
the choice of UTF-16?

I would expect us to use code points in 3.x, but that's not how it is in 2.x.

-- 
Adam Olsen, aka Rhamphoryncus

