[Python-3000] How will unicode get used?
Adam Olsen
rhamph at gmail.com
Wed Sep 20 20:20:13 CEST 2006
On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface. My thoughts
> > so far:
>
> Let me cut this short. The external string API in Py3k should not
> change or only very marginally so (like removing rarely used useless
> APIs or adding a few new conveniences). The plan is to keep the 2.x
> API that is supported (in 2.x) by both str and unicode, but merge the
> two string types into one. Anything else could be done just as easily
> before or after Py3k.
Thanks, but one thing remains unclear: is the indexing intended to
represent bytes, code points, or code units? Note that C code
operating on UTF-16 would slice by code units, which can split
surrogate pairs.
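To illustrate the concern, here is a minimal sketch of what slicing by code
unit does to a character outside the BMP. It assumes a modern Python 3, where
str indexes by code point, so the UTF-16 buffer is simulated with a tuple of
integers. The helper names utf16_code_units and has_lone_surrogate are
hypothetical, made up for this example.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a tuple of ints."""
    data = text.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)

def has_lone_surrogate(units):
    """Return True if any surrogate in units is not part of a valid pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return True
        i += 1
    return False

clef = "\U0001D11E"            # one code point outside the BMP
text = "a" + clef + "b"        # three code points

units = utf16_code_units(text) # four code units: the clef needs a pair
head = units[:2]               # slice by code unit, as C code might

print(len(text), len(units))
print(has_lone_surrogate(units), has_lone_surrogate(head))
```

The slice keeps 'a' and only the high half of the surrogate pair, which is
the kind of split the question is about.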
As far as I can tell, CPython on Windows uses UTF-16 with code units.
Perhaps not intentionally, but by default (not throwing an error on
surrogates).
For those trying to make sense of this, a Code Point is anything in the 0
to 0x10FFFF range. A Code Unit goes up to 0xFF for UTF-8, 0xFFFF for
UTF-16, and 0xFFFFFFFF for UTF-32. One or more code units may be
needed to form a single code point. Obviously code units expose our
internal implementation choice.
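As a rough sketch of the distinction, the following counts how many code
units each encoding needs for a single code point. It assumes a modern
Python 3 and its standard codecs. The helper name code_units and the sample
characters are my own choices for illustration.

```python
def code_units(text, encoding, unit_size):
    """Count the code units of text in encoding, given bytes per unit."""
    return len(text.encode(encoding)) // unit_size

# One code point each: U+0041, U+00E9, U+20AC, U+1D11E
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

counts = {}
for s in samples:
    counts[ord(s)] = (
        code_units(s, "utf-8", 1),      # 8-bit code units
        code_units(s, "utf-16-le", 2),  # 16-bit code units
        code_units(s, "utf-32-le", 4),  # 32-bit code units
    )

for cp, (u8, u16, u32) in counts.items():
    print("U+%04X  utf-8: %d  utf-16: %d  utf-32: %d" % (cp, u8, u16, u32))
```

Only UTF-32 has a one-to-one mapping between code units and code points, so
an index that counts code units gives different answers depending on the
encoding chosen internally.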
--
Adam Olsen, aka Rhamphoryncus