[Python-3000] How will unicode get used?

Adam Olsen rhamph at gmail.com
Thu Sep 21 01:02:49 CEST 2006


On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
> > I wrote:
> > >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
> > >>> msg[35:-18]
> > u'"\U00010143"'
> > >>> greek_five = msg[36:-19]
> > >>> len(greek_five)
> > 2
> >
> >
> > After posting, I realized that it's worse than that. I suspect that if
> > I tried this on a CPython compiled with wide characters, then
> > len(greek_five) would be 1.
> >
> > What should it be? 2? 1? Implementation-dependent?
>
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices. Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending on the attitude towards
> surrogates).

Wow, you really did mean code units.  In that case I'm very tempted to
support UTF-8 with byte indexing (in UTF-8 the code units are bytes).
It's ugly, but it technically works fine, and it's the de facto
standard on Linux.  No uglier than UTF-16 code units IMO, just more
obvious.
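
To make the units concrete, here is a rough sketch that counts the same
one-character string three ways.  It assumes a modern Python 3 (3.3 or
later), where str is indexed by code point regardless of build options;
that behaviour postdates this thread, and the variable names are only
illustrative.

```python
# The character from Michael's example, which lies outside the BMP.
greek_five = "\U00010143"

# Assumption: Python 3.3+, where len() on str counts code points.
code_points = len(greek_five)

# UTF-16 code units are 2 bytes each; a non-BMP character needs a
# surrogate pair, i.e. two code units.
utf16_units = len(greek_five.encode("utf-16-le")) // 2

# UTF-8 code units are single bytes, so byte indexing is code unit indexing.
utf8_bytes = greek_five.encode("utf-8")
utf8_units = len(utf8_bytes)

print(code_points, utf16_units, utf8_units)

# Slicing by code unit can cut a character in half.  With bytes the
# result is visibly not text, and decoding it fails.
fragment = utf8_bytes[:2]
try:
    fragment.decode("utf-8")
    split_is_valid = True
except UnicodeDecodeError:
    split_is_valid = False
print(split_is_valid)
```

The same character is 1, 2, or 4 units long depending on which unit you
index by, which is the whole disagreement in a nutshell.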

-- 
Adam Olsen, aka Rhamphoryncus

