[Python-3000] How will unicode get used?

Thu Sep 21 00:52:16 CEST 2006

On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
> I wrote:
> >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
> >>> msg[35:-18]
> u'"\U00010143"'
> >>> greek_five = msg[36:-19]
> >>> len(greek_five)
> 2
>
>
> After posting, I realized that it's worse than that. I suspect that if
> I tried this on a CPython compiled with wide characters, then
> len(greek_five) would be 1.
>
> What should it be? 2? 1? Implementation-dependent?

This has all been rehashed endlessly. It's implementation-dependent
(and dependent on platform and compilation options) because there are
good reasons for both choices. Even if CPython 3.0 supports a dynamic
choice (which some are proposing), the *language* will still make
it implementation-dependent because of Jython and IronPython, where
the only choice is UTF-16 (or UCS-2, depending on the attitude towards
surrogates).
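
For concreteness, here is a minimal sketch of where the two answers come
from. It assumes Python 3.3 or later, where str is indexed by code point;
the helper names utf16_code_units and code_points are illustrative only and
do not appear elsewhere in this thread.

```python
# The character from the quoted example: U+10143, outside the
# Basic Multilingual Plane, so UTF-16 needs a surrogate pair for it.
greek_five = "\U00010143"


def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16.

    This is what len() reports on a UTF-16 (narrow) implementation.
    """
    # utf-16-le emits no byte-order mark, so every 2 bytes is one unit.
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    """Count Unicode code points in text.

    This is what len() reports on a UCS-4 (wide) implementation.
    """
    # utf-32-le emits no byte-order mark and uses 4 bytes per code point.
    return len(text.encode("utf-32-le")) // 4


print(utf16_code_units(greek_five))  # 2
print(code_points(greek_five))       # 1
```

An implementation whose strings are UTF-16 would be expected to report the
first number from len(), and a wide build the second, which is the
difference the question above is about.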

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)