[Python-3000] UTF-16

Fri Sep 1 05:46:55 CEST 2006

On 8/31/06, Paul Prescod <paul at prescod.net> wrote:
> On 8/31/06, Guido van Rossum <guido at python.org> wrote:
> > (Adding back py3k list assuming you just forgot it)
>
> Yes, thanks. Gmail's UI really optimizes the "Reply" operation over "Reply
> To All."
>
> > > Plus, it sounds like you're proposing that the encodings of the
> > > underlying data would leak through to the application. As I understood
> > > Fredrick's model, the intention was to treat the encoding as an
> > > implementation detail. If it works well, this could be an important
> > > differentiator for Python (versus Java) as Unicode already is (versus
> > > Ruby).
> >
> > *Only* for UTF-16, which I consider a necessary evil since we can't
> > rewrite the Java and .NET standards.
>
> I see what you're getting at.
>
> I'd say that decoding UTF-16 data in CPython and PyPy should (by default)
> create true Unicode characters. Jython and IronPython could create
> surrogates and characters when necessary. When you run the program in
> CPython you'll get better behaviour than in Jython/IronPython. Maybe there
> could be a way to make CPython run like Jython and IronPython if you wanted
> 100% absolute compatibility between the environments. I think that we agree
> that it would be unfortunate if CPython copied Java and .NET to its own
> detriment. It's also not inconceivable that Java and .NET might evolve a
> 4-byte mode in the long term.

I think it would be best to do this as a CPython configuration option
just like it's done today. You can choose 4-byte or 2-byte Unicode
(essentially UCS-4 or UTF-16) in order to be compatible with other
packages on the platform. Yes, 4-byte gives better Unicode support.
But 2-byte may be more compatible with other stuff on the platform.
Too bad .NET and Java don't have this option. :-)
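
To make the difference concrete, here is a rough sketch (not anything
CPython itself ships) that counts how many storage units one character
outside the Basic Multilingual Plane needs under each scheme. The helper
names utf16_code_units and ucs4_code_units are made up for illustration;
sys.maxunicode is the real attribute that reports the largest code point
a given build can hold in a single string element. It assumes Python 3
string literals.

```python
import sys


def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    # utf-16-le writes no byte order mark, so bytes / 2 is the unit count.
    return len(text.encode("utf-16-le")) // 2


def ucs4_code_units(text):
    """Count the 32-bit code units needed to store text as UCS-4/UTF-32."""
    return len(text.encode("utf-32-le")) // 4


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
clef = "\U0001D11E"

print(utf16_code_units(clef))  # a surrogate pair in a 2-byte representation
print(ucs4_code_units(clef))   # a single unit in a 4-byte representation
print(hex(sys.maxunicode))     # depends on how the interpreter was built
```

In a 2-byte representation the character takes a surrogate pair, so code
that indexes or slices by storage unit can see two items where the user
sees one; in a 4-byte representation it is a single item.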

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)