[Python-3000] Making more effective use of slice objects in Py3k

Fri Sep 1 00:04:50 CEST 2006

(Adding back py3k list assuming you just forgot it)

On 8/31/06, Paul Prescod <paul at prescod.net> wrote:
> On 8/31/06, Guido van Rossum <guido at python.org> wrote:
>
> > > (The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes
> > > per character, and doesn't support the supplemental characters above
> > > 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.)
> >
> > I think we should also support UTF-16, since Java and .NET (and
> > Win32?) appear to be using it effectively; making surrogate handling an
> > application issue doesn't seem *too* big of a burden for many apps.
>
> I think that the reason that UTF-16 seems "not too big of a burden" is
> because people just ignore the UTF-16-ness of the data and hope that people
> don't use those characters. In effect they trade correctness and
> internationalization for simplicity and performance. It seems like it may
> become a bigger issue as time goes by.

Well, there's a large class of apps that don't do anything for which
surrogates matter, since they just copy strings around and only split
them at specific characters.  E.g. parsing XML would often fall in
this category.
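
To illustrate why splitting at specific characters is safe, here is a
minimal sketch. It assumes a string held as a list of UTF-16 code units;
the helper names (utf16_units, split_units, units_to_str) are hypothetical
and exist only for this example. Surrogate code units all lie in the range
0xD800-0xDFFF, so a surrogate can never compare equal to a delimiter such
as "<", and a surrogate pair is never cut in half by the split.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def units_to_str(units):
    """Decode a list of UTF-16 code units back into a str."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

def split_units(units, delimiter):
    """Split code units at each occurrence of a non-surrogate BMP delimiter."""
    delim = ord(delimiter)
    if delim > 0xFFFF or 0xD800 <= delim <= 0xDFFF:
        raise ValueError("delimiter must be a non-surrogate BMP character")
    parts, current = [], []
    for unit in units:
        if unit == delim:
            # A surrogate unit never equals delim, so pairs stay intact.
            parts.append(current)
            current = []
        else:
            current.append(unit)
    parts.append(current)
    return parts

# Two supplementary characters (U+1D11E, U+10348) around a "<" delimiter.
text = "a\U0001D11Eb<c\U00010348d"
pieces = [units_to_str(p) for p in split_units(utf16_units(text), "<")]
```

The splitting code never looks at surrogates at all, which is the point:
it treats them as opaque units and the pieces still decode correctly.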

> Plus, it sounds like you're proposing that the encodings of the underlying
> data would leak through to the application. As I understood Fredrik's
> model, the intention was to treat the encoding as an implementation detail.
> If it works well, this could be an important differentiator for Python
> (versus Java) as Unicode already is (versus Ruby).

*Only* for UTF-16, which I consider a necessary evil since we can't
rewrite the Java and .NET standards.
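
For concreteness, a rough sketch of what an application sees when UTF-16
shows through. The function names here are hypothetical, and the input is
assumed to be well-formed UTF-16. A character above 0xffff occupies two
code units, so the code-unit length and the code-point count can differ,
and an app that cares has to pair the surrogates up itself.

```python
def to_surrogates(code_point):
    """Split a supplementary code point (above 0xFFFF) into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a high/low surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def code_point_count(units):
    """Count code points in well-formed UTF-16 by skipping low surrogates."""
    return sum(1 for unit in units if not 0xDC00 <= unit <= 0xDFFF)

# "a", U+1D11E (one character, two code units), "b"
high, low = to_surrogates(0x1D11E)
units = [ord("a"), high, low, ord("b")]
unit_length = len(units)             # what a UTF-16 string reports as its length
char_count = code_point_count(units) # what the user thinks of as characters
```

Indexing or slicing such a sequence between units[1] and units[2] would
leave a lone surrogate on each side; that is the part left to the app.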

> So my basic feeling is that if we're going to hide UTF-8 from the programmer
> then we might as well go the extra mile and hide UTF-16 as well.

I don't think the issues are the same.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)