[Python-Dev] len(chr(i)) = 2?

Stephen J. Turnbull stephen at xemacs.org
Thu Nov 25 03:17:44 CET 2010


Alexander Belopolsky writes:

 > Any non-trivial text processing is likely to be broken in the
 > presence of surrogates.

If you're worried about this, write a UCS-2-producing codec that
rejects surrogates or stuffs them into the Private Use Area of the
BMP.  Maybe such a codec should be the default, but so far nobody
seems to want one enough; they want UTF-16 even though they know it's
wrong.<wink>
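Something like this, say (just a sketch off the top of my head, not a
registered codec; the mapping into the Private Use Area is an
arbitrary choice of mine):

    def scrub_surrogates(s, policy="reject"):
        # Reject surrogate code units outright, or "stuff" them into
        # the BMP Private Use Area (U+E000..U+F8FF); the surrogate
        # block is 2048 values, so it fits with room to spare.
        out = []
        for ch in s:
            cp = ord(ch)
            if 0xD800 <= cp <= 0xDFFF:          # surrogate code unit
                if policy == "reject":
                    raise ValueError("surrogate U+%04X" % cp)
                out.append(chr(0xE000 + (cp - 0xD800)))
            else:
                out.append(ch)
        return "".join(out)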

One of the things that makes the 16-bit code unit attractive to me is
that the options for working around the variable-width nature of
UTF-16 (without actually implementing conformance to UTF-16 in
internal operations!) are many.  If you use octets as code units, you
don't have such options: you have to do it right.
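For example, here's one such workaround (a sketch, nothing like this
is in the stdlib): pair the surrogates up at iteration time, so that
callers see code points rather than code units.

    def iter_code_points(s):
        # Yield integer code points, combining well-formed surrogate
        # pairs and passing lone surrogates through unchanged.
        i, n = 0, len(s)
        while i < n:
            cp = ord(s[i])
            if 0xD800 <= cp <= 0xDBFF and i + 1 < n:
                low = ord(s[i + 1])
                if 0xDC00 <= low <= 0xDFFF:     # a well-formed pair
                    yield 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                    i += 2
                    continue
            yield cp                            # BMP or lone surrogate
            i += 1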

 > Processing surrogate pairs in python code is hard.

Sure, but as James Knight and MAL point out, so is processing composing
characters, and those errors will go undetected in your proposals,
even with a strict UCS-2 definition.  What can you do?  Banning
composing characters isn't going to fly!
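The composing-character problem is trivial to demonstrate with the
stdlib (nothing hypothetical here):

    import unicodedata

    s1 = "\u00e9"     # 'e' with acute, one precomposed code point
    s2 = "e\u0301"    # 'e' + U+0301 COMBINING ACUTE ACCENT
    assert len(s1) == 1 and len(s2) == 2
    assert s2[:1] == "e"    # code-point-correct slice, wrong "character"
    assert unicodedata.normalize("NFC", s2) == s1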

 > Yes, allowing non-trusted users to specify fill character is unlikely,
 > but it is quite likely that naive slicing or iteration over string
 > units would result in
 > 
 > Traceback (most recent call last):

Naive slicing yes, but naive iteration (i.e., iteration that consumes
the whole string, or up to a known character, rather than up to a
specified position) is highly unlikely to result in such a traceback.
It is precisely that property (non-BMP characters get passed through
unchanged, or ignored) that makes extension to non-BMP code points
attractive.
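Concretely, a whole-string pass like this made-up filter never looks
at surrogate halves individually, so non-BMP characters come out the
other side intact:

    import string

    def strip_ascii_punctuation(s):
        # Consumes the whole string unit by unit.  A surrogate code
        # unit never equals an ASCII punctuation character, so pairs
        # pass through untouched and are reassembled by join().
        return "".join(u for u in s if u not in string.punctuation)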

 > I agree again, but I feel that exposing code units rather than code
 > points at the Python string level takes us back to 2.x days of mixing
 > bytes and strings.

It does, but there's a difference.  With bytes as UTF-8, only ASCII
values have defined semantics in Unicode.  The rest have semantics
that are context-dependent, and they are frequent in any non-English
processing and in many English use cases (math symbols, correctly-
oriented punctuation).  With 16-bit code units, all values have well-
defined semantics in Unicode, and non-characters are going to be
extremely rare in the vast majority of use cases.  IOW, you can think
of Python as a UCS-2 device processing characters, and let surrounding
UTF-16 processors deal with the errors.
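The difference shows up immediately at the interpreter prompt
(illustrative values):

    b = "\u00e9".encode("utf-8")        # b'\xc3\xa9': two bytes, no
                                        # meaning in isolation
    b[:1].decode("utf-8", "replace")    # '\ufffd': context is gone
    "\u00e9".encode("utf-16-be")        # b'\x00\xe9': one 16-bit unit,
                                        # a complete character by itself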

 > Let me quote Guido circa 2001 again:
 > 
 > """
 > ... if we had wanted to use a
 > variable-length internal representation, we should have picked UTF-8
 > way back, like Perl did.  Moving to a UTF-16-based internal
 > representation now will give us all the problems of the Perl choice
 > without any of the benefits.
 > """
 > 
 > I don't understand what changed since 2001 that made this argument
 > invalid.

Nothing.  The internal representation of Python is UCS-2, not UTF-16.
People who want to think otherwise are kidding themselves.  The
presence of surrogates is not sufficient to call something UTF-16.
Preserving the Unicode code points through any builtin operations is a
necessary condition, and Python doesn't do that.  *However*, in my
opinion, it's not a big deal to allow surrogates in UCS-2 a la ISO
10646-1:1996.  That lets people who want a quick and dirty way to
handle BMP text that *might* (but usually won't) contain some non-BMP
characters go a long way fast.  "Although practicality beats purity."
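That's exactly what the subject line is pointing at.  On a narrow
(16-bit) build:

    s = "\U00010000"   # one code point beyond the BMP
    len(s)             # 2: two code units for one code point
    s[0]               # '\ud800', a lone high surrogate
    s[::-1]            # reverses the units and breaks the pair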

 > I note that an opinion has been raised on this thread that
 > if we want compressed internal representation for strings, we should
 > use UTF-8.  I tend to agree, but UTF-8 has been repeatedly rejected as
 > too hard to implement.  What makes UTF-16 easier than UTF-8?  Only the
 > fact that you can ignore bugs longer, in my view.

That's mostly true.  My guess is that we can probably ignore those
bugs for as long as it takes someone to write the higher-level
libraries that James suggests and MAL has actually proposed and
started a PEP for.

