[Python-Dev] PEP 393 Summer of Code Project

Terry Reedy tjreedy at udel.edu
Wed Aug 24 12:06:39 CEST 2011


On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote:
> Terry Reedy writes:
>
>   >  The current UCS2 Unicode string implementation, by design, quickly gives
>   >  WRONG answers for len(), iteration, indexing, and slicing if a string
>   >  contains any non-BMP (surrogate pair) Unicode characters. That may have
>   >  been excusable when there essentially were no such extended chars, and
>   >  the few there were were almost never used.
>
> Well, no, it gives the right answer according to the design.  unicode
> objects do not contain character strings.

Excuse me for believing the fine 3.2 manual that says
"Strings contain Unicode characters." (And to a naive reader, that 
implies that string iteration and indexing should produce Unicode 
characters.)

>  By design, they contain code point strings.

For the purpose of my sentence, that is the same thing, in that code 
points correspond to characters, where 'character' includes ASCII 
control 'characters' and their Unicode analogs. The problem is that on 
narrow builds strings are NOT code point sequences. They are 2-byte 
code *unit* sequences. A single non-BMP code point is seen as 2 code 
units and hence given a length of 2, not 1. Strings iterate, index, and 
slice by 2-byte code units, not by code points.
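
A concrete session shows it (results as on a 3.2 narrow build, such as 
the python.org Windows installers; a wide build gives 1 and the intact 
character):

>>> s = '\U00010123'   # one code point beyond the BMP
>>> len(s)             # counts 16-bit code units, not code points
2
>>> s[0]               # a lone high surrogate, not a character
'\ud800'
>>> s[1]
'\udd23'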

Python floats try to follow the IEEE standard as interpreted for Python 
(Python raises its own software exceptions rather than using the 
standard's signalling versus non-signalling hardware signals). Python 
decimals slavishly follow the IEEE decimal standard. Python narrow-build 
unicode breaks the standard for non-BMP code points and consequently 
breaks the re module, even where it works on wide builds. As 
sys.maxunicode more or less says, only the BMP subset is fully 
supported. Any narrow-build string with even one non-BMP char violates 
the standard.
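
For instance (narrow-build results; a wide build reports 1114111 and 
returns the intact character):

>>> import sys, re
>>> sys.maxunicode     # 0x10ffff == 1114111 on a wide build
65535
>>> re.findall(r'.', '\U00010123')   # '.' should match one character
['\ud800', '\udd23']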

> Guido has made that absolutely clear on a number
> of occasions.

It is not clear what you mean, but recently on python-ideas he has 
reiterated that he intends bytes and strings to be conceptually 
different. Bytes are computer-oriented binary arrays; strings are 
supposedly human-oriented character/codepoint arrays. Except they are 
not for non-BMP characters/codepoints. Narrow build unicode is 
effectively an array of two-byte binary units.
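
The intended contrast, and where it leaks, is easy to see interactively:

>>> b'abc'[0]          # bytes index to small ints: binary data
97
>>> 'abc'[0]           # strings index to 1-char strings: characters
'a'
>>> '\U00010123'[0]    # narrow build: half a surrogate pair instead
'\ud800'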

> And the reasons have very little to do with lack of
> non-BMP characters to trip up the implementation.  Changing those
> semantics should have been done before the release of Python 3.

The documentation was changed at least a bit for 3.0, and anyway, as 
indicated above, it is easy (especially for new users) to read the docs 
in a way that makes the current behavior buggy. I agree that the 
implementation should have been changed already.

Currently, the meaning of Python code differs between narrow and wide 
builds, and in a way that few users would expect or want. PEP 393 
abolishes narrow builds as we now know them and changes semantics. I was 
answering a complaint about that change. If you do not like the PEP, fine.
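
Under PEP 393, the sessions above would give the code-point answers on 
any platform:

>>> s = '\U00010123'
>>> len(s)             # code points, not code units
1
>>> s[0] == s          # indexing yields the whole character
True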

My separate proposal in my other post is for an alternative 
implementation, but with, I presume, pretty much the same visible changes.

> It is not clear to me that it is a good idea to try to decide on "the"
> correct implementation of Unicode strings in Python even today.

If the implementation is invisible to the Python user, as I believe it 
should be (short of special introspection), and mostly invisible in the 
C-API except to those who intentionally poke into the details, then the 
implementation can be changed as the consensus on the best 
implementation changes.

> There are a number of approaches that I can think of.
>
> 1.  The "too bad if you can't take a joke" approach: do nothing and
>      recommend UTF-32 to those who want len() to DTRT.
> 2.  The "slope is slippery" approach: Implement UTF-16 objects as
>      built-ins, and then try to fend off requests for correct treatment
>      of unnormalized composed characters, normalization, compatibility
>      substitutions, bidi, etc etc.
> 3.  The "are we not hackers?" approach: Implement a transform that
>      maps characters that are not represented by a single code point
>      into Unicode private space, and then see if anybody really needs
>      more than 6400 non-BMP characters.  (Note that this would
>      generalize to composed characters that don't have a one-code-point
>      NFC form and similar non-standardized cases that nonstandard users
>      might want handled.)
> 4.  The "42" approach: sadly, I can't think deeply enough to explain it.
>
> There are probably others.
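
Just to make approach 3 concrete: a rough sketch, purely illustrative 
and untested (all names are mine), that folds non-BMP characters into 
the 6400 slots of the BMP Private Use Area, accepting either surrogate 
pairs (as a narrow build delivers them) or real code points:

PUA_START, PUA_SIZE = 0xE000, 6400   # U+E000..U+F8FF

def fold(s, table, reverse):
    """Replace each non-BMP character with a BMP private-use code point.

    table maps real code point -> PUA code point; reverse is its inverse.
    """
    out, i = [], 0
    units = [ord(u) for u in s]
    while i < len(units):
        cu = units[i]
        if (0xD800 <= cu <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # narrow build: combine a surrogate pair into a code point
            cp = 0x10000 + ((cu - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            cp = cu
            i += 1
        if cp > 0xFFFF:
            if cp not in table:
                if len(table) >= PUA_SIZE:
                    raise ValueError('private space exhausted')
                pua = PUA_START + len(table)
                table[cp], reverse[pua] = pua, cp
            cp = table[cp]
        out.append(chr(cp))
    return ''.join(out)

def unfold(s, reverse):
    """Expand folded private-use code points back to the originals."""
    return ''.join(chr(reverse.get(ord(u), ord(u))) for u in s)

After fold(), len(), indexing, and slicing count one unit per character 
on either build; unfold() restores the original, as long as the 6400 
slots suffice.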
>
> It's true that Python is going to need good libraries to provide
> correct handling of Unicode strings (as opposed to unicode objects).

Given that 3.0 unicode (string) objects are defined as Unicode character 
strings, I do not see the opposition.

> But it's not clear to me given the wide variety of implementations I
> can imagine that there will be one best implementation, let alone
> which ones are good and Pythonic, and which not so.

-- 
Terry Jan Reedy


