[Python-Dev] len(chr(i)) = 2?

Fri Nov 19 23:25:03 CET 2010

Victor Stinner wrote:
> Hi,
> 
> On Friday 19 November 2010 17:53:58 Alexander Belopolsky wrote:
>> I was recently surprised to learn that chr(i) can produce a string of
>> length 2 in python 3.x.
> 
> Yes, but only on a narrow build. E.g. Debian and Ubuntu compile Python 3.1 in 
> wide mode (sys.maxunicode == 1114111).
> 
>> I suspect that I am not alone in finding this behavior non-obvious, 
>> given that a mistake in the Python manual stating the contrary survived 
>> several releases.  [1]
> 
> It was a documentation bug and you fixed it. Non-BMP characters are rare, so 
> few (maybe only you?) noticed the documentation bug. I consider the behaviour 
> an improvement in Python3's non-BMP support.
> 
> Python is unclear about non-BMP characters: the narrow build was called "ucs2" 
> for a long time, even though it is UTF-16 (each character is encoded as one or 
> two UTF-16 words).

No, no, no :-)

UCS2 and UCS4 are more appropriate than "narrow" and "wide" or even
"UTF-16" and "UTF-32".

It's rather common to confuse a transfer encoding with a storage format.
UCS2 and UCS4 refer to code units (the storage format). You can use
UCS2 and UCS4 code units to represent UTF-16 and UTF-32 resp., but those
are not the same things.

In UTF-16 0xD800 has a special meaning, in UCS2 it doesn't.
Python uses UCS2 internally. It does not assign a special meaning
to those surrogate code point ranges.
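
A minimal sketch of that distinction, assuming a current CPython 3
interpreter (which no longer has separate UCS2/UCS4 builds): the string
type stores 0xD800 like any other code unit, and only the UTF-16 codec
objects to it. The variable names are illustrative only.

```python
# The string type stores a lone surrogate like any other code unit.
lone = chr(0xD800)
print(len(lone), hex(ord(lone)))

# The UTF-16 codec, on the other hand, gives 0xD800 a special meaning:
# an unpaired surrogate is not valid UTF-16, so encoding is refused.
try:
    lone.encode("utf-16-le")
except UnicodeEncodeError as exc:
    encode_error = exc
else:
    encode_error = None

print(type(encode_error).__name__)
```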

However, when it comes to codecs, we do try to make use of the fact
that UCS2 can easily be used to represent a UTF-16 encoding. That's
why you often see surrogates being created for code points that
wouldn't otherwise fit into UCS2, and why you see those surrogates
being converted back to single code units in UCS4 builds.
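
The codec behaviour can be observed roughly as follows. This is a
sketch that assumes Python 3.3 or later, where every build behaves like
a UCS4 build for this purpose. The example code point U+1F600 is an
arbitrary choice.

```python
import struct

ch = "\U0001F600"                    # a code point outside the BMP

# Encoding to UTF-16 creates a surrogate pair: two 16-bit code units.
encoded = ch.encode("utf-16-be")
code_units = struct.unpack(">2H", encoded)
print([hex(u) for u in code_units])  # ['0xd83d', '0xde00']

# Decoding converts the pair back into a single code point.
decoded = encoded.decode("utf-16-be")
print(len(decoded))                  # 1
```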

I don't know who invented the terms "narrow" and "wide" builds
for Python3. Not me, that's for sure :-) They don't have any
meaning in Unicode terminology and thus cause even more confusion
than UCS2 and UCS4. E.g. the import errors you get when importing
extensions built for a different Unicode version (correctly) refer
to UCS2 vs. UCS4, and so now give even less of a clue that they
relate to a difference in Unicode builds (since these are now
labeled "narrow" and "wide").

IMO, we should go back to the Python2 terms UCS2 and UCS4 which
are correct and provide a clear description of what Python uses
internally for code units.

> Python2 accepts non-BMP characters with \U syntax, but not with 
> unichr(). This is inconsistent and I see this as a bug. But I don't want to 
> touch Python2 about non-BMP characters, and the "bug" is already fixed in Python3!
> 
>> I do believe, however that a change like
>> this [2] and its consequences should be better publicized.
> 
> Change made before the release of Python 3.0. Do you want to patch the "What's 
> new in Python 3.0?" document?

Perhaps add a section "What we forgot to mention in 3.0" or
"What's not so new in 3.2" to "What's new in 3.2" :-)

>> I have not
>> found any discussion of this change in PEPs or "What's new" documents.
>>  The closest find was a mention of a related issue #3280 in the 3.0
>> NEWS file. [3]  Since this feature will be first documented in the
>> Library Reference in 3.2, I wonder whether it would be appropriate to
>> mention it in "What's new in 3.2"?
> 
> In my opinion, the question is more why it was not fixed in Python2. I suppose 
> that the answer is something ugly like "backward compatibility" or "historical 
> reasons" :-)

Backwards compatibility.

Python2 applications don't expect unichr(i)
to return anything other than a single character. If you need this
in Python2, it's easy enough to get around, though, with a little
helper function.
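
One possible shape for such a helper, as a sketch only: the name
wide_unichr and the _unichr alias are made up for illustration, and the
alias is there just so the same code also runs on Python 3. On a build
where sys.maxunicode is 0xFFFF it returns a surrogate pair for non-BMP
code points; elsewhere it returns a single character.

```python
import sys

try:
    _unichr = unichr   # Python 2
except NameError:
    _unichr = chr      # Python 3

def wide_unichr(i):
    """Return the text for code point i, using a surrogate pair
    when the build cannot hold i in a single code unit."""
    if i <= sys.maxunicode:
        return _unichr(i)
    if not 0x10000 <= i <= 0x10FFFF:
        raise ValueError("code point out of range: %r" % (i,))
    # Split the 20 bits above 0x10000 into high and low surrogates.
    i -= 0x10000
    return _unichr(0xD800 + (i >> 10)) + _unichr(0xDC00 + (i & 0x3FF))
```

Callers then have to be prepared for a result of length 2, which is
exactly what existing Python2 code calling unichr(i) does not expect.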

-- 
Marc-Andre Lemburg
eGenix.com
