[Python-Dev] len(chr(i)) = 2?

Mon Nov 22 13:47:29 CET 2010

Martin,

it is really irrelevant whether the standards have decided
to no longer use the terms UCS-2 and UCS-4 in their latest
standard documents.

The definitions still stand (just like Unicode 2.0 is still a valid
standard, even if it's ten years old):

* UCS-2 is defined as "Universal Character Set coded in 2 octets"
by ISO 10646 (see http://www.unicode.org/versions/Unicode5.2.0/appC.pdf)

* UCS-4 is defined as "Universal Character Set coded in 4 octets"
by ISO 10646.

Those two terms have been in use for many years. They refer to
the Unicode character set as it can be represented in 2 or 4
bytes. As such they don't include any of the special meanings
associated with the UTF transfer encodings. There are no invalid
sequences, no invalid code points, etc. as you can find in the UTF
encodings. And that's an important detail.

If you interpret them as encodings, they are 1-1 mappings of
Unicode code point ordinals to integers represented using
2 or 4 bytes.
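
To illustrate that 1-1 mapping, here is a rough sketch. The helper
names are made up for this mail and the big-endian byte order is just
an assumption for the example; neither comes from the standard or from
the Python sources:

```python
import struct

def ucs2_bytes(cp):
    """Store a code point ordinal as one 2-byte integer (hypothetical helper)."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("ordinal does not fit into 2 bytes")
    return struct.pack(">H", cp)  # big-endian is an assumption

def ucs4_bytes(cp):
    """Store a code point ordinal as one 4-byte integer (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point ordinal")
    return struct.pack(">I", cp)  # big-endian is an assumption

print(ucs2_bytes(0x20AC).hex())   # EURO SIGN
print(ucs4_bytes(0x1D11E).hex())  # MUSICAL SYMBOL G CLEF, outside the BMP
# A lone surrogate ordinal is just another integer in this view:
print(ucs2_bytes(0xD800).hex())
```

Note that the mapping itself never rejects an ordinal such as 0xD800;
any such restriction only comes in with the UTF encodings.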

UCS-2 only supports BMP code points and can conveniently
be interpreted as UTF-16, if you need to encode non-BMP
code points (which we do in the UTF codecs).

UCS-4 also supports non-BMP code points directly.
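
A minimal sketch of the difference, assuming the usual UTF-16
surrogate arithmetic (the function names are hypothetical and only
serve the illustration):

```python
def utf16_code_units(cp):
    """16-bit code units for one code point under the UTF-16 interpretation."""
    if cp < 0x10000:
        return [cp]                      # BMP: one code unit
    offset = cp - 0x10000                # 20 bits to spread over two units
    high = 0xD800 + (offset >> 10)       # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)      # low (trail) surrogate
    return [high, low]

def ucs4_code_units(cp):
    """With 4-byte code units, every code point is exactly one unit."""
    return [cp]

g_clef = 0x1D11E  # MUSICAL SYMBOL G CLEF, a non-BMP code point

print([hex(u) for u in utf16_code_units(g_clef)])  # ['0xd834', '0xdd1e']
print([hex(u) for u in ucs4_code_units(g_clef)])   # ['0x1d11e']
```

This is why the length of a one-character string can come out as 2 when
the storage uses 2-byte code units, and as 1 with 4-byte code units.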

Now, from an ISO or Unicode Consortium point of view, deprecating
the term UCS-2 in *their* standard papers is only natural, since
they are actively starting to assign non-BMP code points which
cannot be represented in UCS-2.

However, this deprecation is only relevant for the purpose of defining
the standard. The above definitions are still useful
when it comes to defining code units, i.e. the storage format in use
(as opposed to the transfer format).

For the purpose of describing the code units we are using in Python
they are (still) the most correct terms and that's also the reason
why we chose to use them when introducing the configure options
in Python2.

There are no other accurate definitions we could use. The terms
"narrow" and "wide" are simply too inaccurate to be used as
description of UCS-2 and UCS-4 code units.

Please also note that we have used the terms UCS-2 and UCS-4 in Python2
for 9+ years now and users are just starting to learn the difference
and get acquainted with the fact that Python uses these two forms.

Confronting them with "narrow" and "wide" builds is only
going to cause more confusion, not less, and adding those
strings to Python package files isn't going to help much either,
since the terms don't convey any relationship to Unicode:

package-3.1.3.linux-x86_64-py2.6_ucs2.egg
vs.
package-3.1.3.linux-x86_64-py2.6_narrow.egg
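
For illustration only, such a tag could be derived along these lines.
The helper is hypothetical (no packaging tool is claimed to work this
way); it assumes that sys.maxunicode is 0xFFFF on a build with 2-byte
code units and 0x10FFFF otherwise:

```python
import sys

def unicode_build_tag(maxunicode=None):
    """Return "ucs2" or "ucs4" for the given (or the running) build.

    Hypothetical helper; relies on the assumption that sys.maxunicode
    is 0xFFFF for 2-byte code units and 0x10FFFF for 4-byte code units.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return "ucs2" if maxunicode == 0xFFFF else "ucs4"

# Build a file name in the style shown above (made-up package data).
print("package-3.1.3.linux-x86_64-py2.6_%s.egg" % unicode_build_tag(0xFFFF))
```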

I opt for switching to the following config options:

--with-unicode=ucs2 (default)
--with-unicode=ucs4

and using "UCS-2" and "UCS-4" in the Python documentation when
describing the two different build modes.  We can add glossary
entries for the two which clarify the differences.

Python2 used --enable-unicode=ucs2/ucs4, but since Python3 doesn't
build without Unicode support, the above two versions appear more
appropriate.

We can keep the alternative --with-wide-unicode as an alias
for --with-unicode=ucs4 to maintain 3.x backwards compatibility.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com


"Martin v. Löwis" wrote:
> On 22.11.2010 11:48, Stephen J. Turnbull wrote:
>> Raymond Hettinger writes:
>>
>>  > Neither UTF-16 nor UCS-2 is exactly correct anyway.
>>
>> From a standards lawyer point of view, UCS-2 is exactly correct, as
>> far as I can tell upon rereading ISO 10646-1, especially Annexes H
>> ("retransmitting devices") and Q ("UTF-16").  Annex Q makes it clear
>> that UTF-16 was intentionally designed so that Python-style processing
>> could be done in a UCS-2 context.
> 
> I could only find the FCD of 10646:2010, where annex H was integrated
> into section 10:
> 
> http://www.itscj.ipsj.or.jp/sc2/open/02n4125/FCD10646-Main.pdf
> 
> There they have stopped using the term UCS-2, and added a note
> 
> # NOTE – Former editions of this standard included references to a
> # two-octet BMP form called UCS-2 which would be a subset
> # of the UTF-16 encoding form restricted to the BMP UCS scalar values.
> # The UCS-2 form is deprecated.
> 
> I think they are now acknowledging that UCS-2 was a misleading term,
> making it ambiguous whether this refers to a CCS, a CEF, or a CES;
> like "ASCII", people have been using it for all three of them.
> 
> Apparently, the ISO WG interprets earlier revisions as saying that
> UCS-2 is a CEF that restricted UTF-16 to the BMP. THIS IS NOT WHAT
> PYTHON DOES. In a narrow Python build, the character set is *not*
> restricted to the BMP. Instead, Unicode strings are meant to be
> interpreted (by applications) as UTF-16.
> 
>>  > For the "wide" build, the entire range of unicode is encoded at
>>  > 4 bytes per character and slicing/len operate correctly since
>>  > every character is the same length.   This used to be called UCS-4
>>  > and is now UTF-32.
>>
>> That's inaccurate, I believe.  UCS-4 is not a UTF, and doesn't satisfy
>> the range restrictions of a UTF.
> 
> Not sure what it says in your copy; in mine, section 9.3 says
> 
> # 9.3 UTF-32 (UCS-4)
> # UTF-32 (or UCS-4) is the UCS encoding form that assigns each UCS
> # scalar value to a single unsigned 32-bit code unit. The terms UTF-32
> # and UCS-4 can be used interchangeably to designate this encoding
> # form.
> 
> so they (now) view the two as synonyms.
> 
> I think that when ISO 10646 started, they were also fairly confused
> about these issues (as the group/plane/row/cell structure demonstrates,
> IMO). This is not surprising, since the notion of byte-based character
> sets had been ingrained for so long. It took 20 years to learn that
> a UCS scalar value really is *not* a sequence of bytes, but a natural
> number.
> 
>> However, I don't see how "narrow" tells us more than "UCS-2" does.  If
>> "UCS-2" is equally (or more) informative, I prefer it because it is
>> the technically precise, already well-defined, term.
> 
> But it's not. It is a confusing term, one that the relevant standards
> bodies are abandoning. After reading FCD 10646:2010, I could agree to
> call the two implementations UTF-16 and UTF-32 (as these terms
> designate CEFs). Unfortunately, they also designate CESs.
> 
>> If we have to document what the terms we choose mean anyway, why not
>> document the existing terms and reduce entropy, rather than invent new
>> ones and increase entropy?
> 
> Because the proposed existing term is deprecated.
> 
> Regards,
> Martin