break unichr instead of fix ord?

Thu Aug 27 20:49:26 EDT 2009

On 08/26/2009 11:51 PM, "Martin v. Löwis" wrote:
 >[...]
 >>  But regardless, the significant question is, what is
 >>  the reason for having ord() (and unichr) not work for
 >>  surrogate pairs and thus not usable with a large number
 >>  of unicode characters that Python otherwise supports?
 >
 > See PEP 261, http://www.python.org/dev/peps/pep-0261/
 > It specifies all this.

The PEP (AFAICT) says only what we already know... that
on narrow builds unichr() will raise an exception with
an argument >= 0x10000, and ord() is unichr()'s inverse.

I have read the PEP twice now and still see no justification
for that decision; it appears to have been made by fiat.[*1]

Could you or someone please point me to specific justification
for having unichr and ord work only for a subset of unicode
characters on narrow builds, as opposed to the more general
and IMO useful behavior proposed earlier in this thread?

----------------------------------------------------------
[*1]
The PEP says:
     * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
       length-one string.

     * unichr(i) for 2**16 <= i <= TOPCHAR will return a
       length-one string on wide Python builds. On narrow
       builds it will raise ValueError.
and
     * ord() is always the inverse of unichr()

which of course we know; that is the current behavior.  But
there is no reason given for that behavior.

Under the second *unicode bullet point, there are two issues
raised:
   1) Should surrogate pairs be disallowed on narrow builds?
That appears to have been answered in the negative and is
not relevant to my question.
   2) Should access to code points above TOPCHAR be allowed?
Not relevant to my question.

     * every Python Unicode character represents exactly
       one Unicode code point (i.e. Python Unicode
       Character = Abstract Unicode character)

I'm not sure what this means (what's an abstract unicode
character?).  If it mandates that u'\ud800\udc40' be
treated as a len() 2 string, that is the current case,
but it does not say anything about how unichr and ord
should behave.  If it mandates that that string must
always be treated as two separate code points, then
Python itself violates it by printing that string as
u'\U00010040' rather than u'\ud800\udc40'.
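
For reference, the arithmetic tying that pair to the single code
point is the standard UTF-16 one.  A rough sketch, using plain
integers so it does not depend on the build; the function names
are mine, not anything in Python or the PEP:

```python
def split_surrogates(cp):
    """Return the (high, low) UTF-16 surrogates for a code point above 0xFFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary range")
    offset = cp - 0x10000              # 20 significant bits remain
    # Top 10 bits go in the high surrogate, bottom 10 in the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def join_surrogates(high, low):
    """Inverse of split_surrogates()."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

join_surrogates(0xD800, 0xDC40) gives 0x10040, which is the code
point Python shows when it prints the pair.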

Finally we read:

     * There is a convention in the Unicode world for
       encoding a 32-bit code point in terms of two
       16-bit code points. These are known as
       "surrogate pairs". Python's codecs will adopt
       this convention.

Is a distinction made between Python and Python
codecs with only the latter having any knowledge of
surrogate pairs?  I guess that would explain why
Python prints a surrogate pair as a single character.
But this seems arbitrary and counter-useful if
applied to ord() and unichr().  What possible
use-case is there for *not* recognizing surrogate
pairs in those two functions?
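
To make concrete what I mean by recognizing surrogate pairs,
here is a rough sketch of wrappers.  wide_unichr and wide_ord
are hypothetical names of my own, not an existing API, and the
sketch assumes supplementary characters show up as two-unit
strings, as they do on a narrow build:

```python
try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3 has no unichr; chr plays the same role


def wide_unichr(i):
    """Like unichr(), but return a surrogate pair for i above 0xFFFF."""
    if i <= 0xFFFF:
        return unichr(i)
    if i > 0x10FFFF:
        raise ValueError("argument not in range(0x110000)")
    offset = i - 0x10000
    return unichr(0xD800 + (offset >> 10)) + unichr(0xDC00 + (offset & 0x3FF))


def wide_ord(s):
    """Like ord(), but accept a two-unit surrogate pair as one character."""
    if len(s) == 2:
        high, low = ord(s[0]), ord(s[1])
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return ord(s)     # length-one strings; anything else fails as ord() does
```

With these, wide_ord(wide_unichr(i)) == i for every i up to
0x10FFFF, so the pair remains each other's inverse without
raising on the upper range.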

Nothing else in the PEP seems remotely relevant.