break unichr instead of fix ord?

rurpy at yahoo.com
Sat Aug 29 10:38:51 EDT 2009


On 08/28/2009 02:12 AM, "Martin v. Löwis" wrote:

[I reordered the quotes from your previous post to try
and get the responses in a more coherent order.  No
intent to take anything out of context...]

 >>  Nothing else in the PEP seems remotely relevant.
[to providing justification for the behavior of
unichr/ord]
 >
 > Except for the motivation, of course :-)
 >
 > In addition: your original question was "why has this
 > been changed", to which the answer is "it hasn't".

My original interest was two-fold.  First: can unichr/ord
be changed to work in a more general and helpful way?  That
seemed remotely possible until it was pointed out that
the two behave consistently, and that the behavior is
accurately documented.  Second: why do they work the way
they do, when they could have been generalized to cover the
full Unicode space?  An inadequate answer to this would have
provided support for the first point, and it remains
interesting to me for the reason below.

 > Then, the next question is "why is it implemented that
 > way", to which the answer is "because the PEP says so".

Not at all a satisfying answer unless one believes
in PEPal infallibility. :-)

 > Only *then* the question is "what is the rationale for
 > the PEP specifying things the way it does". The PEP is
 > relevant so that we can both agree that Python behaves
 > correctly (in the sense of behaving as specified).

But my question had become: why that behavior, when a
slightly different behavior would be more general with
little apparent downside?

To clarify, my interest in the justification for the
current behavior is this:

I think the best feature of Python is not, as commonly
stated, the clean syntax, but rather the fairly complete
and orthogonal libraries.  I often find, after I have
written some code, that because the right library functions
were available, it turns out much shorter and more concise
than I expected.

Nevertheless, every now and then, perhaps more often than in
some other languages (I'm not sure), I run into something
that requires what seems to be excessive coding -- I have to
do something that it seems to me a library function should
have done for me.  Sometimes this is because I don't
understand the reason the library function needs to work the
way it does.  Other times it is one of the countless trade-
offs made in the design of the language, which didn't happen
to go the way that would have been beneficial to me in a
particular coding situation.

But sometimes (and it feels like too often) it seems as
though, zen notwithstanding, purity -- adherence to some
philosophic ideal -- beat practicality.
unichr/ord seems such a case to me, but I want to be
sure I am not missing something.

The reasons for the current behavior so far:

1.
> What you propose would break the property "unichr(i) always returns
> a string of length one, if it returns anything at all".

Yes.  And I don't see the problem with that.  Why is
that property more desirable than the (non-existent)
property that a Unicode literal always produces one
Python character?  The difference would arise only on a
narrow build with a Unicode character outside the BMP --
exactly the condition under which a Unicode literal can
already "behave differently" by producing two Python
characters.

2.
> >  But there is no reason given [in the PEP] for that behavior.
> Sure there is, right above the list:
> "Most things will behave identically in the wide and narrow worlds."
> That's the reason: scripts should work the same as much as possible
> in wide and narrow builds.

So what else would work "differently"?  My point was
that extending unichr/ord to work with all Unicode
characters reduces differences far more often than
it increases them.
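
A concrete case of the kind of difference I mean (Python 2,
sketch only):

    s = u'\U0001D11E'   # one character to the programmer...
    ord(s)              # wide build: 119070
                        # narrow build: TypeError, because s
                        # is a two-code-unit surrogate pair

A generalized ord() that accepted a surrogate pair would
return 119070 on both builds, removing one wide/narrow
difference instead of adding one.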

3.
>>       * There is a convention in the Unicode world for
>>         encoding a 32-bit code point in terms of two
>>         16-bit code points. These are known as
>>         "surrogate pairs". Python's codecs will adopt
>>         this convention.
>>
>>  Is a distinction made between Python and Python
>>  codecs with only the latter having any knowledge of
>>  surrogate pairs?
>
> No. In the end, the Unicode type represents code units,
> not code points, i.e. half surrogates are individually
> addressable. Codecs need to adjust to that; in particular
> the UTF-8 and the UTF-32 codec in narrow builds, and the
> UTF-16 codec in wide builds (which didn't exist when the
> PEP was written).

OK, so that is not a reason either.
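
(For the record, the "code units, not code points" behavior
is easy to observe on a narrow build, where the two halves
of the pair are individually addressable and the codec
reassembles them -- Python 2:)

    >>> s = u'\U0001D11E'         # MUSICAL SYMBOL G CLEF
    >>> len(s), s[0], s[1]        # two separate code units
    (2, u'\ud834', u'\udd1e')
    >>> s.encode('utf-16-le')     # the codec rejoins the pair
    '4\xd8\x1e\xdd'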

4.
I'll speculate a little.
If surrogate handling were added to ord/unichr, it would
be the top of a slippery slope leading to demands that
other string functions also handle surrogates.

But this is not true -- there is a clear distinction
between ord/unichr and the other string methods.  The
latter deal with strings of multiple characters, while the
former deal only with single characters (taking a surrogate
pair as a single Unicode character).

The behavior of ord/unichr is independent of the other
string methods -- if those methods were changed with regard
to surrogate handling, they would all have to be changed
together to maintain consistent behavior.  unichr/ord
affect only each other.

The functions of ord/unichr -- mapping between characters
and numbers -- are fundamental string operations, akin to
indexing or extracting a substring.  So why would one want
to limit them to a subset of characters if not absolutely
necessary?
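
For concreteness, here is a rough sketch of the
generalization I have in mind.  The names wide_unichr and
wide_ord are mine, and this is only meant to illustrate the
surrogate arithmetic, not to propose an implementation:

    import sys

    def wide_unichr(cp):
        # Outside the BMP on a narrow build, return the
        # surrogate pair for the code point; otherwise
        # defer to the built-in unichr().
        if sys.maxunicode < 0x10FFFF and cp > 0xFFFF:
            cp -= 0x10000
            return (unichr(0xD800 + (cp >> 10)) +
                    unichr(0xDC00 + (cp & 0x3FF)))
        return unichr(cp)

    def wide_ord(s):
        # Accept a single code unit, or a high/low
        # surrogate pair taken as one character.
        if (len(s) == 2 and u'\ud800' <= s[0] <= u'\udbff'
                        and u'\udc00' <= s[1] <= u'\udfff'):
            return (0x10000
                    + ((ord(s[0]) - 0xD800) << 10)
                    + (ord(s[1]) - 0xDC00))
        return ord(s)

With something like this, wide_ord(wide_unichr(i)) round-
trips for every code point Python can represent, on both
narrow and wide builds.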

To reiterate, I am not advocating for any change.  I
simply want to understand whether there is a good reason
for limiting the use of unichr/ord on narrow builds to
a subset of the Unicode characters that Python otherwise
supports.  So far, it seems there is not, and unichr/ord
is a poster child for "purity beats practicality".


