break unichr instead of fix ord?
rurpy at yahoo.com
Sun Aug 30 02:12:24 CEST 2009
On 08/29/2009 12:06 PM, Steven D'Aprano wrote:
>> The reasons for the current behavior so far:
>>> What you propose would break the property "unichr(i) always returns a
>>> string of length one, if it returns anything at all".
>> Yes. And I don't see the problem with that. Why is that property more
>> desirable than the non-existent property that a Unicode literal always
>> produces one python character?
> What do you mean? Unicode literals don't always produce one character,
> e.g. u'abcd' is a Unicode literal with four characters.
I'm sorry, I should have been clearer. I meant the literal
representation of a *single* unicode character. u'\u4000'
which results in a string of length 1, vs u'\U00010040' which
results in a string of length 2. In both cases the literal
represents a single unicode code point.
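
To illustrate the difference, here is a minimal sketch. It assumes a current Python 3 interpreter, where len() counts code points, so it counts UTF-16 code units instead as a stand-in for what len() reports on a narrow build. The helper name utf16_units is made up for this example.

```python
# Hypothetical helper: number of 16-bit code units needed for a string
# in UTF-16, which is what len() reports on a narrow build.
def utf16_units(s):
    # utf-16-le emits no byte order mark; each code unit is 2 bytes.
    return len(s.encode("utf-16-le")) // 2

bmp = "\u4000"          # inside the BMP
non_bmp = "\U00010040"  # outside the BMP

print(utf16_units(bmp))      # 1
print(utf16_units(non_bmp))  # 2
```

Both literals name exactly one code point; only the number of 16-bit units differs.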
> I think it's fairly self-evident that a function called uniCHR [emphasis
> added] should return a single character (technically a single code
There are two concepts of characters here, the 16-bit things
that encode a character in a Python unicode string (in a
narrow build Python), and a character in the sense of one
of the ~2**20 unicode characters. Python has chosen to
represent the latter (when outside the BMP) as a pair of
surrogate characters from the former. I don't see why one
would assume that CHR would mean the python 16-bit
character concept rather than the full unicode character
concept. In fact, rather the opposite.
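
The surrogate representation mentioned above follows the standard UTF-16 arithmetic. A rough sketch, with split_surrogates being a name invented for this example rather than anything in Python itself:

```python
# Hypothetical helper: split a non-BMP code point into the two 16-bit
# surrogate values that UTF-16 (and a narrow build) uses to store it.
def split_surrogates(cp):
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %r" % (cp,))
    offset = cp - 0x10000                # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

print([hex(u) for u in split_surrogates(0x10040)])   # ['0xd800', '0xdc40']
print([hex(u) for u in split_surrogates(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

Even the largest code point, 0x10FFFF, fits in a single pair, so two 16-bit units are always enough.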
> But even if you can come up with a reason for unichr() to return
> two or more characters,
I've given a number of reasons why it should return a two
character representation of a non-BMP character, one of
which is that that is how Python has chosen to represent
such characters internally. I won't repeat the others.
I'm not sure why you think more than two characters
would ever be possible.
> this would break code that relies on the
> documented promise that the length of the output of unichr() is always
Ah, OK. This is the good reason I was looking for.
I did not realize (until prompted by your remark
to go back and look at the early docs) that unichr
had been documented to return a single character
since 2.0 and that wide character support was added
in 2.2. Martin v. Loewis also implied that, I now
see, although the implication was too deep for me
to pick up.
So although it leads to a suboptimal situation, I
agree that maintaining the documented behavior was
> I would much rather see a pair of new functions, wideord() and
> widechr() used for converting between surrogate pairs and numbers.
I guess if it were still 2001 and Python 2.2 was
coming out I would be in favor of this too. :-)
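
For what it's worth, a rough sketch of what such a pair of functions might look like. Neither wideord() nor widechr() exists in Python; the names come from the suggestion quoted above, and the behavior shown (always using a surrogate pair outside the BMP) is my assumption. It is written for Python 3, where chr() accepts surrogate values.

```python
# Hypothetical functions, not part of Python.
def widechr(cp):
    """Return a 1-unit string for BMP code points, a surrogate pair otherwise."""
    if cp < 0x10000:
        return chr(cp)
    if cp > 0x10FFFF:
        raise ValueError("code point out of range: %r" % (cp,))
    offset = cp - 0x10000
    return chr(0xD800 + (offset >> 10)) + chr(0xDC00 + (offset & 0x3FF))

def wideord(s):
    """Inverse of widechr: accept a single character or a surrogate pair."""
    if len(s) == 1:
        return ord(s)
    if len(s) == 2:
        high, low = ord(s[0]), ord(s[1])
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    raise TypeError("expected one character or a surrogate pair")

pair = widechr(0x10040)
print(len(pair))           # 2
print(hex(wideord(pair)))  # 0x10040
```

With something like this, unichr() could keep its documented length-one promise while callers that need non-BMP characters opt in explicitly.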