[Python-Dev] PyUnicode_GetMax() and PyUnicode_FromOrdinal() Was: Breaking undocumented API

Tue Nov 16 21:06:15 CET 2010

Alexander Belopolsky wrote:
> On Tue, Nov 16, 2010 at 1:57 PM, M.-A. Lemburg <mal at egenix.com> wrote:
>> Alexander Belopolsky wrote:
>>> On Tue, Nov 16, 2010 at 1:06 PM, M.-A. Lemburg <mal at egenix.com> wrote:
>>> ..
>>>> Now, we can't use a macro for [PyUnicode_GetMax()], since the information has
>>>> to be available as a callable in order for applications or extensions
>>>> to use it (without a recompile).
>>>>
>>>
>>> .. but it *is* a macro resolving to either PyUnicodeUCS2_GetMax or
>>> PyUnicodeUCS4_GetMax.
>>
>> That doesn't count :-) It's only a trick to prevent external code
>> from using the wrong Unicode APIs.
>>
>> There still is a real function behind the renaming.
>>
>>> What is the scenario in which we may want to change
>>> what PyUnicodeUCS?_GetMax returns and have extensions pick up the
>>> change without a recompile?
>>
>> If an extension uses the stable ABI, it will want to know
>> whether the interpreter was built for UCS2 or UCS4 (even if
>> it doesn't use the Unicode APIs directly).
>>
>>> The UCS2 case will certainly never change
>>> since it is already 0xFFFF.  Is it possible that UCS4 will be expanded
>>> beyond 0x10FFFF?
>>
>> Well, the Unicode Consortium decided to not go beyond 0x10FFFF,
>> but then you never know... when they started out on the quest,
>> 16 bits appeared more than enough, but they found out relatively
>> quickly that the Asian scripts had enough code points to easily
>> fill that space.
>>
>> Once space is available, it tends to get used sooner or later :-)
>>
>>> Note that we can have both a macro and a function
>>> version.  This is fairly standard practice in Python C-API.
>>
>> Sure, but what for ?
> 
> Note that PyUnicode_FromOrdinal()  is documented (in unicodeobject.h)
> as follows without a reference to PyUnicode_GetMax():
> 
> """
>    Create a Unicode Object from the given Unicode code point ordinal.
> 
>    The ordinal must be in range(0x10000) on narrow Python builds
>    (UCS2), and range(0x110000) on wide builds (UCS4). A ValueError is
>    raised in case it is not.
> """
>
> The implementation actually checks the UCS4 range only.
> 
>     if (ordinal < 0 || ordinal > 0x10ffff) {
>         PyErr_SetString(PyExc_ValueError,
>                         "chr() arg not in range(0x110000)");
>         return NULL;
>     }
> 
> This actually looks like a bug:
> 
> >>> len(chr(0x10FFFF))
> 2
> 
> (on a UCS2 build.)

Yes, it's a documentation bug. I guess someone forgot to update
the comment in unicodeobject.h after the change to have chr()/unichr()
return a 2-char string instead of a 1-char string for non-BMP
code points.
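
To illustrate where the 2-char result comes from, here is a minimal
sketch in Python, assuming the standard UTF-16 surrogate-pair encoding.
The helper name to_utf16_units is hypothetical and not part of any
CPython API; it only mimics the range check quoted above and the way a
narrow build would split a non-BMP ordinal into two 16-bit code units.

```python
def to_utf16_units(ordinal):
    """Return the 16-bit code units a narrow (UCS2) build would store.

    Hypothetical helper for illustration only, not a CPython function.
    """
    # Same range check as the quoted C code: only the UCS4 range is tested.
    if ordinal < 0 or ordinal > 0x10FFFF:
        raise ValueError("chr() arg not in range(0x110000)")
    # BMP code points fit into a single 16-bit unit.
    if ordinal < 0x10000:
        return [ordinal]
    # Non-BMP code points are split into a high and a low surrogate.
    offset = ordinal - 0x10000
    high = 0xD800 | (offset >> 10)
    low = 0xDC00 | (offset & 0x3FF)
    return [high, low]


units = to_utf16_units(0x10FFFF)
print(len(units), [hex(u) for u in units])
```

For 0x10FFFF the helper returns two units, which matches the length of
2 reported for a UCS2 build in the session quoted above.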

> Also, I think PyUnicode_FromOrdinal() should take a Py_UNICODE argument
> rather than an int.

No, an ordinal is a number, not a typed value. We have
PyUnicode_FromUnicode() to create strings from Py_UNICODE*
arrays.

-- 
Marc-Andre Lemburg
eGenix.com
