[Python-ideas] Adding str.isascii() ?

M.-A. Lemburg mal at egenix.com
Fri Jan 26 08:43:25 EST 2018


On 26.01.2018 14:31, Victor Stinner wrote:
> 2018-01-26 12:17 GMT+01:00 INADA Naoki <songofacandy at gmail.com>:
>>> No, because you can pass in maxchar to PyUnicode_New() and
>>> the implementation will take this as a hint to the max code point
>>> used in the string. There is no check done whether maxchar
>>> is indeed the minimum upper bound to the code point ordinals.
>>
>> API doc says:
>>
>> """
>> maxchar should be the true maximum code point to be placed in the string.
>> As an approximation, it can be rounded up to the nearest value in the
>> sequence 127, 255, 65535, 1114111.
>> """
>> https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_New
>>
>> Since the doc says *should*, strings created with a wrong maxchar
>> are considered invalid objects.
> 
> PyUnicode objects must always use the most efficient storage. It's a
> very strong requirement of PEP 393. As Naoki wrote, many functions
> rely on this assumption to implement fast paths.
> 
> The assumption is even implemented in the debug check
> _PyUnicode_CheckConsistency():
> 
> https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537f756e805de/Objects/unicodeobject.c#L453-L485

If that's indeed being used as an assumption, the docs must be
fixed and PyUnicode_New() should verify this assumption as
well - not only in debug builds using C assert() calls :-)

Going through the code, I saw a lot of calls to
find_maxchar_surrogates() before calling PyUnicode_New().
This call would then have to be moved inside PyUnicode_New()
instead.
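
The scan-then-allocate pattern described above can be modelled
roughly in Python. This is only an illustrative sketch, not CPython
code: scan_maxchar(), round_maxchar() and is_ascii_by_maxchar() are
hypothetical helper names, and the only values taken from the docs
are the four upper bounds quoted earlier in this thread.

```python
# Illustrative model only: these helpers are hypothetical and are not
# part of CPython. They mimic scanning a buffer for its largest code
# point before choosing a storage width.

# Upper bounds named in the PyUnicode_New() documentation.
MAXCHAR_BUCKETS = (127, 255, 65535, 1114111)


def scan_maxchar(code_points):
    """Return the largest code point in the input, or 0 if it is empty."""
    return max(code_points, default=0)


def round_maxchar(maxchar):
    """Round maxchar up to the nearest documented upper bound."""
    for bound in MAXCHAR_BUCKETS:
        if maxchar <= bound:
            return bound
    raise ValueError("code point out of range: %r" % (maxchar,))


def is_ascii_by_maxchar(code_points):
    """True if the scanned maximum falls into the 7-bit bucket."""
    return round_maxchar(scan_maxchar(code_points)) == 127
```

If the scan always happens before allocation, an ASCII check can be
answered from the chosen bucket alone; that is the assumption under
discussion here.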

C extensions can easily create strings using PyUnicode_New()
which do not adhere to such a requirement and then write
arbitrary content using PyUnicode_WRITE(). In some cases,
this may even be necessary, for example when the extension
doesn't know in advance what data will be written because it
is reading it from some external source.
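
To make the concern concrete, here is a small Python model of a
string whose maxchar hint is larger than its contents require. It is
a sketch under stated assumptions: ModelString and its methods are
hypothetical and only imitate the allocate-then-write sequence; they
are not the C API.

```python
# Hypothetical model, not CPython code: a string object that records the
# maxchar hint given at allocation time separately from its contents.
class ModelString:
    def __init__(self, size, maxchar_hint):
        self.maxchar_hint = maxchar_hint
        self.data = [0] * size  # filled in later, one code point at a time

    def write(self, index, code_point):
        self.data[index] = code_point

    def isascii_from_hint(self):
        # Fast path: trusts the hint without looking at the data.
        return self.maxchar_hint <= 127

    def isascii_from_scan(self):
        # Slow path: checks every stored code point.
        return all(cp <= 127 for cp in self.data)


# An extension that does not know its input in advance over-allocates,
# then happens to write only ASCII characters.
s = ModelString(2, maxchar_hint=0x10FFFF)
s.write(0, ord("o"))
s.write(1, ord("k"))
```

In this model the hint-based check and the scan-based check disagree
for s, which is the situation a fast path relying on the storage
kind would get wrong.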

I'm not too familiar with the new Unicode code, but it seems
that this requirement is not checked everywhere, e.g. the
resize code doesn't seem to have such checks either (only in
debug versions).

-- 
Marc-Andre Lemburg
eGenix.com
