[Python-ideas] Adding str.isascii() ?

M.-A. Lemburg mal at egenix.com
Fri Jan 26 09:18:13 EST 2018


On 26.01.2018 14:55, Victor Stinner wrote:
> 2018-01-26 14:43 GMT+01:00 M.-A. Lemburg <mal at egenix.com>:
>> If that's indeed being used as assumption, the docs must be
>> fixed and PyUnicode_New() should verify this assumption as
>> well - not only in debug builds using C asserts() :-)
> 
> Like PyUnicode_FromStringAndSize(NULL, size), PyUnicode_New(size,
> maxchar) only allocates memory with uninitialized characters.
> 
> I don't see how PyUnicode_New() could check the string content, since
> the content is not known yet...

You do have a point there ;-)

I guess making the assumption very clear in the docs would be
a good first step - as Chris suggested.

> The new public C APIs added by PEP 393 are hard to use correctly, but
> they are the most efficient. Functions like PyUnicode_FromString() are
> simple to use and very hard to misuse :-) PyPy developers asked me to
> simply drop all of these new public C APIs and make them private, or
> at least deprecate them. But I never looked in depth at the new API.
> I don't know if Cython uses it, for example.

Dropping them would most likely seriously limit the usefulness
of the Unicode API. If you always have to copy strings to create
objects, this would make text-intensive work very slow. The usual
approach is a three-step process (sketched below):

1. create a container object of sufficient size
2. write data
3. resize container to actual size
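
For illustration, here is a rough sketch of that pattern using the
public PEP 393 APIs. build_str() is a made-up helper, and it assumes
the caller already knows the exact maxchar of the characters it is
going to write - which is precisely the requirement we're discussing:

#include <Python.h>

/* Sketch only: build a str from a UCS4 buffer that stops at a 0
   sentinel. maxchar is assumed to be the exact maximum of the
   characters actually written - the PEP 393 invariant discussed
   in this thread. */
static PyObject *
build_str(const Py_UCS4 *buf, Py_ssize_t capacity, Py_UCS4 maxchar)
{
    /* 1. create a container object of sufficient size */
    PyObject *s = PyUnicode_New(capacity, maxchar);
    if (s == NULL)
        return NULL;

    /* 2. write data */
    int kind = PyUnicode_KIND(s);
    void *data = PyUnicode_DATA(s);
    Py_ssize_t n = 0;
    while (n < capacity && buf[n] != 0) {
        PyUnicode_WRITE(kind, data, n, buf[n]);
        n++;
    }

    /* 3. resize container to actual size */
    if (n != capacity && PyUnicode_Resize(&s, n) < 0) {
        Py_DECREF(s);
        return NULL;
    }
    return s;
}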

I guess marking objects returned by PyUnicode_New() as "not
ready" would help resolve the issue. Whenever the maxchar
check is applied, the ready flag could then be set. The resize
operations would then have to apply the maxchar check as well.

Unfortunately, many of the readiness checks are only available
in debug builds, but at least it's a way forward to make
the API more robust.

> Some APIs are still private like _PyUnicodeWriter which allows to
> create a string in multiple steps with a smart strategy to reduce or
> even avoid realloc() and conversions from the different storage types
> (UCS1, UCS2, UCS4). This API is very efficient, but also hard to use.
> 
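As an aside, and only as a rough sketch based on the declarations in
unicodeobject.h (private API, so the names may change): using the
writer to build e.g. "x = " + str(obj) without knowing the final
length or maxchar upfront looks roughly like this:

#include <Python.h>

/* Sketch, private API: build "x = " + str(obj) without knowing
   the final length or maxchar upfront. */
static PyObject *
writer_demo(PyObject *obj)
{
    _PyUnicodeWriter writer;
    _PyUnicodeWriter_Init(&writer);
    writer.min_length = 8;        /* size hint to reduce reallocs */
    writer.overallocate = 1;

    if (_PyUnicodeWriter_WriteASCIIString(&writer, "x = ", 4) < 0)
        goto error;

    PyObject *text = PyObject_Str(obj);
    if (text == NULL)
        goto error;
    int rc = _PyUnicodeWriter_WriteStr(&writer, text);
    Py_DECREF(text);
    if (rc < 0)
        goto error;

    /* Finish() hands back a string using the smallest storage
       (ASCII/UCS1/UCS2/UCS4) that fits what was actually written. */
    return _PyUnicodeWriter_Finish(&writer);

error:
    _PyUnicodeWriter_Dealloc(&writer);
    return NULL;
}
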
>> C extensions can easily create strings using PyUnicode_New()
>> which do not adhere to such a requirement and then write
>> arbitrary content using PyUnicode_WRITE(). In some cases,
>> this may even be necessary, say in case the extension doesn't
>> know what data is being written, reading it from some external
>> source.
> 
> It would be a bug in the C extension.

Is there a way to call an API which fixes up the maxchar setting
(a public version of unicode_adjust_maxchar())?

Without this, how would an extension be able to provide a
correct maxchar value upfront without knowing the content?
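
When the data is at least available upfront (e.g. in a UCS4 buffer),
the extension can of course do the scan itself before calling
PyUnicode_New() - a sketch:

#include <Python.h>

/* Sketch: compute the exact maxchar by scanning the data first.
   A public unicode_adjust_maxchar() would make this unnecessary
   in cases where the data only becomes known while writing. */
static Py_UCS4
scan_maxchar(const Py_UCS4 *buf, Py_ssize_t len)
{
    Py_UCS4 maxchar = 0;
    for (Py_ssize_t i = 0; i < len; i++) {
        if (buf[i] > maxchar)
            maxchar = buf[i];
    }
    return maxchar;
}

/* ... then: PyUnicode_New(len, scan_maxchar(buf, len)) ... */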

>> I'm not too familiar with the new Unicode code, but it seems
>> that this requirement is not checked everywhere, e.g. the
>> resize code doesn't seem to have such checks either (only in
>> debug versions).
> 
> It must be checked everywhere. If it's not the case, it's an obvious
> bug in CPython.
> 
> If you spotted a bug, please report a bug ;-)

Yes, will do.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Jan 26 2018)
>>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>>> Python Database Interfaces ...           http://products.egenix.com/
>>> Plone/Zope Database Interfaces ...           http://zope.egenix.com/
________________________________________________________________________

::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
                      http://www.malemburg.com/


