[Python-Dev] PEP 393 review

Thu Aug 25 23:30:13 CEST 2011

Stefan Behnel, 25.08.2011 20:47:
> "Martin v. Löwis", 24.08.2011 20:15:
>> - issues to be considered (unclarities, bugs, limitations, ...)
>
> A problem of the current implementation is the need for calling
> PyUnicode_(FAST_)READY(), and the fact that it can fail (e.g. due to
> insufficient memory). Basically, this means that even something as trivial
> as trying to get the length of a Unicode string can now result in an error.

Oh, and the same applies to PyUnicode_AS_UNICODE() now. I doubt that there 
is *any* code out there that expects this macro to ever return NULL. This 
means that the current implementation has actually broken the old API. Just 
allocate an "80% of your memory" long string using the new API and then 
call PyUnicode_AS_UNICODE() on it to see what I mean.
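
To make the failure mode concrete, here is a rough, self-contained sketch. 
It is not code from the pep-393 branch: mock_str, mock_alloc, 
mock_fail_alloc, mock_as_legacy and first_char_checked are all hypothetical 
names that only model an accessor which builds the wide-character buffer on 
first use and can therefore return NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for a string object that stores compact 1-byte
   data and builds a wchar_t buffer only when legacy code asks for it. */
typedef struct {
    size_t length;
    const unsigned char *compact;  /* always present in this sketch */
    wchar_t *legacy;               /* built lazily, NULL until requested */
} mock_str;

/* Hypothetical switch so the out-of-memory path can be exercised. */
int mock_fail_alloc = 0;

void *mock_alloc(size_t n)
{
    return mock_fail_alloc ? NULL : malloc(n);
}

/* Models the role of PyUnicode_AS_UNICODE(): returns the legacy buffer,
   creating it on first use. Returns NULL if the allocation fails. */
wchar_t *mock_as_legacy(mock_str *s)
{
    assert(s != NULL);
    if (s->legacy == NULL) {
        wchar_t *buf = mock_alloc((s->length + 1) * sizeof *buf);
        if (buf == NULL)
            return NULL;
        for (size_t i = 0; i < s->length; i++)
            buf[i] = s->compact[i];
        buf[s->length] = 0;
        s->legacy = buf;
    }
    return s->legacy;
}

/* A caller written against the old contract would index the result
   directly. Under the new contract it has to check for NULL first;
   this version reports failure as -1. */
int first_char_checked(mock_str *s)
{
    wchar_t *p = mock_as_legacy(s);
    return p == NULL ? -1 : (int)p[0];
}
```

Existing callers have no equivalent of the NULL check in 
first_char_checked(), because the old macro gave them no reason to write one.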

Sadly, a quick look at a couple of recent commits in the pep-393 branch 
suggested that it is not even always obvious to you as the authors which 
macros can be called safely and which cannot. I immediately spotted a bug 
in one of the updated core functions (unicode_repr, IIRC) where 
PyUnicode_GET_LENGTH() is called without a previous call to 
PyUnicode_FAST_READY().

I find it anything but obvious that calling PyUnicode_DATA() and 
PyUnicode_KIND() is safe as long as the return value is being checked for 
errors, but calling PyUnicode_GET_LENGTH() is not safe unless there was a 
previous call to PyUnicode_READY().
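
As an illustration of the ordering rule I mean, here is a minimal model in 
plain C. It is only a sketch under my own assumptions and not the branch's 
actual layout: lazy_str, lazy_alloc_fails, lazy_ready and lazy_length are 
hypothetical names, and the conversion assumes every character fits in one 
byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical old-style instance: it starts out with only a wide buffer.
   The canonical buffer and its length are filled in by lazy_ready(). */
typedef struct {
    const wchar_t *wide;   /* legacy buffer, always set in this sketch */
    size_t wide_length;
    unsigned char *data;   /* canonical buffer, NULL until ready */
    size_t length;         /* meaningful only once data is set */
} lazy_str;

/* Hypothetical switch so the out-of-memory path can be exercised. */
int lazy_alloc_fails = 0;

/* Plays the role of a "ready" call: 0 on success, -1 on failure. */
int lazy_ready(lazy_str *s)
{
    assert(s != NULL);
    if (s->data != NULL)
        return 0;                      /* already converted */
    if (lazy_alloc_fails)
        return -1;
    unsigned char *buf = malloc(s->wide_length + 1);
    if (buf == NULL)
        return -1;
    for (size_t i = 0; i < s->wide_length; i++)
        buf[i] = (unsigned char)s->wide[i];  /* assumes code points < 256 */
    buf[s->wide_length] = 0;
    s->data = buf;
    s->length = s->wide_length;
    return 0;
}

/* Reading s->length directly is only valid after lazy_ready() succeeded.
   This wrapper enforces the order and reports failure as -1. */
long lazy_length(lazy_str *s)
{
    if (lazy_ready(s) < 0)
        return -1;
    return (long)s->length;
}
```

Note that a function which used to return a plain length now needs an error 
value, and every caller has to check for it.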

> I just noticed this when rewriting Cython's helper function that searches a
> unicode string for a (Py_UCS4) character. Previously, the entire function
> was safe, could never produce an error and therefore always returned a
> boolean result. In the new world, the caller of this function must check
> and propagate errors. This may not be a major issue in most cases, but it
> can have a non-trivial impact on user code, depending on how deep in a call
> chain this happens and on how much control the user has over the call chain
> (think of a C callback, for example).
>
> Also, even in the case that there is no error, the potential need to build
> up the string on request means that the run time and memory requirements of
> an algorithm are less predictable now as they depend on the origin of the
> input and not just its Python level string content.
>
> I would be happier with an implementation that avoided this by always
> instantiating the data buffer right from the start, instead of carrying
> only a Py_UNICODE buffer for old-style instances.

Stefan