[Python-Dev] Re: New public PyUnicodeBuilder C API

16 May 2022

On Mon, 16 May 2022 11:13:56 +0200, Victor Stinner wrote:
...
Hi,

I propose adding a new C API to "build a Unicode string". What do you
think? Would it be efficient with any possible Unicode string storage
and any Python implementation?

PyPy has a UnicodeBuilder type in Python, but here I only propose a C
API. Later, if needed, it would be easy to add a Python API for it.

PyPy has UnicodeBuilder to replace the "str += str" pattern, which is
inefficient in PyPy; CPython has a micro-optimization (in ceval.c) to
keep the performance of this pattern acceptable. Adding a Python API was
discussed in 2020, see the LWN article:
https://lwn.net/Articles/816415/

Example without error handling: a naive implementation which doesn't use
the known length of the key and value strings (calling Preallocate may be
more efficient):
---------------------------
    // Format "key=value"
    PyObject *format_with_builder(PyObject *key, PyObject *value)
    {
        assert(PyUnicode_Check(key));
        assert(PyUnicode_Check(value));
        // Allocated on the stack
        PyUnicodeBuilder builder;
        PyUnicodeBuilder_Init(&builder);
        // Overallocation is more efficient if the final length is unknown
        PyUnicodeBuilder_EnableOverallocation(&builder);
        PyUnicodeBuilder_WriteStr(&builder, key);
        PyUnicodeBuilder_WriteChar(&builder, '=');
        // Disable overallocation before the last write
        PyUnicodeBuilder_DisableOverallocation(&builder);

Having to manually enable or disable overallocation doesn't sound right.
Overallocation should be done *before* writing, not after. If there are
N bytes remaining and you write N bytes, then no reallocation should
occur.
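
To make that concrete, here is a minimal sketch of a builder that only
grows when a write does not fit, and adds the extra room at that moment.
It is a plain byte buffer, not the proposed PyUnicodeBuilder API: the
names builder_t, builder_init, builder_write and builder_clear are
hypothetical, and the 25% growth margin is an arbitrary choice for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical byte builder, for illustration only. */
typedef struct {
    char *buf;
    size_t len;     /* bytes written so far */
    size_t cap;     /* bytes allocated */
    int reallocs;   /* number of (re)allocations, kept for demonstration */
} builder_t;

static void builder_init(builder_t *b)
{
    b->buf = NULL;
    b->len = 0;
    b->cap = 0;
    b->reallocs = 0;
}

/* Append n bytes. Returns 0 on success, -1 on allocation failure. */
static int builder_write(builder_t *b, const char *data, size_t n)
{
    if (n > b->cap - b->len) {
        /* The write does not fit: grow *before* writing, and add an
           arbitrary 25% margin on top of what is strictly needed.
           (Integer overflow checks are omitted in this sketch.) */
        size_t needed = b->len + n;
        size_t new_cap = needed + needed / 4;
        char *p = realloc(b->buf, new_cap);
        if (p == NULL) {
            return -1;
        }
        b->buf = p;
        b->cap = new_cap;
        b->reallocs++;
    }
    /* If the write fits, even exactly, no reallocation happens. */
    memcpy(b->buf + b->len, data, n);
    b->len += n;
    return 0;
}

static void builder_clear(builder_t *b)
{
    free(b->buf);
    builder_init(b);
}
```

With this shape the caller never toggles overallocation: a write that
exactly fills the remaining space leaves the buffer untouched, and only a
write that overflows it triggers a single, overallocating realloc().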

Regards

Antoine.

Antoine Pitrou