On Mon, 16 May 2022 11:13:56 +0200
Victor Stinner
Hi,
I propose adding a new C API to "build a Unicode string". What do you think? Would it be efficient with any possible Unicode string storage and any Python implementation?
PyPy has a UnicodeBuilder type in Python, but here I only propose a C API. Later, if needed, it would be easy to add a Python API for it. PyPy has UnicodeBuilder to replace the "str += str" pattern, which is inefficient in PyPy; CPython has a micro-optimization (in ceval.c) that keeps this pattern reasonably efficient. Adding a Python API was discussed in 2020; see the LWN article: https://lwn.net/Articles/816415/
Example without error handling, naive implementation which doesn't use known length of key and value strings (calling Preallocate may be more efficient):
---------------------------
// Format "key=value"
PyObject *format_with_builder(PyObject *key, PyObject *value)
{
    assert(PyUnicode_Check(key));
    assert(PyUnicode_Check(value));

    // Allocated on the stack
    PyUnicodeBuilder builder;
    PyUnicodeBuilder_Init(&builder);

    // Overallocation is more efficient if the final length is unknown
    PyUnicodeBuilder_EnableOverallocation(&builder);
    PyUnicodeBuilder_WriteStr(&builder, key);
    PyUnicodeBuilder_WriteChar(&builder, '=');

    // Disable overallocation before the last write
    PyUnicodeBuilder_DisableOverallocation(&builder);

Having to manually enable or disable overallocation doesn't sound right. Overallocation should be done *before* writing, not after. If there are N bytes remaining and you write N bytes, then no reallocation should occur.

Regards

Antoine.
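
To make the point about overallocating before the write concrete, here is a minimal sketch. It is not the proposed PyUnicodeBuilder API: buf_t, buf_init, buf_write and buf_free are hypothetical names, the buffer holds plain bytes rather than Unicode, and the 1.5x growth factor is an arbitrary assumption. The only behaviour it is meant to show is that the buffer grows (with headroom) only when the pending write does not fit, so writing N bytes into N remaining bytes never reallocates.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical byte buffer, for illustration only. */
typedef struct {
    char *data;
    size_t len;         /* bytes written so far */
    size_t cap;         /* bytes allocated */
    unsigned reallocs;  /* growth steps, counted for illustration */
} buf_t;

static void buf_init(buf_t *b)
{
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
    b->reallocs = 0;
}

/* Append n bytes. Returns 0 on success, -1 on overflow or allocation
   failure (the buffer is left unchanged in that case). */
static int buf_write(buf_t *b, const char *src, size_t n)
{
    if (n > b->cap - b->len) {
        /* The write does not fit: grow *before* writing, with headroom. */
        if (n > SIZE_MAX - b->len) {
            return -1;
        }
        size_t needed = b->len + n;
        size_t extra = needed / 2;  /* assumed growth policy: 1.5x */
        size_t new_cap = (extra > SIZE_MAX - needed) ? needed : needed + extra;
        char *p = realloc(b->data, new_cap);
        if (p == NULL) {
            return -1;
        }
        b->data = p;
        b->cap = new_cap;
        b->reallocs++;
    }
    /* Here n <= cap - len, so an exact fit needs no reallocation. */
    if (n > 0) {
        memcpy(b->data + b->len, src, n);
    }
    b->len += n;
    return 0;
}

static void buf_free(buf_t *b)
{
    free(b->data);
    buf_init(b);
}
```

With this policy the caller never toggles overallocation: the decision is made inside buf_write, based on whether the pending write fits in the remaining space. Any trimming of unused capacity could then happen once, when the final object is produced.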