Hi all, at the language summit many people told me that the HPy team should try to communicate more with the CPython developers, so let's try :).
In HPy we want to design an API to build bytes/str objects in two steps, to avoid the problem that currently in CPython they are not really immutable.
Before making any proposal, I spent quite a lot of time researching how the current APIs are used to construct bytes/str objects, and I summarized my results here: https://docs.hpyproject.org/en/latest/misc/str-builder-api.html I think that my survey could be interesting to people on this ML, independently of HPy.
That said, I also opened an issue where to discuss concrete proposals for the HPy API to do that: https://github.com/hpyproject/hpy/issues/214
I would be glad to receive comments and suggestions about that, and especially to know whether I missed some important use case in my analysis.
Also, if you think that this kind of mail is off-topic on this ML, please let me know and I'll stop.
Antonio
I'll answer with general concerns.
A dedicated builder API that allocates the unicode object at the end is really a good idea (PyUnicode_Join is really too slow for high-performance string building)
The builder itself should ideally be a stack variable (even if the allocated string payload is malloc'ed)
There could be separate builder types:
- UCS1, UCS2 and UCS4 builder types (for when you know the width upfront)
- a dynamic width builder type
builders should support presizing and/or reserving more data on the fly
builders should support variants of appending with or without implicit reallocation (the latter, for the case where the right size is fully preallocated)
I'm biased, but I suggest you look at Arrow's BufferBuilder API (C++, but it should be relatively easy to do a C equivalent): https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer_builder.h#L...
It has been serving us well.
Separately from the builder API, there are cases where the data already exists somewhere as a full-blown UTF8 string (this is of course more and more common, since UTF8 is ubiquitous). There should be a fast conversion method from a UTF8 memory area to a unicode object.
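(For reference, the existing CPython route from a UTF-8 memory area to a str object is PyUnicode_DecodeUTF8 / PyUnicode_FromStringAndSize, which always perform a full decoding pass over the buffer; a minimal sketch:)

#include <Python.h>

/* Convert a UTF-8 memory area to a str object with today's public API.
   CPython re-decodes the data even when it is already known-valid UTF-8. */
static PyObject *
str_from_utf8(const char *buf, Py_ssize_t len)
{
    /* strict UTF-8; returns NULL and sets an exception on invalid data */
    return PyUnicode_DecodeUTF8(buf, len, "strict");
    /* equivalent shortcut: PyUnicode_FromStringAndSize(buf, len) */
}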
Regards
Antoine.
On 10. 06. 21 17:04, Antoine Pitrou wrote:
Separately from the builder API, there are cases where the data already exists somewhere as a full-blown UTF8 string (this is of course more and more common, since UTF8 is ubiquitous). There should be a fast conversion method from a UTF8 memory area to a unicode object.
Indeed. To me it seems that the analysis misses the importance of UTF-8.
CPython strings currently always have the {1,2,4}-byte "raw buffer" representation. The UTF8 representation is computed when needed, and *permanently stored* in the str object. This detail leaks into the API: notice how PyUnicode_AsUTF8AndSize gives you a const char* (tied to the lifetime of the string object), while other codecs can only give you PyBytes.
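(A short example of that asymmetry, using the public CPython API:)

#include <Python.h>
#include <stdio.h>

/* The UTF-8 buffer returned by PyUnicode_AsUTF8AndSize() is cached inside
   the str object and stays valid as long as that object is alive -- no
   separate bytes object is created, unlike PyUnicode_AsEncodedString(). */
static int
print_utf8(PyObject *s)
{
    Py_ssize_t size;
    const char *utf8 = PyUnicode_AsUTF8AndSize(s, &size);
    if (utf8 == NULL)
        return -1;
    printf("%.*s\n", (int)size, utf8);
    return 0;
}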
I can imagine a future where it could also go the other way: a string could initially store only a UTF8 representation, and the {1,2,4}-byte one would only appear when needed (indexing/slicing, getting the length, etc.).
The analysis says that PyUnicode_FromStringAndSize is deprecated. Please check PEP 623 again: only the special case of calling it with NULL is deprecated. The function, which decodes from UTF-8, is very useful.
On 10. 06. 21 16:48, Antonio Cuni wrote:
Also, if you think that this kind of mail is off-topic on this ML, please let me know and I'll stop.
I think it's perfectly on topic. Thanks for sharing!
Hi,
CPython has two private APIs for that: _PyUnicodeWriter (str) and _PyBytesWriter (bytes/bytearray).
--
_PyUnicodeWriter is an append-only API to build a string. The API is designed for the CPython implementation of Unicode strings. The first call requires specifying the maximum code point. There is a micro-optimization for "%s" % "abc" and "{}".format("abc"): it stores the string "abc" as "read-only" and only creates a new buffer at the second append. Internally, it uses an overallocated Python str object.
For best performance, the overallocation can be disabled before the last append. Overallocation when it's not needed can have a significant impact on micro-benchmarks with short strings (1-100 characters): it can require an additional memory copy, since the result must be resized to the exact size.
I wrote this class to reduce the slowdown introduced by the initial implementation of PEP 393 (flexible string representation: ASCII, UCS1, UCS2 or UCS4), with the goal of being as fast as or faster than Python 2.
The overallocation factor depends on the platform:
#ifdef MS_WINDOWS
   /* On Windows, overallocate by 50% is the best factor */
#  define OVERALLOCATE_FACTOR 2
#else
   /* On Linux, overallocate by 25% is the best factor */
#  define OVERALLOCATE_FACTOR 4
#endif
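(For illustration, this is roughly how such a factor turns into a growth step when overallocation is enabled -- a simplified sketch, not the exact CPython code; it assumes Python.h and the OVERALLOCATE_FACTOR define above:)

/* Grow a requested size by 1/OVERALLOCATE_FACTOR (+50% on Windows, +25%
   elsewhere) to amortize the cost of future realloc() calls. */
static Py_ssize_t
overallocated_size(Py_ssize_t requested, int overallocate)
{
    Py_ssize_t allocated = requested;
    if (overallocate
        && allocated <= PY_SSIZE_T_MAX - allocated / OVERALLOCATE_FACTOR) {
        allocated += allocated / OVERALLOCATE_FACTOR;
    }
    return allocated;
}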
--
For bytes, there is _PyBytesWriter, which provides a "char *" pointer. It's basically a thin wrapper to control overallocation. In CPython, it internally uses a Python bytes (or bytearray) object, resized on demand, to avoid memory copies when returning the result object (rather than allocating a buffer with malloc(), creating a bytes object from it, and deleting the temporary buffer).
It uses a short buffer of 512 bytes allocated on the stack, to avoid a realloc() for short strings.
Overallocation can also be disabled before the last append.
Allocating the buffer on the stack requires exposing the buffer size and the structure as part of the ABI; I'm not sure whether that's possible in HPy, which tries to abstract implementation details.
--
See also the article that I wrote on it: https://vstinner.github.io/pybyteswriter.html
I kept a few microbenchmarks on str%args and str.format(): https://github.com/vstinner/pymicrobench
- bench_bytes_format_int.py
- bench_str_format.py
- bench_str_format_keywords.py
I spent time micro-optimizing the most common Python operations on strings using these APIs ;-)
--
/* The _PyBytesWriter structure is big: it contains an embedded "stack buffer".
   A _PyBytesWriter variable must be declared at the end of variables in a
   function to optimize the memory allocation on the stack. */
typedef struct {
    /* bytes, bytearray or NULL (when the small buffer is used) */
    PyObject *buffer;

    /* Number of allocated size. */
    Py_ssize_t allocated;

    /* Minimum number of allocated bytes,
       incremented by _PyBytesWriter_Prepare() */
    Py_ssize_t min_size;

    /* If non-zero, use a bytearray instead of a bytes object for buffer. */
    int use_bytearray;

    /* If non-zero, overallocate the buffer (default: 0).
       This flag must be zero if use_bytearray is non-zero. */
    int overallocate;

    /* Stack buffer */
    int use_small_buffer;
    char small_buffer[512];
} _PyBytesWriter;
/* Initialize a bytes writer

   By default, the overallocation is disabled. Set the overallocate
   attribute to control the allocation of the buffer. */
PyAPI_FUNC(void) _PyBytesWriter_Init(_PyBytesWriter *writer);

/* Get the buffer content and reset the writer.
   Return a bytes object, or a bytearray object if use_bytearray is non-zero.
   Raise an exception and return NULL on error. */
PyAPI_FUNC(PyObject *) _PyBytesWriter_Finish(_PyBytesWriter *writer, void *str);

/* Deallocate memory of a writer (clear its internal buffer). */
PyAPI_FUNC(void) _PyBytesWriter_Dealloc(_PyBytesWriter *writer);

/* Allocate the buffer to write size bytes.
   Return the pointer to the beginning of buffer data.
   Raise an exception and return NULL on error. */
PyAPI_FUNC(void*) _PyBytesWriter_Alloc(_PyBytesWriter *writer, Py_ssize_t size);

/* Ensure that the buffer is large enough to write *size* bytes.
   Add size to the writer minimum size (min_size attribute).

   str is the current pointer inside the buffer.
   Return the updated current pointer inside the buffer.
   Raise an exception and return NULL on error. */
PyAPI_FUNC(void*) _PyBytesWriter_Prepare(_PyBytesWriter *writer, void *str, Py_ssize_t size);

/* Resize the buffer to make it larger.
   The new buffer may be larger than size bytes because of overallocation.
   Return the updated current pointer inside the buffer.
   Raise an exception and return NULL on error.

   Note: size must be greater than the number of allocated bytes in the writer.

   This function doesn't use the writer minimum size (min_size attribute).

   See also _PyBytesWriter_Prepare(). */
PyAPI_FUNC(void*) _PyBytesWriter_Resize(_PyBytesWriter *writer, void *str, Py_ssize_t size);

/* Write bytes.
   Raise an exception and return NULL on error. */
PyAPI_FUNC(void*) _PyBytesWriter_WriteBytes(_PyBytesWriter *writer, void *str, const void *bytes, Py_ssize_t size);
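(For readers who have never used it, a typical call sequence looks roughly like this. This is a sketch based on the declarations above; the API is private, so details can differ between CPython versions:)

#include <Python.h>

/* Build b"abc" repeated n times without knowing upfront whether n is small or huge. */
static PyObject *
repeat_abc(Py_ssize_t n)
{
    char *str;                 /* current write position inside the buffer */
    _PyBytesWriter writer;     /* declared last: it embeds the 512-byte stack buffer */

    _PyBytesWriter_Init(&writer);
    writer.overallocate = 1;   /* the exact final size is not preallocated */

    str = _PyBytesWriter_Alloc(&writer, 3);   /* room for the first chunk */
    if (str == NULL)
        goto error;

    for (Py_ssize_t i = 0; i < n; i++) {
        /* WriteBytes grows the buffer as needed and returns the new position */
        str = _PyBytesWriter_WriteBytes(&writer, str, "abc", 3);
        if (str == NULL)
            goto error;
    }
    return _PyBytesWriter_Finish(&writer, str);

error:
    _PyBytesWriter_Dealloc(&writer);
    return NULL;
}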
/* --- _PyUnicodeWriter API ----------------------------------------------- */
typedef struct {
    PyObject *buffer;
    void *data;
    enum PyUnicode_Kind kind;
    Py_UCS4 maxchar;
    Py_ssize_t size;
    Py_ssize_t pos;

    /* minimum number of allocated characters (default: 0) */
    Py_ssize_t min_length;

    /* minimum character (default: 127, ASCII) */
    Py_UCS4 min_char;

    /* If non-zero, overallocate the buffer (default: 0). */
    unsigned char overallocate;

    /* If readonly is 1, buffer is a shared string (cannot be modified)
       and size is set to 0. */
    unsigned char readonly;
} _PyUnicodeWriter;
/* Initialize a Unicode writer.

   By default, the minimum buffer size is 0 character and overallocation is
   disabled. Set min_length, min_char and overallocate attributes to control
   the allocation of the buffer. */
PyAPI_FUNC(void) _PyUnicodeWriter_Init(_PyUnicodeWriter *writer);
/* Prepare the buffer to write 'length' characters with the specified maximum character.
Return 0 on success, raise an exception and return -1 on error. */
#define _PyUnicodeWriter_Prepare(WRITER, LENGTH, MAXCHAR)                 \
    (((MAXCHAR) <= (WRITER)->maxchar                                      \
      && (LENGTH) <= (WRITER)->size - (WRITER)->pos)                      \
     ? 0                                                                  \
     : (((LENGTH) == 0)                                                   \
        ? 0                                                               \
        : _PyUnicodeWriter_PrepareInternal((WRITER), (LENGTH), (MAXCHAR))))

/* Don't call this function directly, use the _PyUnicodeWriter_Prepare()
   macro instead. */
PyAPI_FUNC(int) _PyUnicodeWriter_PrepareInternal(_PyUnicodeWriter *writer,
                                                 Py_ssize_t length,
                                                 Py_UCS4 maxchar);
/* Prepare the buffer to have at least the kind KIND. For example, kind=PyUnicode_2BYTE_KIND ensures that the writer will support characters in range U+000-U+FFFF.
Return 0 on success, raise an exception and return -1 on error. */
#define _PyUnicodeWriter_PrepareKind(WRITER, KIND)                        \
    (assert((KIND) != PyUnicode_WCHAR_KIND),                              \
     (KIND) <= (WRITER)->kind                                             \
     ? 0                                                                  \
     : _PyUnicodeWriter_PrepareKindInternal((WRITER), (KIND)))

/* Don't call this function directly, use the _PyUnicodeWriter_PrepareKind()
   macro instead. */
PyAPI_FUNC(int) _PyUnicodeWriter_PrepareKindInternal(_PyUnicodeWriter *writer,
                                                     enum PyUnicode_Kind kind);
/* Append a Unicode character.
   Return 0 on success, raise an exception and return -1 on error. */
PyAPI_FUNC(int) _PyUnicodeWriter_WriteChar(_PyUnicodeWriter *writer,
                                           Py_UCS4 ch);

/* Append a Unicode string.
   Return 0 on success, raise an exception and return -1 on error. */
PyAPI_FUNC(int) _PyUnicodeWriter_WriteStr(_PyUnicodeWriter *writer,
                                          PyObject *str      /* Unicode string */
                                          );

/* Append a substring of a Unicode string.
   Return 0 on success, raise an exception and return -1 on error. */
PyAPI_FUNC(int) _PyUnicodeWriter_WriteSubstring(_PyUnicodeWriter *writer,
                                                PyObject *str, /* Unicode string */
                                                Py_ssize_t start,
                                                Py_ssize_t end);

/* Append an ASCII-encoded byte string.
   Return 0 on success, raise an exception and return -1 on error. */
PyAPI_FUNC(int) _PyUnicodeWriter_WriteASCIIString(_PyUnicodeWriter *writer,
                                                  const char *str, /* ASCII-encoded byte string */
                                                  Py_ssize_t len   /* number of bytes, or -1 if unknown */
                                                  );

/* Append a latin1-encoded byte string.
   Return 0 on success, raise an exception and return -1 on error. */
PyAPI_FUNC(int) _PyUnicodeWriter_WriteLatin1String(_PyUnicodeWriter *writer,
                                                   const char *str, /* latin1-encoded byte string */
                                                   Py_ssize_t len   /* length in bytes */
                                                   );

/* Get the value of the writer as a Unicode string.
   Clear the buffer of the writer.
   Raise an exception and return NULL on error. */
PyAPI_FUNC(PyObject *) _PyUnicodeWriter_Finish(_PyUnicodeWriter *writer);

/* Deallocate memory of a writer (clear its internal buffer). */
PyAPI_FUNC(void) _PyUnicodeWriter_Dealloc(_PyUnicodeWriter *writer);

/* Format the object based on the format_spec, as defined in PEP 3101
   (Advanced String Formatting). */
PyAPI_FUNC(int) _PyUnicode_FormatAdvancedWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    PyObject *format_spec,
    Py_ssize_t start,
    Py_ssize_t end);
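(And a sketch of typical _PyUnicodeWriter usage, based on the declarations above -- again a private API, so take it as illustration only:)

#include <Python.h>

/* Build a str consisting of n copies of "xy" followed by an existing str. */
static PyObject *
build_example(Py_ssize_t n, PyObject *suffix)
{
    _PyUnicodeWriter writer;

    _PyUnicodeWriter_Init(&writer);
    writer.overallocate = 1;     /* more appends will follow */
    writer.min_length = 2 * n;   /* pre-size the part whose length is known */

    for (Py_ssize_t i = 0; i < n; i++) {
        if (_PyUnicodeWriter_WriteASCIIString(&writer, "xy", 2) < 0)
            goto error;
    }
    /* last append: the writer can still widen the buffer if suffix is non-ASCII */
    if (_PyUnicodeWriter_WriteStr(&writer, suffix) < 0)
        goto error;

    return _PyUnicodeWriter_Finish(&writer);

error:
    _PyUnicodeWriter_Dealloc(&writer);
    return NULL;
}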
Victor
On Thu, Jun 10, 2021 at 5:05 PM Antoine Pitrou <antoine@python.org> wrote:
I'm biased, but I suggest you look at Arrow's BufferBuilder API (C++, but it should be relatively easy to do a C equivalent): https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer_builder.h#L...
I really like that Arrow's BufferBuilder exposes functions for manually resizing the buffer in a variety of ways. It feels like it lets users make string building more efficient when they have extra information about when allocations should happen, while still giving something reasonable out of the box when they don't.
On 10.06.2021 16:48, Antonio Cuni wrote:
As Antoine mentioned, we are missing a way to construct bytes and unicode objects without knowing the final size in advance.
Internally, there's an API to resize both objects after pre-allocating them to a guessed final size, but this is not exposed in the public API yet; it is essential for any application that works in streaming mode or for writing codecs.
The suggested builder model is also used internally in CPython already, but it would be good to not provide direct access to the underlying buffer and instead have the functions handle the writing in a more abstract way.
Regarding the UTF-8 case: this is becoming more and more important as the world standardizes on this encoding. It is quite possible that we'll see another rewrite of the CPython Unicode implementation to use UTF-8 as the internal encoding, together with a dynamically built index for fast direct code point access.
The current implementation is way too complex IMO, with too much code duplication, too many edge cases and often inefficient storage patterns (e.g. you have to go for UCS4 storage even if your 1 MB mostly-ASCII string only has a few non-BMP code points used for emojis).
-- Marc-Andre Lemburg eGenix.com
Hi all, thank you for your suggestions and remarks, it's a lot of useful info. I thought of replying to each email but it would become a mess of subthreads, so let me try to summarize here what I got from the answers so far.
From the API design point of view, I think that we can distinguish a few main use cases that we want to support:
- fixed size vs growable
- in case of growable: automatic/implicit growth vs user-controlled growth. Arrow's BufferBuilder API looks like a very good starting point to me.
- known vs unknown in-memory format (e.g. UCS*, UTF-8, etc.)
- direct access to the buffer vs API-only write
Not all combinations of the above options make sense. E.g., "direct access to the buffer" makes sense only if the in-memory format is known. Still, there are probably too many possible combinations to be able to support all of them in the most efficient manner.
I think a reasonable compromise could be the following:
- have an API for the fixed-size, direct-buffer, known-format case: this would be more or less the equivalent of the current PyUnicode_New + PyUnicode_{1,2,4}BYTE_DATA.
- have an API for a growable, opaque builder, where the implementation can choose whatever in-memory format fits best.
The most notable missing combination is "fixed-size, opaque builder", but I think that a smart implementation of (2) could cover this use case with very low/almost zero overhead, especially if we make it possible to preallocate the builder to a specific size.
For HPy, the most urgent thing to define is the fixed-size+direct buffer API described in (1): this is needed to give existing extensions an easy and efficient way to migrate their code based on PyUnicode_New. The growing-buffer API is less urgent and can be discussed/designed/implemented as a 2nd step, especially if CPython is thinking of adding such a system by itself.
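(To make the two options concrete, here is a purely hypothetical sketch of how they might be spelled -- all names and signatures below are made up for illustration, nothing is decided:)

/* (1) fixed-size, direct-buffer, known-format: the caller asks for a buffer
       of a given kind and exact size, fills it, then builds the object. */
uint8_t *buf;
HPyUnicodeBuilder b = HPyUnicodeBuilder_UCS1(ctx, size, &buf);
/* ... write exactly `size` UCS1 code points into buf ... */
HPy h = HPyUnicodeBuilder_Build(ctx, b);

/* (2) growable, opaque builder: no direct buffer access; the implementation
       is free to pick whatever in-memory format it prefers. */
HPyStrBuilder sb = HPyStrBuilder_New(ctx, size_hint /* may be 0 */);
HPyStrBuilder_AppendUTF8(ctx, sb, data, len);
HPyStrBuilder_AppendChar(ctx, sb, 0x1F600);
HPy h2 = HPyStrBuilder_Build(ctx, sb);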
Implementation-wise, the idea of allocating the builder itself and/or a small buffer directly on the stack (like _PyBytesWriter) sounds good. However, in the case of HPy it is not completely obvious how to make it possible without compromising the ability of implementations to do something different: allocating the builder and/or a small buffer on the stack means that the struct must be known to the compiler, can't be fully opaque and becomes part of the ABI, which is sub-optimal (e.g., depending on the GC it might be more efficient to allocate the buffer on the heap instead of the stack, and each implementation should be able to make its own choice here).
Maybe a good compromise could be something like what is described here: https://nullprogram.com/blog/2019/10/28/ TL;DR: the ABI exposes a function to get the size of the builder, and then you can use alloca to allocate it on the stack, while still keeping it fully opaque (all of this possibly wrapped by some macro to make the API nicer to use). It sounds like something worth experimenting with.
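(Concretely, the pattern from that post applied here could look something like this -- hypothetical names, only to illustrate the idea:)

/* The ABI exposes only the size of the otherwise-opaque builder struct... */
size_t HPyUnicodeBuilder_SizeOf(HPyContext *ctx);

/* ...so a macro can still place it on the C stack without the compiler ever
   seeing its layout (alloca() comes from <alloca.h> on most Unixes): */
#define HPyUnicodeBuilder_STACK_ALLOC(ctx) \
    ((HPyUnicodeBuilder *)alloca(HPyUnicodeBuilder_SizeOf(ctx)))

/* usage inside a function: */
HPyUnicodeBuilder *b = HPyUnicodeBuilder_STACK_ALLOC(ctx);
HPyUnicodeBuilder_Init(ctx, b, size);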
Finally, some in-line replies about UTF-8.
(replying to Antoine)
Separately from the builder API, there are cases where the data already exists somewhere as a full-blown UTF8 string (this is of course more and more common, since UTF8 is ubiquitous). There should be a fast conversion method from a UTF8 memory area to a unicode object.
If you already have your UTF8 string in a buffer, you can already convert it using PyUnicode_FromString, PyUnicode_FromStringAndSize and PyUnicode_DecodeUTF8. In CPython this requires an expensive decoding pass because of the way PyUnicode is represented internally, but e.g. PyPy, which uses UTF8 internally, could implement them with a fast validity check + memcpy().
I think that what the current API is missing is a way to avoid the temporary buffer+memcpy() in case you are read()ing UTF8 data from disk or socket. For HPy, I am thinking of adding another "kind" to HPyUnicodeBuilder in addition to UCS{1,2,4}. E.g., something like this:
HPyUnicodeBuilder_ASCII(ctx, size, &buf);
HPyUnicodeBuilder_UCS1(ctx, size, &buf);
HPyUnicodeBuilder_UCS2(ctx, size, &buf);
HPyUnicodeBuilder_UCS4(ctx, size, &buf);
HPyUnicodeBuilder_UTF8(ctx, size, &buf);
On CPython, HPyUnicodeBuilder_UTF8 could be implemented with a temporary buffer + PyUnicode_FromStringAndSize or similar, but on PyPy we can implement it more efficiently without copying. Similarly, HPyUnicodeBuilder_UCS{1,2,4} can be implemented efficiently on CPython but will require a copy/encoding on PyPy.
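(For example, reading from a file descriptor could then look roughly like this, with the same hypothetical spelling as above; the Build step is made up too:)

char *buf;
HPyUnicodeBuilder b = HPyUnicodeBuilder_UTF8(ctx, size, &buf);
ssize_t nread = read(fd, buf, size);        /* decode target filled in place, no temporary copy */
HPy s = HPyUnicodeBuilder_Build(ctx, b, nread);  /* hypothetical finish step: validate the UTF-8, build the str */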
(replying to Petr)
CPython strings currently always have the {1,2,4}-byte "raw buffer" representation. The UTF8 representation is computed when needed, and *permanently stored* in the str object. This detail leaks to the API: notice how PyUnicode_AsUTF8AndSize gives you a const char* (tied to the lifetime of the string object), while other codecs can only give you PyBytes.
Thank you for pointing this out, I hadn't noticed it so far. Fortunately, this part of the problem is "automatically" solved in HPy by design: the returned buffers are always const (if you want to mutate them, you need a builder) and their lifetime is tied to the handle, not to the object. E.g., PyPy implements HPyBytes_AsString by pinning the underlying buffer, and unpinning it upon HPy_Close.
A related open question is what to do with the hypothetical HPyUnicode_As{1,2,4}ByteData (which would be functions, not macros, of course). E.g. in the PyPy case we don't have the UCS* representation ready, so we would need to decode the string on the fly.
participants (6)
- Antoine Pitrou
- Antonio Cuni
- Marc-Andre Lemburg
- Petr Viktorin
- Simon Cross
- Victor Stinner