Question about the current implementation of str
I have a straightforward question about the str object, specifically the PyUnicodeObject. I've tried reading the source to answer the question myself but it's nearly impenetrable. So I was hoping someone here who understands the current implementation could answer it for me.
Although the str object is immutable from Python's perspective, the C object itself is mutable. For example, for dynamically-created strings the hash field may be lazy-computed and cached inside the object. I was wondering if there were other fields like this. For example, are there similar lazy-computed cached objects for the different encoded versions (utf8 utf16) of the str?
What would really help is an exhaustive list of the fields of a str object that may ever change after the object's initial creation.
Thanks! We now return you to the debate about the pathlib module,
//arry/
On 9 April 2016 at 10:56, Larry Hastings
I have a straightforward question about the str object, specifically the PyUnicodeObject. I've tried reading the source to answer the question myself but it's nearly impenetrable. So I was hoping someone here who understands the current implementation could answer it for me.
Although the str object is immutable from Python's perspective, the C object itself is mutable. For example, for dynamically-created strings the hash field may be lazy-computed and cached inside the object. I was wondering if there were other fields like this. For example, are there similar lazy-computed cached objects for the different encoded versions (utf8 utf16) of the str? What would really help is an exhaustive list of the fields of a str object that may ever change after the object's initial creation.
https://www.python.org/dev/peps/pep-0393/#specification should have most of the relevant details.
Aside from the hash and the interned-or-not flag in the state, most things should be locked once the string is ready, except that generating the utf-8 and wchar_t forms is deferred until they're needed if they're not the same as the canonical form.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
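The lazily filled hash field can be observed from Python with a little ctypes. This is only a sketch: it assumes CPython, where id() is the object's address, and it assumes the PEP 393 layout in which the object header is followed by the length and then the hash. The helper name read_cached_hash is made up for illustration, and it returns None if the layout check does not hold.

```python
import ctypes
import sys


def read_cached_hash(s):
    """Return the raw hash field of a str, or None if the layout check fails.

    ASSUMPTION: CPython, with the struct laid out as
    <object header> <Py_ssize_t length> <Py_hash_t hash> ...
    """
    if sys.implementation.name != "cpython" or type(s) is not str:
        return None
    header = object.__basicsize__                # size of the object header on this build
    word = ctypes.sizeof(ctypes.c_ssize_t)
    length = ctypes.c_ssize_t.from_address(id(s) + header).value
    if length != len(s):                         # sanity check on the assumed layout
        return None
    return ctypes.c_ssize_t.from_address(id(s) + header + word).value


def demo():
    s = "".join(["lazy", "-", "hash"])           # built at runtime, not yet hashed
    before = read_cached_hash(s)                 # field as left by string creation
    h = hash(s)                                  # computes the hash and stores it
    after = read_cached_hash(s)                  # field now holds the cached value
    return before, h, after


if __name__ == "__main__":
    print(demo())
```

On a build where the assumptions hold, the first value is the "not computed yet" marker (-1) and the last one equals hash(s).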
On 9 Apr 2016 03:04, "Larry Hastings" wrote:
Although the str object is immutable from Python's perspective, the C object itself is mutable. For example, for dynamically-created strings the hash field may be lazy-computed and cached inside the object.
Yes, the hash is computed once on demand. It doesn't matter how you build the string.
I was wondering if there were other fields like this. For example, are there similar lazy-computed cached objects for the different encoded versions (utf8 utf16) of the str?
Cached utf8 is only cached when you call the C functions filling this cache. The Python str.encode('utf8') doesn't fill the cache, but it uses it.
On Windows, there is a cache for wchar_t* which is utf16. This format is used by all C functions of the Windows API (Python should only use the Unicode flavor of the Windows API).
I don't recall other caches.
What would really help is an exhaustive list of the fields of a str object that may ever change after the object's initial creation.
I don't recall exactly what happens if a cache is created and then the string is modified. If I recall correctly, the cache is invalidated.
But the hash is used as a heuristic to decide if a string is "immutable" or not; the refcount is also used by the heuristic. If the string is immutable, an operation like resize must create a new string.
You can document the PEP 393 in Include/unicodeobject.h.
Victor
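The difference between str.encode('utf8') and the C functions that fill the utf8 cache can be made visible from Python. This is a sketch under stated assumptions: CPython only, ctypes.pythonapi is available, and sys.getsizeof() counts the cached utf8 buffer once it exists (an implementation detail, not a documented guarantee). The function name utf8_cache_demo is made up for illustration.

```python
import ctypes
import sys


def utf8_cache_demo(prefix):
    # Build a fresh non-ASCII string so its utf8 form differs from the canonical data.
    s = "".join([prefix, "\u00e9"])
    base = sys.getsizeof(s)

    encoded = s.encode("utf8")                   # Python-level encode
    after_encode = sys.getsizeof(s)

    # C-level accessor that returns a pointer to the cached utf8 buffer.
    as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
    as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
    as_utf8.restype = ctypes.c_char_p
    raw = as_utf8(s, None)                       # NULL size pointer is allowed
    after_c_call = sys.getsizeof(s)

    return base, after_encode, after_c_call, encoded, raw


if __name__ == "__main__":
    print(utf8_cache_demo("cafe"))
```

If the assumptions hold, the size is unchanged after encode() and grows after the C call, which matches the description above: encode() uses the cache but does not create it.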
On 09.04.16 10:52, Victor Stinner wrote:
On 9 Apr 2016 03:04, "Larry Hastings" <larry@hastings.org> wrote:
Although the str object is immutable from Python's perspective, the C object itself is mutable. For example, for dynamically-created strings the hash field may be lazy-computed and cached inside the object.
Yes, the hash is computed once on demand. It doesn't matter how you build the string.
I was wondering if there were other fields like this. For example, are there similar lazy-computed cached objects for the different encoded versions (utf8 utf16) of the str?
Cached utf8 is only cached when you call the C functions filling this cache. The Python str.encode('utf8') doesn't fill the cache, but it uses it.
On Windows, there is a cache for wchar_t* which is utf16. This format is used by all C functions of the Windows API (Python should only use the Unicode flavor of the Windows API).
I don't recall other caches.
What would really help is an exhaustive list of the fields of a str object that may ever change after the object's initial creation.
I don't recall exactly what happens if a cache is created and then the string is modified. If I recall correctly, the cache is invalidated.
You must remember, some bugs with desynchronized utf8 and wchar_t* caches were fixed just a few months ago.
But the hash is used as a heuristic to decide if a string is "immutable" or not; the refcount is also used by the heuristic. If the string is immutable, an operation like resize must create a new string.
You can document the PEP 393 in Include/unicodeobject.h.
In the normal case the string object can be mutated only at creation time. But CPython uses some tricks that modify already created strings if they have no external references and are not interned. For example "a += b" or "a = a + b" can resize the "a" string.
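A rough way to watch the "a += b" trick from Python is to track id() while appending. This is only a sketch and assumes CPython, where id() is the object's address. An unchanged id after "+=" suggests the string was resized in place; a changed id proves nothing either way, because a resize is allowed to move the block. The function name count_address_changes is made up for illustration, and the count varies between versions and allocators.

```python
def count_address_changes(n):
    """Append one character n times and count how often id(s) changes."""
    s = ""
    changes = 0
    previous = id(s)
    for _ in range(n):
        s += "x"                 # candidate for an in-place resize
        current = id(s)
        if current != previous:
            changes += 1
        previous = current       # only the integer is kept, not a reference to s
    return s, changes


if __name__ == "__main__":
    text, changes = count_address_changes(1000)
    print(len(text), changes)
```

Keeping a second reference to the string inside the loop (for example by appending it to a list) takes away the "no external references" condition, so the count is then expected to go up.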
2016-04-09 9:52 GMT+02:00 Victor Stinner
But the hash is used as a heuristic to decide if a string is "immutable" or not; the refcount is also used by the heuristic. If the string is immutable, an operation like resize must create a new string.
I'm talking about this private function:

    static int
    unicode_modifiable(PyObject *unicode)
    {
        assert(_PyUnicode_CHECK(unicode));
        if (Py_REFCNT(unicode) != 1)
            return 0;
        if (_PyUnicode_HASH(unicode) != -1)
            return 0;
        if (PyUnicode_CHECK_INTERNED(unicode))
            return 0;
        if (!PyUnicode_CheckExact(unicode))
            return 0;
    #ifdef Py_DEBUG
        /* singleton refcount is greater than 1 */
        assert(!unicode_is_singleton(unicode));
    #endif
        return 1;
    }

Victor
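The four checks in that function can be restated as a small Python model. This is a toy and not CPython code: the StrState class and its field names are invented for illustration and simply stand in for the refcount, the cached hash, the interned flag and the exact-type check read by the C function.

```python
from dataclasses import dataclass


@dataclass
class StrState:
    """Hypothetical snapshot of the fields the heuristic looks at."""
    refcount: int
    cached_hash: int = -1        # -1 stands for "hash not computed yet"
    interned: bool = False
    exact_str: bool = True       # False for instances of str subclasses


def is_modifiable(state):
    """Mirror the order of the checks in the C function above."""
    if state.refcount != 1:      # someone else can see the string
        return False
    if state.cached_hash != -1:  # hash already handed out, e.g. as a dict key
        return False
    if state.interned:           # shared through the interned-strings table
        return False
    if not state.exact_str:      # subclasses may carry extra state
        return False
    return True


if __name__ == "__main__":
    print(is_modifiable(StrState(refcount=1)))
    print(is_modifiable(StrState(refcount=1, cached_hash=12345)))
```

In this model a freshly built, unshared, unhashed string is the only case that counts as modifiable, which is why a computed hash doubles as an "immutable from now on" marker.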
participants (4)
- Larry Hastings
- Nick Coghlan
- Serhiy Storchaka
- Victor Stinner