Interning string subtype instances
I propose modifying PyString_InternInPlace to better cope with string subtype instances. Current implementation of PyString_InternInPlace does nothing and returns if passed an instance of a subtype of PyString_Type. This is a problem for applications that need to support string subtypes, but also must intern the strings for faster equivalence testing. Such an application, upon receiving a string subtype, will silently fail to work. There is good reason for PyString_InternInPlace not accepting string subtypes: since a subtype can have modified behavior, interning it can cause problems for other users of the interned string. I agree with the reasoning, but propose a different solution: when interning an instance of a string subtype, PyString_InternInPlace could simply intern a copy. This should be a fully backward compatible change because: 1) code that passes PyString instances (99.99% cases) will work as before, and 2) code that passes something else silently failed to intern the string anyway. Speed should be exactly the same as before, with the added benefit that interning PyString subtype instances now does something, but without the problems that interning the actual instance can produce. The patch could look like this. If there is interest in this, I can produce a complete patch. @@ -5,10 +5,6 @@ PyObject *t; if (s == NULL || !PyString_Check(s)) Py_FatalError("PyString_InternInPlace: strings only please!"); - /* If it's a string subclass, we don't really know what putting - it in the interned dict might do. */ - if (!PyString_CheckExact(s)) - return; if (PyString_CHECK_INTERNED(s)) return; if (interned == NULL) { @@ -25,6 +21,18 @@ *p = t; return; } + /* Make sure we don't intern a string subclass, since we don't + really know what putting it in the interned dict might do. */ + if (!PyString_CheckExact(s)) { + PyObject *copy; + copy = PyString_FromStringAndSize(PyString_AS_STRING(*p), + PyString_GET_SIZE(*p)); + if (!copy) + return; + Py_DECREF(*p); + *p = copy; + s = (PyStringObject *) copy; + } if (PyDict_SetItem(interned, (PyObject *)s, (PyObject *)s) < 0) { PyErr_Clear();
On 2/12/07, Hrvoje Nikšić <hrvoje.niksic@avl.com> wrote:
cause problems for other users of the interned string. I agree with the reasoning, but propose a different solution: when interning an instance of a string subtype, PyString_InternInPlace could simply intern a copy.
Interning currently requires an external reference to prevent garbage collection (I believe). What will hold a reference to the string copy? -Mike
Mike Klaas schrieb:
cause problems for other users of the interned string. I agree with the reasoning, but propose a different solution: when interning an instance of a string subtype, PyString_InternInPlace could simply intern a copy.
Interning currently requires an external reference to prevent garbage collection (I believe). What will hold a reference to the string copy?
For PyString_InternInPlace, the caller will receive a reference; the original object passed in is decref'ed. A typical caller of PyString_InternInPlace will just store the reference in some global/static variable. builtin_intern will return it to the caller, which then needs to store it. This was always the case with intern(); the usage pattern is foo = intern(foo) as intern may return a different object (e.g. if the string was already interned). Regards, Martin
Hrvoje Nikic <hrvoje.niksic@avl.com> wrote:
I propose modifying PyString_InternInPlace to better cope with string subtype instances.
Any particular reason why the following won't work for you? _interned = {} def intern(st): #add optional type checking here if st not in _interned: _interned[st] = st return _interned[st] If I remember the implementation of intern correctly, that's more or less what happens under the covers. - Josiah
Josiah Carlson wrote:
def intern(st): ...
If I remember the implementation of intern correctly, that's more or less what happens under the covers.
That doesn't quite give you everything that real interning does, though. The string comparison method knows when both strings are interned, so it can compare them quickly whether they are equal or not. Your version could detect equal strings quickly, but not unequal strings. -- Greg
Greg Ewing schrieb:
That doesn't quite give you everything that real interning does, though. The string comparison method knows when both strings are interned, so it can compare them quickly whether they are equal or not.
No, it doesn't - see stringobject.c:string_richcompare. If they are identical, they are quickly recognized as equal, and as not not-equal. Otherwise, if eq comparison is requested, it compares the size, then the first characters, then does memcmp - so if they are unequal, it doesn't help that they are both interned. If comparison is for notequal, no special casing is performed at all (except that knowing that identical strings aren't notequal). The entire algorithm and optimization works just as fine for a user-defined interning dictionary (assuming that all relevant strings get into it). Regards, Martin
Martin v. Löwis wrote:
Greg Ewing schrieb:
The string comparison method knows when both strings are interned
No, it doesn't - see stringobject.c:string_richcompare.
Well, I'm surprised and confused. It's certainly possible to tell very easily whether a string is interned -- there's a PyString_CHECK_INTERNED macro that tests a field in the string object header. But, I can't find anywhere that it's used in the core, apart from the interning code itself and the string alloc/dealloc code. Can anyone shed any light on this? It seems to me that by not using this information, only half the benefit of interning is being achieved. -- Greg
Greg Ewing schrieb:
It's certainly possible to tell very easily whether a string is interned -- there's a PyString_CHECK_INTERNED macro that tests a field in the string object header. But, I can't find anywhere that it's used in the core, apart from the interning code itself and the string alloc/dealloc code.
Can anyone shed any light on this? It seems to me that by not using this information, only half the benefit of interning is being achieved.
You seem to assume that checking this property would improve performance. I find that questionable; as always with performance and optimization, one cannot know until it has been benchmarked. My guess is that - the cases you want to cover are comparably infrequent (compared to the total number of richcompare invocations) - checking the additional field always may cause a performance loss, not a performance gain, because of the additional tests. If you want to sure, you need to make some measurements (e.g: what is the percentage of EQ and NE invocations, in how many cases are both strings interned, and in what percentage of these cases are they not the same string). Regards, Martin
Can anyone shed any light on this? It seems to me that by not using this information, only half the benefit of interning is being achieved. If I understand your question correctly, you're saying "why doesn't string comparison take advantage of interned strings?" If so, the answer is "it does". Examine string_richcompare() in stringobject.c, and PyUnicode_compare() in unicodeobject.c. Both functions compare the
Greg Ewing wrote: pointers of the two objects in case they are literally the same object, and if so skip all the expensive comparison code. Interned strings which have the same value are likely to be (are guaranteed to be?) the same object. And there you are. If that's not the question you are asking, I'll be quiet, 'cause it seems like other people understand what you're talking about. Cheers, /larry/
Larry Hastings schrieb:
If I understand your question correctly, you're saying "why doesn't string comparison take advantage of interned strings?" If so, the answer is "it does". Examine string_richcompare() in stringobject.c, and PyUnicode_compare() in unicodeobject.c. Both functions compare the pointers of the two objects in case they are literally the same object, and if so skip all the expensive comparison code. Interned strings which have the same value are likely to be (are guaranteed to be?) the same object. And there you are.
If that's not the question you are asking, I'll be quiet, 'cause it seems like other people understand what you're talking about.
The question was similar: "Why doesn't it take the maximum advantage?" Interned strings that are equal are guaranteed to be identical. So if strings are not identical and both interned, it could infer that they can't be equal, but currently doesn't infer so (instead, it compares length and contents in this case). Regards, Martin
Larry Hastings wrote:
If I understand your question correctly, you're saying "why doesn't string comparison take advantage of interned strings?"
No, I understand that it takes advantage of it when the strings are equal. I was wondering about the case where they're not equal. But as has been pointed out, unless you're very unlucky, you find out pretty fast when they're not equal. So I'm happy now. -- Greg
Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Josiah Carlson wrote:
def intern(st): ...
If I remember the implementation of intern correctly, that's more or less what happens under the covers.
That doesn't quite give you everything that real interning does, though. The string comparison method knows when both strings are interned, so it can compare them quickly whether they are equal or not. Your version could detect equal strings quickly, but not unequal strings.
Assuming that dictionaries and the hash algorithm for strings is not hopelessly broken, I believe that one discovers quite quickly when two strings are not equal. Simple testing seems to confirm this, but I didn't work too hard to disprove it either. - Josiah
Josiah Carlson wrote:
Assuming that dictionaries and the hash algorithm for strings is not hopelessly broken, I believe that one discovers quite quickly when two strings are not equal.
You're probably right, since if there's a hash collision, the colliding strings probably differ in the first char or two. Interning will speed up the case where they are equal, where otherwise you would have to examine every character. Still, there could be extension modules out there somewhere that make use of PyString_CHECK_INTERNED, so there's a slight chance that home-grown interning might not give the same results in all cases. -- Greg
Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Josiah Carlson wrote:
Assuming that dictionaries and the hash algorithm for strings is not hopelessly broken, I believe that one discovers quite quickly when two strings are not equal.
You're probably right, since if there's a hash collision, the colliding strings probably differ in the first char or two.
Interning will speed up the case where they are equal, where otherwise you would have to examine every character.
Except that the scanning of the data needs to happen at some point anyways. You first allocate the new string, copy the data into it, compare the content against something in the interned dictionary, and if there is a match, deallocate the new object and return the interned object, otherwise return the new object. This is also the case when one uses the "just give me a string, I'll fill in the data myself" for more or less user-implemented things like ''.join(lst).
Still, there could be extension modules out there somewhere that make use of PyString_CHECK_INTERNED, so there's a slight chance that home-grown interning might not give the same results in all cases.
Perhaps. I just never intern. I've found that I'm much happier assuming that the Python runtime will do good enough. - Josiah
On Mon, 2007-02-12 at 12:29 -0800, Josiah Carlson wrote:
Hrvoje Nikšic <hrvoje.niksic@avl.com> wrote:
I propose modifying PyString_InternInPlace to better cope with string subtype instances.
Any particular reason why the following won't work for you? [... snipped a simple intern implementation ...]
Because Python internals don't use that; they call PyString_InternInPlace directly. Another reason is that Python's interning mechanism is much better than such a simple implementation: it stores the interned state directly in the PyString_Object structure, so you can find out that a string is already interned without looking it up in the dictionary. This information can (and is) used by both Python core and by C extensions. Another advantage is that, as of recently, interned strings can be garbage collected, which is typically not true of simple replacements (although it could probably be emulated by using weak references, it's not trivial.)
Hrvoje Nikšić schrieb:
Another reason is that Python's interning mechanism is much better than such a simple implementation: it stores the interned state directly in the PyString_Object structure, so you can find out that a string is already interned without looking it up in the dictionary. This information can (and is) used by both Python core and by C extensions. Another advantage is that, as of recently, interned strings can be garbage collected, which is typically not true of simple replacements (although it could probably be emulated by using weak references, it's not trivial.)
OTOH, in an application that needs unique strings, you normally know what the scope is (i.e. where the strings come from, and when they aren't used anymore). For example, in XML parsing, pyexpat supports an interning dictionary. It puts all element and attribute names into (but not element content, which typically isn't likely to be repeated). It starts with a fresh dictionary before parsing starts, and releases the dictionary when parsing is done. Regards, Martin
Martin v. Löwis wrote:
OTOH, in an application that needs unique strings, you normally know what the scope is (i.e. where the strings come from, and when they aren't used anymore).
That's true -- if you know that all relevant strings have been interned using the appropriate method, you can just use 'is' to compare them. -- Greg
On Wed, 2007-02-07 at 15:39 +0100, Hrvoje Nikšić wrote:
The patch could look like this. If there is interest in this, I can produce a complete patch.
The complete patch is now available on SourceForge: http://sourceforge.net/tracker/index.php?func=detail&aid=1658799&group_id=5470&atid=305470
participants (6)
-
"Martin v. Löwis" -
Greg Ewing -
Hrvoje Nikšić -
Josiah Carlson -
Larry Hastings -
Mike Klaas