Internals of interning strings
jstok at bluedog.apana.org.au
Fri Mar 24 05:36:42 CET 2000
If I do the following:
>>> a = "A completely new string that I haven't used before"
>>> b = a
>>> c = intern(a)
>>> d = "A completely new string that I haven't used before"
>>> e = intern(d)
>>> a is b
>>> b is c
>>> a is e
>>> e is d
From reading the sources, I know the interpreter does the following:
Internally, "intern" calls PyString_InternInPlace(PyObject** p), where p is an
in/out parameter that is set to point to the interned string. In the
first call to "intern", the string referred to by the name "a" hasn't been
interned before, so it's placed in the "interned" dictionary and itself
returned as the result of "intern". The string object referred to by "a"
has, internally, an ob_sinterned field. This is set to point to itself,
indicating that it is an interned value.
On the next call to intern, the string referred to by the name "d" has the
same value as that referred to by "a". When we go to intern it, we find the
value of the string referred to by d is already present in the dictionary.
The string referred to by a is returned as the result of the function.
Also, the interpreter sets the internal field "ob_sinterned" of the object
referred to by d to *also* point to a. Now, anywhere the object referred to
by d is used, certain operations can be slightly optimised. If you invoke
intern on the object referred to by d again, the PyString_InternInPlace
routine sees that its "ob_sinterned" field already points to an object, and
returns that, instead of looking it up in the dictionary again. And if "d"
is hashed, the hash function returns the cached hash value of the object
currently pointed to by "a".
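To make the description above concrete, here is a rough Python model of the logic as I read it. It is only a sketch, not the actual C source: the Str class, the _interned dictionary, the cached_hash attribute and the function names are all made up for illustration, and None stands in for a NULL pointer.

```python
class Str:
    """Toy stand-in for the C string object (hypothetical, for illustration)."""

    def __init__(self, value):
        self.value = value
        self.ob_sinterned = None   # None models a NULL pointer
        self.cached_hash = None    # hypothetical name for the cached hash


_interned = {}  # models the interpreter's "interned" dictionary


def intern_in_place(s):
    """Return the canonical object for s.value, recording it on s."""
    # Fast path: s already knows its interned counterpart, so skip the lookup.
    if s.ob_sinterned is not None:
        return s.ob_sinterned
    canonical = _interned.get(s.value)
    if canonical is None:
        # First time this value is seen: s itself becomes the canonical object.
        _interned[s.value] = s
        canonical = s
    # Either points to itself (first case) or to the earlier object (second case).
    s.ob_sinterned = canonical
    return canonical


def str_hash(s):
    """Hash through the interned object, so its cached hash is shared."""
    target = s.ob_sinterned if s.ob_sinterned is not None else s
    if target.cached_hash is None:
        target.cached_hash = hash(target.value)
    return target.cached_hash


# Mirrors the session at the top of this message.
a = Str("A completely new string that I haven't used before")
b = a
c = intern_in_place(a)
d = Str("A completely new string that I haven't used before")
e = intern_in_place(d)

assert a.ob_sinterned is a   # first call: a points to itself
assert d.ob_sinterned is a   # second call: d points to a
assert e is a and e is not d
```

In this model the only things the extra field buys are the skipped dictionary lookup in intern_in_place and the shared cached hash in str_hash, which are the two optimisations described above.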
I don't know if that's clear, but I didn't want to include the whole source
listing. Anyway, the question is: is this the only reason for the extra
entry "ob_sinterned" in the PyString struct? That is, a couple of
optimisations, costing an extra 4 bytes per string object?