[Python-Dev] String interning

Martin v. Loewis martin@v.loewis.de
01 Jul 2002 09:03:22 +0200


Oren Tirosh <oren-py-d@hishome.net> writes:

> In stringobject.c most references to ob_sinterned are to initialize it. The
> only place that uses it is string_hash:  if ob_sinterned is not NULL it uses 
> the hash of the string it points to instead of the current string object. 

This is not true: PyString_InternInPlace has

	if ((t = s->ob_sinterned) != NULL) {

which checks whether the string being interned had been interned
before.

> Summary: As far as I can tell, indirectly interned strings are redundant. 
> Without them the ob_sinterned field is effectively a boolean flag.
> 
> Can anyone explain why interning is implemented the way it is?  Can anyone
> explain why Mac/Python/macimport.c is messing with ob_sinterned?

I'm not sure what meaning you would associate with the boolean
flag. If this is meant to denote "this is an interned string", then

	if ((t = s->ob_sinterned) != NULL) {
		if (t == (PyObject *)s)
			return;

would become

	if (s->ob_isinterned)
		return;

To see the difference, I added

	if ((t = s->ob_sinterned) != NULL) {
		if (t == (PyObject *)s)
			return;
		fprintf(stderr, "reinterning\n");

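Whenever that code prints "reinterning", the string is indirectly
interned: ob_sinterned already points to the interned copy, so the
argument can be interned without a dictionary lookup. With only a
boolean flag, that shortcut would be lost.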

I agree that this is very rare, but in the test suite, it triggers 5
times in test_descr.
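
To make the "reinterning" case concrete, here is a minimal sketch of
the three paths through an intern-in-place routine. It is a toy model,
not the code in stringobject.c: ToyString, toy_new, toy_intern_in_place
and toy_lookups are hypothetical names, a linear array stands in for
the interned dictionary, and reference counting is left out entirely.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a string object (assumption: not the real layout). */
typedef struct ToyString {
    struct ToyString *sinterned;  /* NULL, itself, or the interned copy */
    char *value;
} ToyString;

#define TOY_TABLE_MAX 64
static ToyString *toy_table[TOY_TABLE_MAX];  /* stands in for the dict */
static int toy_table_len = 0;
int toy_lookups = 0;                         /* counts table searches */

/* Allocate an uninterned toy string; returns NULL on allocation failure. */
ToyString *toy_new(const char *text)
{
    ToyString *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->value = malloc(strlen(text) + 1);
    if (s->value == NULL) {
        free(s);
        return NULL;
    }
    strcpy(s->value, text);
    s->sinterned = NULL;
    return s;
}

void toy_intern_in_place(ToyString **p)
{
    ToyString *s = *p;
    ToyString *t = s->sinterned;
    int i;

    if (t != NULL) {
        if (t == s)
            return;       /* directly interned: nothing to do */
        *p = t;           /* indirectly interned: reuse the pointer, */
        return;           /* no table search needed ("reinterning") */
    }
    toy_lookups++;        /* never interned before: search the table */
    for (i = 0; i < toy_table_len; i++) {
        if (strcmp(toy_table[i]->value, s->value) == 0) {
            s->sinterned = toy_table[i];  /* remember the interned copy */
            *p = toy_table[i];
            return;
        }
    }
    if (toy_table_len < TOY_TABLE_MAX) {
        toy_table[toy_table_len++] = s;   /* s becomes the interned copy */
        s->sinterned = s;
    }
}
```

With a boolean flag in place of the pointer, the second branch could
not hand back the interned copy and would have to fall through to the
table search.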

> The size of all string objects can be reduced by 3 bytes.

That is not true. Taking a 32-bit architecture, and considering that
each string has 16 bytes minimum storage (without ob_sinterned), and
taking into account the 8-byte clustering of pymalloc, we get

string-length  current-storage  new-storage  savings   (all in bytes)
0              24               24           0
1              24               24           0
2              24               24           0
3              24               24           0
4              32               24           8
5              32               24           8
6              32               24           8
7              32               32           0

So the size reduction depends on the actual length of the strings;
it's 3 bytes only on average, assuming a uniform distribution of
string sizes.
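
The table can be reproduced with a few lines of arithmetic. This is
only a sketch under assumptions: the 16-byte base and the 8-byte
rounding come from the text above, while the remaining terms (a 4-byte
ob_sinterned pointer, a 1-byte flag replacing it, and one trailing NUL
byte) are my reading of what makes the numbers come out as shown. The
function names are hypothetical.

```c
#include <stddef.h>

/* Round a request up to the next multiple of 8 bytes. */
size_t round_up8(size_t n)
{
    return (n + 7) & ~(size_t)7;
}

/* Assumed current layout: 16-byte base + 4-byte pointer + data + NUL. */
size_t current_storage(size_t len)
{
    return round_up8(16 + 4 + len + 1);
}

/* Assumed proposed layout: 16-byte base + 1-byte flag + data + NUL. */
size_t new_storage(size_t len)
{
    return round_up8(16 + 1 + len + 1);
}

/* Bytes saved for one string of the given length. */
size_t savings(size_t len)
{
    return current_storage(len) - new_storage(len);
}

/* Total savings over the eight lengths 0..7 (one full rounding period). */
size_t savings_over_period(void)
{
    size_t len, total = 0;
    for (len = 0; len < 8; len++)
        total += savings(len);
    return total;
}
```

Three of the eight lengths in a period save 8 bytes each, so the total
is 24 bytes over 8 lengths, which is where the 3-byte average comes
from.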

Regards,
Martin