[Python-Dev] Rethinking intern() and its data structure

John Arbash Meinel john.arbash.meinel at gmail.com
Thu Apr 9 20:20:11 CEST 2009

Alexander Belopolsky wrote:
> On Thu, Apr 9, 2009 at 11:02 AM, John Arbash Meinel
> <john at arbash-meinel.com> wrote:
> ...
>>  a) Don't keep a double reference to both key and value to the same
>>     object (1 pointer per entry), this could be as simple as using a
>>     Set() instead of a dict()
> There is a rejected patch implementing just that:
> http://bugs.python.org/issue1507011 .

Thanks for the heads up.

So reading that thread, the final reason it was rejected was 2 part:

  Without reviewing the patch again, I also doubt it is capable of
  getting rid of the reference count cheating: essentially, this
  cheating enables the interning dictionary to have weak references to
  strings, this is important to allow automatic collection of certain
  interned strings. This feature needs to be preserved, so the cheating
  in the reference count must continue.

That specific argument was invalid. Because the patch just changed the
refcount trickery to use +- 1. And I'm pretty sure Alexander's argument
was just that +- 2 was weird, not that the "weakref" behavior was bad.

The other argument against the patch was based on the idea that:
  The operation "give me the member equal but not identical to E" is
  conceptually a lookup operation; the mathematical set construct has no
  such operation, and the Python set models it closely. IOW, set is
  *not* a dict with key==value.

I don't know if there was any consensus reached on this, since only
Martin responded this way.

I can say that for my "do some work with a medium size code base", the
overhead of "interned" as a dictionary was 1.5MB out of 20MB total memory.

Simply changing it to a Set would drop this to 1.0MB. I have no proof
about the impact on performance, since I haven't benchmarked it yet.

Changing it to a StringSet could further drop it to 0.5MB. I would guess
that any performance impact would depend on whether the total size of
'interned' would fit inside L2 cache or not.

There is a small bug in the original patch adding the string to the set
failed. Namely it would return "t == NULL" which would be "t != s" and
the intern in place would end up setting your pointer to NULL rather
than doing nothing and clearing the error code.

So I guess some of it comes down to whether "loweis" would also reject
this change on the basis that mathematically a "set is not a dict".
Though given that his claim "nobody else is speaking in favor of the
patch", while at least Colin Winter has expressed some interest at this


More information about the Python-Dev mailing list