[Python-Dev] Rethinking intern() and its data structure

Thu Apr 9 19:05:24 CEST 2009

John Arbash Meinel wrote:
> When I looked at the actual references from interned, I saw mostly
> variable names. Considering that every variable goes through the python
> intern dict. And when you look at the intern function, it doesn't use
> setdefault logic, it actually does a get() followed by a set(), which
> means the cost of interning is 1-2 lookups depending on likelyhood, etc.
> (I saw a whole lot of strings as the error codes in win32all /
> winerror.py, and windows error codes tend to be longer-than-average
> variable length.)

I've read your posting twice but I'm still not sure if you are aware of
the most important feature of interned strings. In the first place
interning not about saving some bytes of memory but a speed
optimization. Interned strings can be compared with a simple and fast
pointer comparison. With interend strings you can simple write:

    char *a, *b;
    if (a == b) {
        ...
    }

Instead of:

    char *a, *b;
    if (strcmp(a, b) == 0) {
        ...
    }

A compiler can optimize the pointer comparison much better than a
function call.

> Anyway, I the internals of intern() could be done a bit better. Here are
> some concrete things:
> 
>   a) Don't keep a double reference to both key and value to the same
>      object (1 pointer per entry), this could be as simple as using a
>      Set() instead of a dict()
> 
>   b) Don't cache the hash key in the set, as strings already cache them.
>      (1 long per entry). This is a big win for space, but would need to
>      be balanced against lookup and collision resolving speed.
> 
>      My guess is that reducing the size of the set will actually improve
>      speed more, because more items can fit in cache. It depends on how
>      many times you need to resolve a collision. If the string hash is
>      sufficiently spread out, and the load factor is reasonable, then
>      likely when you actually find an item in the set, it will be the
>      item you want, and you'll need to bring the string object into
>      cache anyway, so that you can do a string comparison (rather than
>      just a hash comparison.)
> 
>   c) Use the existing lookup function one time. (PySet->lookup())
>      Sets already have a "lookup" which is optimized for strings, and
>      returns a pointer to where the object would go if it exists. Which
>      means the intern() function can do a single lookup resolving any
>      collisions, and return the object or insert without doing a second
>      lookup.
> 
>   d) Having a special structure might also allow for separate optimizing
>      of things like 'default size', 'grow rate', 'load factor', etc. A
>      lot of this could be tuned specifically knowing that we really only
>      have 1 of these objects, and it is going to be pointing at a lot of
>      strings that are < 50 bytes long.
> 
>      If hashes of variable name strings are well distributed, we could
>      probably get away with a load factor of 2. If we know we are likely
>      to have lots and lots that never go away (you rarely *unload*
>      modules, and all variable names are in the intern dict), that would
>      suggest having a large initial size, and probably a wide growth
>      factor to avoid spending a lot of time resizing the set.

I agree that a dict is not the most memory efficient data structure for
interned strings. However dicts are extremely well tested and highly
optimized. Any specialized data structure needs to be desinged and
tested very carefully. If you happen to break the interning system it's
going to lead to rather nasty and hard to debug problems.

>   e) How tuned is String.hash() for the fact that most of these strings
>      are going to be ascii text? (I know that python wants to support
>      non-ascii variable names, but I still think there is going to be an
>      overwhelming bias towards characters in the range 65-122 ('A'-'z').

Python 3.0 uses unicode for all names. You have to design something that
can be adopted to unicode, too. By the way do you know that dicts have
an optimized lookup function for strings? It's called lookdict_unicode /
 lookdict_string.

> Also note that the performance of the "interned" dict gets even worse on
> 64-bit platforms. Where the size of a 'dictentry' doubles, but the
> average length of a variable name wouldn't change.
> 
> Anyway, I would be happy to implement something along the lines of a
> "StringSet", or maybe the "InternSet", etc. I just wanted to check if
> people would be interested or not.

Since interning is mostly used in the core and extension modules you
might want to experiment with a different growth rate. The interning
data structure could start with a larger value and have a slower, non
progressive data growth rate.

Christian