[Python-Dev] Rethinking intern() and its data structure

John Arbash Meinel john.arbash.meinel at gmail.com
Thu Apr 9 21:59:02 CEST 2009


...

> I like your rationale (save memory) much more, and was asking in the
> tracker for specific numbers, which weren't forthcoming.
> 

...

> Now that you brought up specific numbers, I tried to verify them,
> and found them correct (although a bit unfortunate); please see my
> test script below. Up to 21800 interned strings, the dict takes (only)
> 384kiB. It then grows, requiring 1536kiB. Whether or not having 22k
> interned strings is "typical", I still don't know.

Given that every variable name in any file is interned, it can grow
pretty rapidly. As an extreme case, consider the file
"win32/lib/winerror.py" which tracks all possible win32 errors.

>>> import winerror
>>> print len(winerror.__dict__)
1872

So a single error file has 1.9k strings.

My python version (2.5.2) doesn't have 'sys.getsizeof()', but otherwise
your code looks correct.
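
(For anyone on 2.6+ where sys.getsizeof() does exist, here is a rough
sketch of that kind of check. This is not the original test script, just
the same idea: watch where the dict's allocated table jumps as strings
are added. Exact byte counts differ between 32-bit and 64-bit builds.)

import sys

d = {}
last = sys.getsizeof(d)
for i in xrange(30000):
    s = 'name_%d' % i
    d[s] = s          # mimic the interned dict: each string maps to itself
    size = sys.getsizeof(d)
    if size != last:
        print "%6d entries -> %8d bytes" % (len(d), size)
        last = size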

If all I do is find the interned dict, I see:
>>> print len(d)
5037

So stock python, without importing much extra (just os, sys, gc, etc.),
already has almost 5k interned strings.

I don't have a great regex yet for just extracting how many unique
strings there are in a given bit of source code.
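
A rough sketch of one possible approach, using the tokenize module
rather than a regex (it counts only NAME tokens, so it misses other
strings the compiler ends up interning, but it gives a ballpark
per-file figure):

import token, tokenize

def unique_names(path):
    # Collect the distinct identifier (NAME) tokens in one source file.
    names = set()
    f = open(path)
    try:
        for tok_type, tok_str, _, _, _ in tokenize.generate_tokens(f.readline):
            if tok_type == token.NAME:
                names.add(tok_str)
    finally:
        f.close()
    return names

print len(unique_names('winerror.py'))  # path is just an example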

However, if I do:

import gc, sys

def find_interned_dict():
    # Find the interned-string dict by scanning every object the GC tracks.
    cand = None
    for o in gc.get_objects():
        if not isinstance(o, dict):
            continue
        # The interned dict must contain this function's own (interned) name.
        if "find_interned_dict" not in o:
            continue
        # In the interned dict every key maps to itself (key is value).
        for k, v in o.iteritems():
            if k is not v:
                break
        else:
            assert cand is None, "found more than one candidate dict"
            cand = o
    return cand

d = find_interned_dict()
print len(d)

# Just import a few of the core structures
from bzrlib import branch, repository, workingtree, builtins
print len(d)

I start at 5k strings, and after just importing the important bits of
bzrlib, I'm at 19,316.

Now, the bzrlib source code isn't particularly huge. It is about 3.7MB /
91k lines of .py files (that is, not counting the test suite).

Memory consumption with just importing bzrlib shows up at 15MB, with
300kB taken up by the intern dict.

If I then import some extra bits of bzrlib, like http support, ftp
support, and sftp support (which brings in python's httplib, and
paramiko, an ssh/sftp implementation), I'm up to:
>>> print len(d)
25186

Memory has jumped to 23MB (the interned dict is now 1.57MB), and I
haven't actually done anything but import python code yet. If I sum the
size of the PyString objects held in intern(), it amounts to 940KB,
though they refer to only 335KB of char data (an average of about 13
bytes per string).
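
The summing itself is simple once you have the dict. A minimal sketch,
assuming a Python with sys.getsizeof() (2.6+) and the 'd' returned by
find_interned_dict() above:

import sys

obj_bytes = sum(sys.getsizeof(s) for s in d)   # whole PyString objects
char_bytes = sum(len(s) for s in d)            # just the character data
print "objects: %dKB  chars: %dKB  avg %.1f bytes/string" % (
    obj_bytes / 1024, char_bytes / 1024, float(char_bytes) / len(d))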

> 
> Wrt. your proposed change, I would be worried about maintainability,
> in particular if it would copy parts of the set implementation.

Right, so for the first part, I would just use a set() instead of the
dict, as intern() could then save about 1/3rd of the memory it uses
today (dropping down to 1MB from 1.5MB).
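
A quick back-of-the-envelope check of that ratio (just a sketch,
assuming sys.getsizeof() from 2.6+): each dict entry stores a hash, a
key pointer, and a value pointer, while a set entry drops the value
pointer, so the table should come out roughly 2/3 the size.

import sys

strings = ['sym_%d' % i for i in xrange(25000)]
as_dict = dict((s, s) for s in strings)   # how intern() stores them today
as_set = set(strings)                     # what a set-based intern would use
print "dict: %dKB  set: %dKB" % (sys.getsizeof(as_dict) / 1024,
                                 sys.getsizeof(as_set) / 1024)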

I don't have numbers on how much that would improve CPU times; I would
imagine improving 'intern()' would affect import times more than run
times, simply because import time interns a *lot* of strings.
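
A crude probe of that cost (not a real import-time benchmark, just a
sketch): compare formatting fresh strings with and without the intern()
call; the difference is roughly the price of the hash-table lookups.

import timeit

base = timeit.Timer("for i in xrange(20000): 'name_%d' % i")
with_intern = timeit.Timer("for i in xrange(20000): intern('name_%d' % i)")
print "format only: %.3fs   format + intern(): %.3fs" % (
    base.timeit(number=10), with_intern.timeit(number=10))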

Though honestly, Bazaar would really like this, because startup overhead
for us is almost 400ms to 'do nothing', which is a lot for a command
line app.

John
=:->


