[Python-Dev] Rethinking intern() and its data structure

John Arbash Meinel john.arbash.meinel at gmail.com
Fri Apr 10 06:38:55 CEST 2009

>> Somewhat true, though I know it happens 25k times during startup of
>> bzr... And I would be a *lot* happier if startup time was 100ms instead
>> of 400ms.
> I don't want to quash your idealism too severely, but it is extremely
> unlikely that you are going to get anywhere near that kind of speed up
> by tweaking string interning.  25k times doing anything (computation)
> just isn't all that much.
> $ python -mtimeit -s 'd=dict.fromkeys(xrange(10000000))' 'for x in
> xrange(25000): d.get(x)'
> 100 loops, best of 3: 8.28 msec per loop
> Perhaps this isn't representative (int hashing is ridiculously cheap,
> for instance), but the dict itself is far bigger than the dict you are
> dealing with and such would have similar cache-busting properties.  And
> yet, 25k accesses (plus python->c dispatching costs which you are paying
> with interning) consume only ~10ms.  You could do more good by
> eliminating a handful of disk seeks by reducing the number of imported
> modules...
> -Mike

You're also using timeit over the same set of 25k keys, which means it
only has to load that subset. And as you are using identical runs each
time, those keys are already loaded into your cache lines... And given
how hash(int) works, they are all sequential in memory, and all 10M in
your original set have 0 collisions. Actually, at 10M, you'll have a
dict of size 20M entries, and the first 10M entries will be full, and
the trailing 10M entries will all be empty.

That said, you're right, the benefits of a smaller structure are going
to be small. I'll just point that if I just do a small tweak to your
timing and do:

$ python -mtimeit -s 'd=dict.fromkeys(xrange(10000000))' 'for x in
  xrange(25000): d.get(x)'
100 loops, best of 3: 6.27 msec per loop

So slightly faster than yours, *but*, lets try a much smaller dict:

$ python -mtimeit -s 'd=dict.fromkeys(xrange(25000))' 'for x in
  xrange(25000): d.get(x)'
100 loops, best of 3: 6.35 msec per loop

Pretty much the same time. Well within the noise margin. But if I go
back to the "big dict" and actually select 25k keys across the whole set:

$ TIMEIT -s 'd=dict.fromkeys(xrange(10000000));' \
 -s keys=range(0,10000000,10000000/25000)' \
 'for x in keys: d.get(x)'
100 loops, best of 3: 13.1 msec per loop

Now I'm still accessing 25k keys, but I'm doing it across the whole
range, and suddenly the time *doubled*.

What about slightly more random access:
$ TIMEIT -s 'import random; d=dict.fromkeys(xrange(10000000));'
	-s 'bits = range(0, 10000000, 400); random.shuffle(bits)'\
 	'for x in bits: d.get(x)'
100 loops, best of 3: 15.5 msec per loop

Not as big of a difference as I thought it would be... But I bet if
there was a way to put the random shuffle in the inner loop, so you
weren't accessing the same identical 25k keys internally, you might get
more interesting results.

As for other bits about exercising caches:

$ shuffle(range(0, 10000000, 400))
100 loops, best of 3: 15.5 msec per loop

$ shuffle(range(0, 10000000, 40))
10 loops, best of 3: 175 msec per loop

10x more keys, costs 11.3x, pretty close to linear.

$ shuffle(range(0, 10000000, 10))
10 loops, best of 3: 739 msec per loop

4x the keys, 4.5x the time, starting to get more into nonlinear effects.

Anyway, you're absolutely right. intern() overhead is a tiny fraction of
'import bzrlib.*' time, so I don't expect to see amazing results. That
said, accessing 25k keys in a smaller structure is 2x faster than
accessing 25k keys spread across a larger structure.


More information about the Python-Dev mailing list