[Python-Dev] Rethinking intern() and its data structure

John Arbash Meinel john.arbash.meinel at gmail.com
Thu Apr 9 22:22:04 CEST 2009


Martin v. Löwis wrote:
>> I don't have numbers on how much that would improve CPU times, I would
>> imagine improving 'intern()' would impact import times more than run
>> times, simply because import time is interning a *lot* of strings.
>>
>> Though honestly, Bazaar would really like this, because startup overhead
>> for us is almost 400ms to 'do nothing', which is a lot for a command
>> line app.
> 
> Maybe I misunderstand your proposed change: how could the representation
> of the interning dict possibly change the runtime of interning? (let
> alone significantly)
> 
> Regards,
> Martin
> 

Decreasing memory consumption lets more things fit in cache. Once the
size of 'interned' is greater than fits into L2 cache, you start paying
the cost of a full memory fetch, which is usually measured in 100s of
cpu cycles.

Avoiding double lookups in the dictionary would be less overhead, though
the second lookup is probably pretty fast if there are no collisions,
since everything would already be in the local CPU cache.

If we were dealing in objects that were KB in size, it wouldn't matter.
But as the intern dict quickly gets into MB, it starts to make a bigger
difference.

How big of a difference would be very CPU and dataset size specific. But
certainly caches make certain things much faster, and once you overflow
a cache, performance can take a surprising turn.

So my primary goal is certainly a decrease of memory consumption. I
think it will have a small knock-on effect of improving performance, I
don't have anything to give concrete numbers.

Also, consider that resizing has to evaluate every object, thus paging
in all X bytes, and assigning to another 2X bytes. Cutting X by
(potentially 3), would probably have a small but measurable effect.

John
=:->


More information about the Python-Dev mailing list