roy at panix.com
Mon May 15 15:11:21 CEST 2006
In article <1147699064.107490 at teuthos>, Chris Foote <chris at foote.com.au> wrote:
> Hi all.
> I have the need to store a large (10M) number of keys in a hash table,
> based on a tuple of (long_integer, integer). The standard python
> dictionary works well for small numbers of keys, but starts to
> perform badly for me inserting roughly 5M keys:
> # keys   dictionary   metakit   (both using psyco)
> ------   ----------   -------
>     1M         8.8s     22.2s
>     2M        24.0s     43.7s
>     5M       115.3s    105.4s
Are those clock times or CPU times?
How much memory is your process using, and how much is available on your machine?
I'm guessing an integer takes 4 bytes and a long integer takes roughly one
byte per two decimal digits. Plus a few more bytes to bundle them up into
a tuple. You've probably got something like 20 bytes per key, so 5M of
them is 100 meg just for the keys.
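On a modern CPython you can check these per-key guesses directly with `sys.getsizeof` (the exact numbers here are assumptions and vary by Python version and platform):

```python
import sys

# Sample key of the shape described: (long_integer, integer).
key = (10**12, 42)

# Shallow size of the tuple plus the sizes of the two integers it holds.
per_key = sys.getsizeof(key) + sum(sys.getsizeof(x) for x in key)
n_keys = 5 * 10**6

print("bytes per key:", per_key)
print("MB for 5M keys: ~%.0f" % (per_key * n_keys / 1e6))
```

Note this counts each key's objects in isolation; small integers are cached and interned by CPython, so the real total can be somewhat lower.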
To get reasonable hashing performance, the hash table needs to be maybe
half full, and figure a hash key is 4 bytes, so that's another 40 meg for
the hash table itself.
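A rough way to see the table overhead by itself (not the keys or values) is `sys.getsizeof` on the dict object; the 1M-key scale here is just an assumption to keep the sketch quick:

```python
import sys

# Build a dict keyed by (long_integer, integer) tuples with no real values.
d = {(i * 10**9, i): None for i in range(1_000_000)}

# Shallow size of the dict object: its internal hash table,
# not the key tuples or the values they point at.
table_mb = sys.getsizeof(d) / 1e6
print("dict table alone: ~%.0f MB for 1M keys" % table_mb)
```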
Plus whatever the values stored in the dictionary take up. Even if you're
not storing any values, there's probably 4 bytes per entry for a null pointer
(or a reference to None), so that's another 40 meg.
These are all vague guesses, based on what I think are probably
conservative estimates of what various objects must take up in memory, but
you see where this is going. We're already up to 180 meg. I wonder if the
whole problem is that you're just paging yourself to death. A few minutes
watching your system memory performance with ps or top while your program
is running might give you some idea if this is the case.
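Besides ps or top, a process can report its own peak memory from inside; on Unix the standard-library `resource` module exposes it (units differ by platform: kilobytes on Linux, bytes on macOS):

```python
import resource

# Peak resident set size of this process so far (ru_maxrss).
# Linux reports kilobytes, macOS reports bytes -- check your platform.
usage = resource.getrusage(resource.RUSAGE_SELF)
print("peak RSS:", usage.ru_maxrss)
```

Printing this before and after the 5M inserts would show how much of the growth is real resident memory rather than guesswork.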
More information about the Python-list mailing list