[issue9520] Add Patricia Trie high performance container

Antoine Pitrou report at bugs.python.org
Sat Aug 14 13:16:52 CEST 2010


Antoine Pitrou <pitrou at free.fr> added the comment:

> For example, on the x64 machine the following dict() mapping 
> 10,000,000 very short unicode keys (~7 chars) to integers eats 149
> bytes per entry. 

This is counting the keys too. Under 3.x:

>>> d = {}
>>> for i in range(0, 10000000): d[str(i)] = i
... 
>>> sys.getsizeof(d)
402653464
>>> sys.getsizeof(d) / len(d)
40.2653464

So, the dict itself uses ~40 bytes/entry. Since this is a 64-bit Python build, each entry uses three words of 8 bytes each, that is 24 bytes per entry (one word for the key, one word for the associated value, one word for the cached hash value). So, you see the ratio of allocated entries in the hash table over used entries is only a bit above 2, which is reasonable.

Do note that unicode objects themselves are not that compact:

>>> sys.getsizeof("1000000")
72

If you have many of them, you might use bytestrings instead:

>>> sys.getsizeof(b"1000000")
40

I've modified your benchmark to run under 3.x and will post it in a later message (I don't know whether bio.trie exists for 3.x, though).

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue9520>
_______________________________________


More information about the Python-bugs-list mailing list