[issue9520] Add Patricia Trie high performance container (python's defaultdict(int) is unusable on datasets with 10, 000, 000+ keys.)

Thu Aug 5 02:59:23 CEST 2010

New submission from Dmitry Chichkov <dchichkov at gmail.com>:

On large data sets (10-100 million keys) the default python dictionary implementation fails to meet memory and performance constraints. It also apparently fails to keep O(1) complexity (after just 1M keys). As such, there is a need for good, optimized, practical implementation of a high-performance container that can store such datasets. 

One of the alternatives is a regular Patricia Trie data structure. It can meet performance requirements on such datasets. Criteria:
* strictly O(1);
* works well on large datasets (~100M keys); memory efficient;
* same or better performance as dict();
* supports regular dict | counter interface;
* supports unicode keys;
* supports any characters in the keys;
* persistent (can be pickled); 
* in-memory library;

There are a few existing implementations available:
* BioPython trie - http://biopython.org/
* py-radix - http://www.mindrot.org/projects/py-radix/
* PyPi trie - http://pypi.python.org/pypi/trie

A few other relevant alternatives/implementations:
* NIST trie - http://www.itl.nist.gov/div897/sqg/dads/HTML/patriciatree.html
* C++ Trie Library - http://wikipedia-clustering.speedblue.org/trie.php
* PyAvl trie - http://pypi.python.org/pypi/pyavl/1.12_1
* PyJudy tree - http://www.dalkescientific.com/Python/PyJudy.html
* Redis - http://code.google.com/p/redis/
* Memcached - http://memcached.org/

An alternative to a basic Patricia Trie could be some state-of-the-art approach from some modern research (i.e. http://www.aclweb.org/anthology/W/W09/W09-1505.pdf ), 

The best existing implementation I've been able to find so far is one in the BioPython. Compared to defaultdict(int) on the task of counting words. Dataset 123,981,712 words (6,504,484 unique), 1..21 characters long:
* bio.tree - 459 Mb/0.13 Hours, good O(1) behavior
* defaultdict(int) - 693 Mb/0.32 Hours, poor, almost O(N) behavior

At 8,000,0000 keys python defaultdict(int) starts showing almost O(N) behavior and gets unusable with 10,000,000+ unique keys. 

A few usage/installatio notes on BioPython trie:
$ sudo apt-get install python-biopython

>>> from Bio import trie
>>> trieobj = trie.trie()
>>> trieobj["hello"] = 5
>>> trieobj["hello"] += 1
>>> print trieobj["hello"]
>>> print trieobj.keys()

More examples at:
http://python-biopython.sourcearchive.com/documentation/1.54/test__triefind_8py-source.html

----------
components: Library (Lib)
messages: 112937
nosy: dmtr
priority: normal
severity: normal
status: open
title: Add Patricia Trie high performance container (python's defaultdict(int) is unusable on datasets with 10,000,000+ keys.)
type: performance
versions: Python 3.1, Python 3.2, Python 3.3

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue9520>
_______________________________________