matching strings in a large set of strings
Stefan Behnel
stefan_ml at behnel.de
Sat May 1 09:05:05 EDT 2010
Duncan Booth, 30.04.2010 10:20:
> So more than 3GB just for the strings (and that's for Python 2.x on
> Python 3.x you'll need nearly 5GB).
>
> Running on a 64 bit version of Python should be fine, but for a 32 bit
> system a naive approach just isn't going to work.
>
> Option 1: use a trie. That should reduce storage, maybe it will reduce
> it enough, maybe not. It depends on the data.
Depending on the implementation and the data, a trie can also use a lot
/more/ space than the set of strings that it contains. The 83 million 14
character strings can well include a relatively large subset of the
possible permutations (imagine purely numeric, hex-numeric or lower-case
alpha-numeric strings, for example), so the trie will likely need to branch
very often with very low intermediate run length. If you use pointers for
trie branches, that's at least 8 bytes per branch on a 64bit system, versus
1 byte per character in a byte string list. Depending on the ratio of
branches to characters, one or the other may win. So a "naive approach"
likely won't work for tries either.
Stefan
More information about the Python-list
mailing list