matching strings in a large set of strings

Duncan Booth duncan.booth at invalid.invalid
Fri Apr 30 04:20:12 EDT 2010


Paul Rudin <paul.nospam at rudin.co.uk> wrote:

> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days? I suppose if it's low on ram you might start
> swapping which will kill performance. Perhaps the method you're using
> to build the data structures creates lots of garbage? How much ram do
> you have and how much memory does the python process use as it builds
> your data structures?

Some simple experiments should show you that a stock PC running a 32 bit 
Python will struggle:

>>> s = "12345678901234"
>>> sys.getsizeof(s)
38
>>> 83*38
3154

So more than 3GB just for the strings (and that's for Python 2.x; on 
Python 3.x you'll need nearly 5GB).

Running on a 64 bit version of Python should be fine, but for a 32 bit 
system a naive approach just isn't going to work.

Option 1: use a trie. That should reduce storage; maybe it will reduce 
it enough, maybe not. It depends on the data.
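
A minimal sketch of the nested-dict flavour of a trie, just to show the 
idea (the function names and test strings below are mine, not from the 
original question):

def add_word(trie, word):
    """Insert a word into a nested-dict trie."""
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[None] = True  # sentinel marking "a word ends here"

def contains(trie, word):
    """Return True if word was previously added to the trie."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return None in node

trie = {}
add_word(trie, "12345678901234")
print(contains(trie, "12345678901234"))  # True
print(contains(trie, "12345678901299"))  # False

Note that a plain dict-of-dicts trie like this usually costs *more* 
memory than a set of the same strings, because of per-dict overhead; the 
savings only show up with a compact representation (nodes packed into 
arrays, a DAWG, and so on), which is why it may or may not be enough.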

Option 2: use a simple database. e.g. shelve. Simple to write and easy 
to use.
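
Something along these lines would do; the file names here ("strings.txt" 
and "strings.db") are only placeholders:

import shelve

# One-off build: key every string in the shelf (the value is irrelevant).
db = shelve.open("strings.db")
for line in open("strings.txt"):
    db[line.strip()] = True
db.close()

# Later lookups hit the disk-backed database instead of RAM.
db = shelve.open("strings.db", flag="r")
print("12345678901234" in db)
db.close()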

Option 3: use a linear search through the file containing the 83 million 
strings. If the OP really does want to check *one* string then that is 
comparatively slow but still much faster than posting the question here. 
If they really wanted to check, say, 10,000 strings, then put those 
strings in a set and test each line of the 83 million line file against 
the set instead of doing it the other way round. At some number of test strings 
this is probably faster than using a database.
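
As a sketch of that approach (file names invented for the example):

# Load the small set of strings to be checked...
queries = set(line.strip() for line in open("queries.txt"))

# ...then stream the big file past it, one line at a time.
found = set()
for line in open("big_strings.txt"):
    s = line.strip()
    if s in queries:
        found.add(s)

missing = queries - found
print(len(found), "found,", len(missing), "missing")

Only the 10,000 query strings live in memory; the 83 million line file 
is read sequentially and never loaded at once.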


-- 
Duncan Booth http://kupuguy.blogspot.com


