matching strings in a large set of strings
Duncan Booth
duncan.booth at invalid.invalid
Fri Apr 30 04:20:12 EDT 2010
Paul Rudin <paul.nospam at rudin.co.uk> wrote:
> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days? I suppose if it's low on ram you might start
> swapping which will kill performance. Perhaps the method you're using
> to build the data structures creates lots of garbage? How much ram do
> you have and how much memory does the python process use as it builds
> your data structures?
Some simple experiments should show you that a stock PC running a 32 bit
Python will struggle:
>>> import sys
>>> s = "12345678901234"
>>> sys.getsizeof(s)
38
>>> 83000000 * 38
3154000000
So more than 3GB just for the strings (and that's for Python 2.x; on
Python 3.x you'll need nearly 5GB).
Running on a 64 bit version of Python should be fine, but for a 32 bit
system a naive approach just isn't going to work.
Option 1: use a trie. That should reduce storage; maybe it will reduce
it enough, maybe not. It depends on how much your strings share prefixes.
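Something along these lines would do (an untested sketch with my own
function names, not a drop-in answer):

def trie_add(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[None] = True          # marker: a complete string ends here

def trie_contains(root, word):
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return None in node

root = {}
trie_add(root, "12345678901234")
trie_contains(root, "12345678901234")   # -> True
trie_contains(root, "99999999999999")   # -> False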
Option 2: use a simple database, e.g. shelve. Simple to write and easy
to use.
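For example (untested sketch; the file name "strings.db" and the dummy
values are just illustrative):

import shelve

def build_db(lines_path, db_path="strings.db"):
    # one pass over the big file, storing each string as a key
    db = shelve.open(db_path, flag="n")
    with open(lines_path) as f:
        for line in f:
            db[line.strip()] = 1   # value is irrelevant, only the key matters
    db.close()

def have_string(candidate, db_path="strings.db"):
    db = shelve.open(db_path, flag="r")
    try:
        return candidate in db
    finally:
        db.close()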
Option 3: use a linear search through the file containing the 83 million
strings. If the OP really does want to check *one* string then that is
comparatively slow but still much faster than posting the question here.
If they really wanted to check, say, 10,000 strings, then put those strings
in a set and test each line of the 83-million-line file against the set
instead of doing it the other way round. At some number of test strings
this is probably faster than using a database.
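Roughly like this (untested; the file name and candidate strings are made
up):

def find_matches(big_file_path, candidates):
    wanted = set(candidates)       # the small collection of test strings
    found = set()
    with open(big_file_path) as f:
        for line in f:
            line = line.strip()
            if line in wanted:
                found.add(line)
    return found

matches = find_matches("all_strings.txt", ["12345678901234", "99999999999999"])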
--
Duncan Booth http://kupuguy.blogspot.com