matching strings in a large set of strings
steve at REMOVE-THIS-cybersource.com.au
Fri Apr 30 10:41:10 CEST 2010
On Fri, 30 Apr 2010 08:23:39 +0100, Paul Rudin wrote:
> "Karin Lagesen" <karin.lagesen at bio.uio.no> writes:
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>> Now, I have tried storing these strings as a list, a set and a
>> dictionary. I know that finding things in a set and a dictionary is
>> very much faster than working with a list, so I tried those first.
>> However, I run out of memory building both the set and the dictionary,
>> so what I seem to be left with is the list, and using the in method.
>> I imagine that there should be a faster/better way than this?
> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days?
Not even close. Using Python 2.6:
>>> s = "12345678901234"
>>> assert len(s) == 14
>>> import sys
So a single 14 char string takes 38 bytes.
>>> import random, string
>>> chars = list(string.letters + string.digits)*4
>>> def rnd_str():
... return ''.join(chars[:14])
>>> s = set()
>>> while len(s) < 83000:
So a set with 83000 such strings takes approximately 1 MB. So far fairly
trivial. But that's just the memory used by the container (the set), not
the contents. 38 bytes * 83,000 strings = another 3 MB. Which of course
is trivial for a modern PC, but the OP is using 83 million such strings,
not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
entry level desktop PC these days is generally 2GB, and entry level
notebooks might have half a gig.
If the OP is on a 64 bit system, every pointer will be twice the size,
leading to even larger memory requirements. And if the strings are
unicode, then potentially they could be anything up to four times larger
each. Worst case, the OP could need something of the order of 24 GB to
store the strings all in memory.
More information about the Python-list