matching strings in a large set of strings

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Fri Apr 30 10:41:10 CEST 2010


On Fri, 30 Apr 2010 08:23:39 +0100, Paul Rudin wrote:

> "Karin Lagesen" <karin.lagesen at bio.uio.no> writes:
> 
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
>> Now, I have tried storing these strings as a list, a set and a
>> dictionary. I know that finding things in a set and a dictionary is
>> very much faster than working with a list, so I tried those first.
>> However, I run out of memory building both the set and the dictionary,
>> so what I seem to be left with is the list, and using the in method.
>>
>> I imagine that there should be a faster/better way than this?
>>
>>
> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days? 

Not even close. Using Python 2.6:

>>> s = "12345678901234"
>>> assert len(s) == 14
>>> import sys
>>> sys.getsizeof(s)
38

So a single 14 char string takes 38 bytes.


>>> import random, string
>>> chars = list(string.letters + string.digits)*4
>>> def rnd_str():
...     random.shuffle(chars)
...     return ''.join(chars[:14])
...
>>> s = set()
>>> while len(s) < 83000:
...     s.add(rnd_str())
...
>>> sys.getsizeof(s)
1048688


So a set with 83000 such strings takes approximately 1 MB. So far fairly 
trivial. But that's just the memory used by the container (the set), not 
the contents. 38 bytes * 83,000 strings = another 3 MB. Which of course 
is trivial for a modern PC, but the OP is using 83 million such strings, 
not 83 thousand, which gives us a grand total of at least 3 gigabytes. An 
entry level desktop PC these days is generally 2GB, and entry level 
notebooks might have half a gig.

If the OP is on a 64 bit system, every pointer will be twice the size, 
leading to even larger memory requirements. And if the strings are 
unicode, then potentially they could be anything up to four times larger 
each. Worst case, the OP could need something of the order of 24 GB to 
store the strings all in memory.


-- 
Steven



More information about the Python-list mailing list