matching strings in a large set of strings
Bryan
bryanjugglercryptographer at yahoo.com
Mon May 3 11:53:18 EDT 2010
Karin Lagesen wrote:
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
[...]
> I run out of memory building both the set and the dictionary, so
> what I seem to be left with is the list
I agree with the recommendations to try modules that maintain the set
on disk (sqlite, dbm, shelve), but here's an in-memory solution: Hash
to about 8 million buckets, and use a compact representation for the
several strings in each bucket. That gives you most of the speed and
avoids most of the memory overhead. Python makes it easy:
class ConstLenStrSet:
""" Keep a set of strings all of the same length.
Support set.add() and Python's 'in' test.
"""
def __init__(self, strlen=14, nbuckets=8000000):
assert strlen > 0 and nbuckets > 0
self.strlen = strlen
self.nbuckets = nbuckets | 1
self.table = [''] * self.nbuckets
def _hashstr(self, s):
return hash(s) % self.nbuckets
def add(self, s):
assert len(s) == self.strlen
if s not in self:
self.table[self._hashstr(s)] += s
def __contains__(self, s):
data = self.table[self._hashstr(s)]
for i in range(0, len(data), self.strlen):
if data[i : i + self.strlen] == s:
return True
return False
# Rudimentary demo-test:
from random import choice
from string import letters
def rnd_str(slen=14):
return ''.join(choice(letters) for _ in range(slen))
# Test against Python's built-in set
sref = set()
for i in range(830000):
sref.add(rnd_str())
print('Strings generated.')
sset = sset = ConstLenStrSet(14, 80000)
for s in sref:
sset.add(s)
print 'ConstLenStrSet loaded.'
for s in sref:
assert s in sset
for i in range(1000):
s = rnd_str()
if s in sref:
print 'That was unlucky.'
else:
assert s not in sset
If building the set is too slow, and you know you don't have a lot of
duplicate strings, you can use a faster insert method that doesn't
check whether the string is already in the set:
def add_quick(self, s):
assert len(s) == self.strlen
self.table[self._hashstr(s)] += s
--
--Bryan
More information about the Python-list
mailing list