matching strings in a large set of strings

Paul Rudin paul.nospam at rudin.co.uk
Fri Apr 30 03:23:39 EDT 2010


"Karin Lagesen" <karin.lagesen at bio.uio.no> writes:

> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?
>

Shouldn't a set of 83 million 14-character strings be fine in memory
on a stock PC these days? I suppose if it's low on RAM you might start
swapping, which will kill performance. Perhaps the method you're using to
build the data structures creates lots of garbage? How much RAM do you
have, and how much memory does the Python process use as it builds your
data structures?
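
If you want to see where the memory actually goes, something along these
lines (Unix-only, untested, and just a sketch) will report the process's
peak memory before and after building whatever structure you use:

import resource

def peak_memory_mb():
    # ru_maxrss is in kilobytes on Linux (bytes on Mac OS X)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print("peak RSS before building: %.1f MB" % peak_memory_mb())
# ... build your set/dict here ...
print("peak RSS after building:  %.1f MB" % peak_memory_mb())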

A set should give good performance if the target string is also 14
characters.
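
For what it's worth, a minimal sketch of the set approach, assuming the
strings sit one per line in a file ("strings.txt" is just a made-up name,
adjust to however you actually load them):

with open("strings.txt") as f:
    string_set = set(line.rstrip("\n") for line in f)

query = "ABCDEFGHIJKLMN"          # some 14-character query string
print(query in string_set)        # membership test is a single hash lookup

Each lookup then costs the same regardless of how many strings are stored.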

If you do end up with the list, then try using the bisect module
<http://docs.python.org/library/bisect.html> on a sorted copy, which should
be quicker than just using "in" (which I think just scans linearly through
the list, testing for equality).
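
A rough sketch of the sorted-list-plus-bisect approach (same made-up
filename; note the list has to be sorted once before bisect can be used):

import bisect

with open("strings.txt") as f:
    strings = [line.rstrip("\n") for line in f]
strings.sort()

def contains(sorted_list, value):
    # bisect_left finds the insertion point; check the element there
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

print(contains(strings, "ABCDEFGHIJKLMN"))

That gives you O(log n) lookups instead of the linear scan you get from
"in" on a plain list.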

There are other algorithms with better theoretical performance than bisect
for this particular problem, but you get into trade-offs between executing
things in Python as opposed to C if you start to hand-code things in
Python.


