matching strings in a large set of strings

Peter Otten __peter__ at web.de
Thu Apr 29 06:06:05 EDT 2010


Karin Lagesen wrote:

> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
> 
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
> 
> I imagine that there should be a faster/better way than this?

Do you need all matches or do you just have to know whether there are any? 
Can the search string be shorter than 14 characters?

One simple improvement over the list may be to use one big string instead of 
the 83 million short ones and then search it with string methods.
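
A minimal sketch of that idea, assuming every stored string is exactly 14 
characters long and you only need a yes/no answer for a full-length search 
string. The names build_blob and contains are made up for illustration. Note 
that a plain "key in blob" test could also report a hit that straddles two 
neighbouring records, so the sketch only accepts matches that start at a 
multiple of the record length.

```python
RECORD_LEN = 14  # assumption: every stored string has exactly this length


def build_blob(strings):
    """Concatenate fixed-width strings into one big string (no separators)."""
    return "".join(strings)


def contains(blob, key, width=RECORD_LEN):
    """Return True if key is one of the fixed-width records stored in blob."""
    if len(key) != width:
        raise ValueError("key must be exactly %d characters long" % width)
    pos = blob.find(key)
    while pos != -1:
        # Only a hit that starts on a record boundary is a real record.
        if pos % width == 0:
            return True
        # Otherwise the hit spans two records; keep looking further on.
        pos = blob.find(key, pos + 1)
    return False
```

This is still a linear scan, but it runs inside str.find() rather than in a 
Python-level loop, and the data is held as roughly 83 million * 14 characters 
in a single object instead of 83 million separate string objects, each with 
its own per-object overhead.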

Peter


