matching strings in a large set of strings

Duncan Booth duncan.booth at invalid.invalid
Thu Apr 29 10:40:14 EDT 2010


MRAB <python at mrabarnett.plus.com> wrote:

> Karin Lagesen wrote:
>> Hello.
>> 
>> I have approx 83 million strings, all 14 characters long. I need to
>> be able to take another string and find out whether this one is
>> present within the 83 million strings.
>> 
>> Now, I have tried storing these strings as a list, a set and a
>> dictionary. I know that finding things in a set and a dictionary is
>> very much faster than working with a list, so I tried those first.
>> However, I run out of memory building both the set and the
>> dictionary, so what I seem to be left with is the list, and using the
>> in method. 
>> 
>> I imagine that there should be a faster/better way than this?
>> 
> You could sort the list and then use a binary chop (Google can tell
> you what that is if you don't know already).
> 

... and Python comes batteries included. You can use bisect.bisect_left to 
do the binary chop and then you just have to check whether it found the 
string.


-- 
Duncan Booth http://kupuguy.blogspot.com



More information about the Python-list mailing list