matching strings in a large set of strings
duncan.booth at invalid.invalid
Thu Apr 29 16:40:14 CEST 2010
MRAB <python at mrabarnett.plus.com> wrote:
> Karin Lagesen wrote:
>> I have approx 83 million strings, all 14 characters long. I need to
>> be able to take another string and find out whether this one is
>> present within the 83 million strings.
>> Now, I have tried storing these strings as a list, a set and a
>> dictionary. I know that finding things in a set and a dictionary is
>> very much faster than working with a list, so I tried those first.
>> However, I run out of memory building both the set and the
>> dictionary, so what I seem to be left with is the list, and using the
>> in method.
>> I imagine that there should be a faster/better way than this?
> You could sort the list and then use a binary chop (Google can tell
> you what that is if you don't know already).
... and Python comes batteries included. You can use bisect.bisect_left to
do the binary chop and then you just have to check whether it found the
Duncan Booth http://kupuguy.blogspot.com
More information about the Python-list