matching strings in a large set of strings
Duncan Booth
duncan.booth at invalid.invalid
Thu Apr 29 10:40:14 EDT 2010
MRAB <python at mrabarnett.plus.com> wrote:
> Karin Lagesen wrote:
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to
>> be able to take another string and find out whether this one is
>> present within the 83 million strings.
>>
>> Now, I have tried storing these strings as a list, a set and a
>> dictionary. I know that finding things in a set and a dictionary is
>> very much faster than working with a list, so I tried those first.
>> However, I run out of memory building both the set and the
>> dictionary, so what I seem to be left with is the list, and using the
>> in method.
>>
>> I imagine that there should be a faster/better way than this?
>>
> You could sort the list and then use a binary chop (Google can tell
> you what that is if you don't know already).
>
... and Python comes batteries included. You can use bisect.bisect_left to
do the binary chop and then you just have to check whether it found the
string.
--
Duncan Booth http://kupuguy.blogspot.com
More information about the Python-list
mailing list