matching strings in a large set of strings

Albert van der Horst albert at spenarnc.xs4all.nl
Sun May 2 08:23:45 EDT 2010


In article <877hnpjtdw.fsf at rudin.co.uk>,
Paul Rudin  <paul.nospam at rudin.co.uk> wrote:
>"Karin Lagesen" <karin.lagesen at bio.uio.no> writes:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
>> Now, I have tried storing these strings as a list, a set and a dictionary.
>> I know that finding things in a set and a dictionary is very much faster
>> than working with a list, so I tried those first. However, I run out of
>> memory building both the set and the dictionary, so what I seem to be left
>> with is the list, and using the in method.
>>
>> I imagine that there should be a faster/better way than this?
>>
>
>Shouldn't a set with 83 million 14 character strings be fine in memory
>on a stock PC these days? I suppose if it's low on ram you might start
>swapping which will kill performance. Perhaps the method you're using to
>build the data structures creates lots of garbage? How much ram do you
>have and how much memory does the python process use as it builds your
>data structures?

And if this is practical there should be no swapping problems,
as the working set will be a small fraction of the data used.

<SNIP>
>
>There are other algorithms you can use that have better theoretical
>performance than using bisect for this particular problem, but you get
>into trade offs between executing things in python as opposed to C if
>you start to hand code things in python.

There are a lot of possibility, but they depend a great deal on
secondary conditions, like how often the 83 M dataset changes.

Groetjes Albert

--
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert at spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst




More information about the Python-list mailing list