matching strings in a large set of strings

Paul Rudin paul.nospam at rudin.co.uk
Fri Apr 30 04:50:42 EDT 2010


Duncan Booth <duncan.booth at invalid.invalid> writes:

> Paul Rudin <paul.nospam at rudin.co.uk> wrote:
>
>> Shouldn't a set with 83 million 14 character strings be fine in memory
>> on a stock PC these days? I suppose if it's low on ram you might start
>> swapping which will kill performance. Perhaps the method you're using
>> to build the data structures creates lots of garbage? How much ram do
>> you have and how much memory does the python process use as it builds
>> your data structures?
>
> Some simple experiments should show you that a stock PC running a 32 bit 
> Python will struggle:
>
> >>> import sys
> >>> s = "12345678901234"
> >>> sys.getsizeof(s)
> 38
> >>> 83*38
> 3154
>
> So more than 3GB just for the strings (and that's for Python 2.x; on
> Python 3.x you'll need nearly 5GB).
>
> Running on a 64 bit version of Python should be fine, but for a 32 bit 
> system a naive approach just isn't going to work.

It depends - a few gig of RAM can be cheap compared with programmer
time. If you know you can solve a problem by spending a few euros on
extra RAM, that can be a good solution! It depends, of course, on where
the code is being deployed. If it's intended to be deployed widely, you
can't expect thousands of people to go out and buy more RAM; but for a
one-off deployment in a particular environment it can be the best way
to go.
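
For anyone wanting to check how much RAM that would actually be on their
own interpreter, here is a rough sketch of the estimate quoted above. The
helper names (estimate_string_memory, set_overhead_per_entry) are made up
for illustration, and the figures depend on the Python version and on
whether the build is 32 or 64 bit, so treat the result as a ballpark lower
bound rather than a measurement of the real program.

```python
import sys

def estimate_string_memory(sample, count):
    """Rough size in bytes of `count` distinct strings shaped like `sample`.

    Counts only the string objects themselves, not any container.
    """
    return sys.getsizeof(sample) * count

def set_overhead_per_entry(n=100000):
    """Average bytes of hash table per element for a set of n strings.

    sys.getsizeof on a set reports the set's own table, not its elements,
    so this is the cost on top of the strings. Hypothetical helper.
    """
    probe = set("%014d" % i for i in range(n))
    return sys.getsizeof(probe) / float(len(probe))

sample = "12345678901234"   # 14 characters, as in the thread
count = 83 * 10**6          # 83 million strings

total = estimate_string_memory(sample, count)
print("%d bytes per string" % sys.getsizeof(sample))
print("about %.2f GB for the strings alone" % (total / 1e9))
print("plus roughly %.0f bytes per entry for the set itself"
      % set_overhead_per_entry())
```

If the total comes out well above what the machine has free, that is the
point at which buying RAM (or moving to a 64 bit Python) gets weighed
against a more compact data structure.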
