matching strings in a large set of strings
Mark Tolonen
metolone+gmane at gmail.com
Thu Apr 29 07:00:43 EDT 2010
"Karin Lagesen" <karin.lagesen at bio.uio.no> wrote in message
news:416f727c6f5b0edb932b425db9579808.squirrel at webmail.uio.no...
> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the 'in' operator.
>
> I imagine that there should be a faster/better way than this?
You may be overthinking it. How fast does it need to be? I generated the
following file:
>>> f=open('test.txt','wt')
>>> for i in xrange(83000000):
...     f.write('0123456789ABCD\n')
...
>>> f.close()
It took about 15 seconds to search that file with a command-line search tool
(Windows, and a slow IDE hard drive):
findstr xyz test.txt
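(On a Unix-like system, grep xyz test.txt plays the same role.)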
It took about twice that to search via Python with 'in':
>>> for line in open('test.txt'):
...     if 'xyz' in line:
...         print line
...
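Note that 'in' on a line tests for a substring anywhere in it; since every
target string is exactly 14 characters, an exact-membership scan would
compare the stripped line instead. A sketch (the query value here is just
the string the test file above was filled with):

>>> query = '0123456789ABCD'
>>> for line in open('test.txt'):
...     if line.rstrip('\n') == query:  # exact match, not substring
...         print 'found'
...         break
...
found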
Reading only a line at a time has the advantage of using very little memory.
Storing 83 million 14-character strings in memory requires a minimum of
about 1.2GB (83,000,000 * 14 bytes), not counting per-object overhead or the
overhead of lists/sets/dictionaries.
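If repeated lookups eventually justify an in-memory structure, one way to
stay near that 1.2GB floor is to pack the sorted strings into a single blob
and binary-search fixed-width records, avoiding per-object overhead
entirely. This is a sketch, not from the original post; it assumes the
strings fit in RAM and contain no newlines:

def load_blob(filename, reclen=14):
    # Concatenate all fixed-length records into one sorted string.
    # Sorting 83 million strings in Python is itself memory-hungry,
    # so for real data you would likely pre-sort the file externally.
    return ''.join(sorted(line.rstrip('\n') for line in open(filename)))

def blob_contains(blob, query, reclen=14):
    # Standard lower-bound binary search over fixed-width records.
    lo, hi = 0, len(blob) // reclen
    while lo < hi:
        mid = (lo + hi) // 2
        if blob[mid * reclen:(mid + 1) * reclen] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo * reclen
    return blob[start:start + reclen] == query

After a one-time load, each lookup touches only a handful of 14-byte
slices:

>>> blob = load_blob('test.txt')
>>> blob_contains(blob, '0123456789ABCD')
True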
-Mark