matching strings in a large set of strings
lists at cheimes.de
Fri Apr 30 11:10:03 CEST 2010
>>>> s = "12345678901234"
>>>> assert len(s) == 14
>>>> import sys
>>>> sys.getsizeof(s)
> 38
> So a single 14 char string takes 38 bytes.
Make that at least 40 bytes. You have to take memory alignment into account.
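For reference, this is easy to measure yourself. A minimal sketch (the exact number is interpreter-, build- and version-specific; 38 bytes matches a 32-bit CPython 2.x, while 64-bit CPython 3.x has a larger per-object header):

```python
import sys

s = "12345678901234"
assert len(s) == 14

# Per-object footprint: object header plus character data. The value
# depends on Python version, 32- vs 64-bit build, and string contents.
size = sys.getsizeof(s)
print(size)

# Note: getsizeof() does not account for allocator overhead, so the
# real heap cost is typically rounded up to the allocator's alignment
# (a multiple of 8 or 16 bytes) -- hence "at least 40 bytes".
```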
> So a set with 83000 such strings takes approximately 1 MB. So far fairly
> trivial. But that's just the memory used by the container (the set), not
> the contents. 38 bytes * 83,000 strings = another 3 MB. Which of course
> is trivial for a modern PC, but the OP is using 83 million such strings,
> not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
> entry level desktop PC these days is generally 2GB, and entry level
> notebooks might have half a gig.
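Putting numbers on that back-of-the-envelope estimate (using 40 bytes per string object as discussed above, plus one 8-byte pointer per set slot on a 64-bit build; both figures are rough and version-dependent):

```python
n = 83_000_000        # strings, per the original poster
per_string = 40       # approx. bytes per 14-char string object (version-dependent)
per_slot = 8          # a set slot holds at least one pointer on 64-bit builds

total = n * per_string + n * per_slot
print(total / 2**30)  # roughly 3.7 GiB, before hash-table overallocation
```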
You are pretty much screwed on a 32bit system here. In my experience a
32bit process can't store more than 2.5 to 2.8 GB on the heap. Eventually
malloc() will fail, since large parts of the 4 GB address space are
reserved for other things: the stack, the entry point, shared library
mappings, error detection and so on. Memory fragmentation isn't even the
issue here.
* Use a SQL database with an index on the data column. The index can
optimize the "starting with" case.
* You need to search for a string inside a large set of texts? Sounds
like a job for a full-text search engine! Grab PyLucene and index your
data in a Lucene index. An SSD helps a lot here.
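The first suggestion can be sketched with the sqlite3 module from the standard library (table and column names here are made up for illustration). With an index on the column, a "starting with" lookup can run as a B-tree range scan instead of a full table scan; the half-open range below is equivalent to LIKE '1234%' but does not depend on SQLite's LIKE optimisation being enabled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings (value TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_value ON strings (value)")
conn.executemany(
    "INSERT INTO strings (value) VALUES (?)",
    [("12345678901234",), ("12349999999999",), ("99999999999999",)],
)

# Prefix search as a range query: everything >= '1234' and < '1235',
# where '1235' is the prefix with its last character incremented.
prefix = "1234"
upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
rows = conn.execute(
    "SELECT value FROM strings WHERE value >= ? AND value < ? ORDER BY value",
    (prefix, upper),
).fetchall()
print([r[0] for r in rows])  # ['12345678901234', '12349999999999']
```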