Orders of magnitude

Christian Tismer tismer at stackless.com
Mon Mar 29 21:53:13 CEST 2004

Christian Tismer wrote:

> Robert Brewer wrote:
>> I wrote:
>> I'm dedup'ing a 10-million-record dataset, trying different approaches
>> for building indexes. The in-memory dicts are clearly faster, but I get
>> Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
>> other ways to build a large index without slowing down by a factor of
>> 25?
>> ...and got replies:
> Here is another one.
> If you believe in statistics, then you might consider
> to build a dictionary of SHA digests, only. The chance
> that you get a hash clash is very, very small, if you use
> the sha 160 bit algorith as in the sha module.

A slight change to the idea which goes rather easily
with standard Python and very small memory, if you have
the disk space:

Phase one opens, say, 16 files for writing.
In a big loop, you calculate the sha key for every
record, and then you form an index from the first 4 bits
of the key. Then you write the record into its file.

Then you have 16 files with no duplicates between them.
You now process every of them using the sha algorithm, again,
and you have no memory problem at all.

ciao - chris

Christian Tismer             :^)   <mailto:tismer at stackless.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship* http://starship.python.net/
14109 Berlin                 :     PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34  home +49 30 802 86 56  mobile +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/

More information about the Python-list mailing list