Orders of magnitude

Christian Tismer tismer at stackless.com
Mon Mar 29 20:45:55 EST 2004


Robert Brewer wrote:

> Christian Tismer wrote:
> 
>>Robert Brewer wrote:
>>
>>>I'm dedup'ing a 10-million-record dataset, trying different approaches
>>>for building indexes. The in-memory dicts are clearly faster, but I get
>>>Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any 

>>So, here we go.
>>The attached script processes a gigabyte of data, one
>>million records, in about a minute on my machine, and
>>finds the single duplicate.
> 
> Thanks, Christian! Looks like someone's been having fun. ;)

Hey, I love to help. And this is no fun, this is my life.
Had to do urgent paid work today, but this really got me
interested. :-)
Please use the second file on your problem; it is already set up
for your use case.

> But I don't see how your solution avoids the problem, however (some code
> snipped):
> 
>     def process_duplicates(self):
>         for i in range(len(self.files)):
>             seen = {}
>             has = seen.has_key
>             for k in range(self.flushes):
>                 bin = pickle.load(self.files[i])
>                 for key, value in bin:
>                     if has(key):
>                         self.duplicates += 1
>                     else:
>                         seen[key] = value

Just try it, it works!

Here is why it works (by the way, you are not using my most recent source):

The first step processes all the hash keys and distributes them
by their first byte.

That means the first byte sorts everything into 256 different
bins. If any two of those keys are equal, then their first bytes
must be equal as well. The simple trick is to group all keys with
the same starting byte into the same file.
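
To make that concrete, here is a tiny sketch of the first pass
(not my actual script; md5, the file names and the string keys are
just placeholders for illustration):

import hashlib
import pickle

def distribute(records, prefix="bin"):
    """Scatter (key, value) records into 256 files by first hash byte."""
    # One file per possible first byte; a sketch only -- the real script
    # buffers records and pickles them in batches.
    files = [open("%s_%03d.pck" % (prefix, i), "wb") for i in range(256)]
    try:
        for key, value in records:
            first_byte = hashlib.md5(key.encode()).digest()[0]
            pickle.dump((key, value), files[first_byte])
    finally:
        for f in files:
            f.close()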

Here is the simple, simple rule that you need to swallow:
if two records are duplicates, then their two hashes will be
identical. And if these two hashes are identical, then their
first bytes will be identical, too.

By grouping all hashes with the same first byte into bins,
I can make sure that duplicates can exist only if they are
in the same bin.

In other words, it is necessary and sufficient to just check
each bin alone for duplicates.

That means that once I've built the bins, I don't have to look at
all keys at the same time. It is completely sufficient to look into one
bin at a time. Please see the algorithm above; it starts over with a
fresh "seen" dict for every bin. The second version of the algorithm does
the same thing, just with a single file, which is simply a
conglomerate of 256 files with disjoint starting positions of pickles.
But basically, it is the same.
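
And here is a matching sketch of the second pass, reading the bin
files written above (again just an illustration, not the posted
script): every bin gets its own brand-new "seen" dict, so no dict
ever holds more than one bin.

import pickle

def count_duplicates(prefix="bin"):
    duplicates = 0
    for i in range(256):
        seen = {}                      # fresh dict for every bin
        with open("%s_%03d.pck" % (prefix, i), "rb") as f:
            while True:
                try:
                    key, value = pickle.load(f)
                except EOFError:       # end of this bin file
                    break
                if key in seen:
                    duplicates += 1
                else:
                    seen[key] = value
    return duplicates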

> ...that is, the 'seen' dictionary is what is growing too large on my
> dataset; it ends up being about 7 million items. Actually, I have three
> such dictionaries (3 different fields which I'm checking for dupes for
> each record). The hashing might cut these sizes down a bit, but I think
> I'm still going to end up having to cache 'seen' (which I call 'indices'
> in my Directory.produce() method) somehow.

Your dict won't grow, since it is 256 disjoint dicts now.
There is a fresh "seen" dict for every bin.
Note: the bins are disjoint by construction.

And I have an additional algorithm in stock, which I wrote 20 years
ago, for sorting almost infinitely large datasets without prior
knowledge of the total record count. I can apply it here as well.
The idea is then to take more and more bytes into account, running
over the first file and distributing sub-keys into a second file.
This can be repeated ad infinitum, with two files of fixed size.
For your (simple, sorry) problem, one additional pass is sufficient.
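
Just to show the flavor of that idea, here is a toy, in-memory
version (made-up names; the real thing streams between two files
of fixed size instead of recursing on in-memory lists):

import hashlib
from collections import defaultdict

def find_duplicates(records, byte_pos=0, max_in_memory=1000000):
    # If the group fits comfortably, check it with a single set.
    if len(records) <= max_in_memory:
        seen = set()
        for key, value in records:
            if key in seen:
                yield key
            else:
                seen.add(key)
        return
    # Otherwise split by the next byte of the hash and recurse.
    bins = defaultdict(list)
    for key, value in records:
        digest = hashlib.md5(key.encode()).digest()
        bins[digest[byte_pos]].append((key, value))
    for group in bins.values():
        yield from find_duplicates(group, byte_pos + 1, max_in_memory)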

The only difference in this case is that I don't have to do
full sorting, just equality tests. This is why hashing is so
efficient. I didn't know this at the time and used full sorting
(Knuth's snow-plow-like thing, with a priority queue as a
multi-branch tree-of-losers sorter ...)

ciao - chris (stackless, but not only...)

p.s.: yes, I come from discrete math, but I'd like to see more IT
students getting deeper into it, since there is a huge gap
nowadays.

-- 
Christian Tismer             :^)   <mailto:tismer at stackless.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship* http://starship.python.net/
14109 Berlin                 :     PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34  home +49 30 802 86 56  mobile +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/




