removing duplication from a huge list.

Tim Rowe digitig at gmail.com
Fri Feb 27 12:54:21 EST 2009


2009/2/27 Steve Holden <steve at holdenweb.com>:

> Assuming no duplicates, how does this help? You still have to verify
> collisions.
>
>> Pretty brutish and slow, but it's the first algorithm which comes to
>> mind. Of course, I'm assuming that the list items are long enough to
>> warrant using a hash and not the values themselves.
>>
> The problem is you can't guarantee any hash value will be unique, so
> ultimately you have to check whether list entries hashing to the same
> value are or are not equal. Which would mean either having the whole
> list in memory or having access to it via some other random access method.

You don't need random access to the whole list, just to the records
that collide on a hash. Although you do end up using the external
storage as random access, which can be very slow, you've potentially
reduced the number of such random accesses drastically, which should
be a big saving. The best case is when the number of collisions is low
enough for the random access to be handled in buffering, avoiding
wasteful seeks of the external storage, in which case it's a complete
win. Remember that if the records with identical hashes are not
collisions but genuine duplicates, you can throw them away as soon as
they're checked, so with some clever buffering you can stop them from
clogging up the buffer. The worst case is when there are a lot of
genuine collisions, in which case it's probably not a very good hash.
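A minimal sketch of that idea in Python, assuming the records live one
per line in a file, that the set of hash digests (though not the
records themselves) fits in memory, and using made-up names in_path
and out_path; the real data layout, hash choice, and buffering would
differ:

    import hashlib
    from collections import defaultdict

    def dedupe_large_file(in_path, out_path):
        """Two-pass sketch: hash every record, then seek back only to
        the records whose hashes collide to tell true duplicates from
        hash collisions. Output order is not strictly preserved."""
        # Pass 1: record the byte offset of every line, grouped by hash.
        offsets_by_hash = defaultdict(list)
        with open(in_path, "rb") as f:
            offset = 0
            for line in f:
                digest = hashlib.sha1(line).digest()
                offsets_by_hash[digest].append(offset)
                offset += len(line)

        # Pass 2: write one copy of each distinct record. A hash seen
        # only once needs no further checking; a bucket with several
        # offsets forces seeks back to the stored records so the actual
        # bytes can be compared.
        with open(in_path, "rb") as f, open(out_path, "wb") as out:
            for digest, offsets in offsets_by_hash.items():
                if len(offsets) == 1:
                    f.seek(offsets[0])
                    out.write(f.readline())
                    continue
                seen = []  # actual record bytes for this hash bucket
                for off in offsets:
                    f.seek(off)          # random access only on collision
                    record = f.readline()
                    if record not in seen:   # duplicate vs. collision
                        seen.append(record)
                        out.write(record)

The expensive seeks happen only for the (hopefully rare) buckets
holding more than one record, which is the saving described above.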

-- 
Tim Rowe
