[Tutor] check against multiple variables

Fri Jul 20 11:01:16 CEST 2012

On 19 July 2012 23:20, Steven D'Aprano <steve at pearwood.info> wrote:
> Selby Rowley-Cannon wrote:
>>
>> I am using a hash table in a small randomization program. I know that some
>> hash functions can be prone to collisions, so I need a way to detect
>> collisions.

> This entire question seems like a remarkable case of premature optimization.
> Start with demonstrating that collisions are an actual problem that need
> fixing.

Unless you know all the keys in advance and have consructed a proven
perfect hash, potential collisions are something that *any* hash table
implementation fundamentally *has* to deal with, no matter how small
the probability of a collision.  (If there's a non-zero probability of
collision you *have* to deal with collisions.)

> Unless you have profiled your application and proven that hash collisions is
> a real problem -- and unless you are hashing thousands of float NANs, that
> is almost certainly not the case -- you are just wasting your time and
> making your code slower rather than faster -- a pessimation, not
> optimization.
>
> And if it *is* a problem, then the solution is to fix your data so that its
> __hash__ method is less likely to collide. If you are rolling your own hash
> method, instead of using one of Python's, that's your first problem.

Unfortunately you usually can't prescribe the data to be stored in a
hash table to accomodate a given hashing function.  ;)  You can
however of course improve your hashing function, or (within reason)
increase the hash table size.

But yes, rolling your own is not a good idea given highly optimized
solutions already exist inside the Python language (dict)  Except of
course if you're studying hash tables as part of some course...

Selby, can you clarify please, are you doing this as part of an
assignment or some coursework?  I can understand your question in the
context of learning/computer studies, but otherwise you should just
use Python's dict.

In any case, to somewhat answer you question, dealing with collisions
in hash tables typically involves some variant of one of the following
2 approaches: 1) Each slot in the hash table is a "bucket", usually
implemented as a list (but might be something else), so you do a
search through the hash bucket to see if the key you'r looking for is
in fact in the "bucket" or 2) You have each slot in the hash table be
a single key/value pair and then once you've hashed you look to see if
the key at the entry identified is the key you're looking for.  If not
(collision), you hash ("probe") again by some factor (modulo the size
of you hash table) and look at the next slot identified etc until you
either find the key you're looking for, or an empty slot.  There many
variations one could come up with.  Have a read of this page on
wikipedia: http://en.wikipedia.org/wiki/Hash_table

Walter