[Tutor] Clarification questions about how Python uses references.

Thu Jun 24 19:12:56 EDT 2021

On 24Jun2021 12:36, boB Stepp <robertvstepp at gmail.com> wrote:
>On Thu, Jun 24, 2021 at 3:10 AM Cameron Simpson <cs at cskk.id.au> wrote:
>> They're related terms.
>
>> A hash function takes a source value (eg a string such as you might use
>> to index a dict) or, really, anything else, and computes a value,
>> normally an integer. It is always the same for a given source value.
>
>Then I suppose "MD5" and similar terms refer to the algorithm being
>employed to come up with this mapping of values.

Yes. Hash functions are one way (see your collision stuff below) - 
because multiple source values can have the same hash value the hash 
value does not reverse to the source value - notionally it would reverse 
to many potential source values.

A cryptographic hash (MD, SHA in their flavours) are very strongly of 
this form - not reversible, and a single bit change in the source value 
would usually flip half the bits in the hash value (which is the same as 
_randomly_ flipping or not flipping all the bits, which means the values 
are very widely spread over their range). I say 'randomly" - obviously 
not, but to an observer of _only_ the hash values there's not supposed 
to be a pattern from which to make deductions.

>> The usual objective is that the various source values are evenly
>> distributed over some range.
>>
>> In the case of a "hash table", this is what gives dicts and sets O(1)
>> lookup times. When you store things in a hash table, the table has an
>> array of "slots" where things get stored according to their key.
>> You compute a hash value which dicates the slot a particular item is
>> stored in. Then to see if an item is present, you only need to to look
>> at the items stores in that particular slot, not at every item in the
>> table as a whole.
>>
>> For maximum speed the number of slots is around the number of items
>> stored, so that the number of items in a slot is around 1. Obviously
>> that depends a bit on the source values, and this is why hash functions
>> are chosen to produce evenly distributes values - to get a fairly even
>> spread of slot usage regardless of the source values.
>
>So if it should happen that more than one value can get assigned to a
>slot, this is what "hash collision" is referring to?  I vaguely recall
>reading concerns over potential hashing collision issues.  I don't
>recall what hashing algorithm that reading was talking about, but it
>seems it was a potential serious security concern.

It is, in the cryptographic world.

Let's talk hash tables first. You've got a table of slots indexed by the 
hash function (or the hash function modulo the width of the table, to 
size it). Obviously multiple source values can map to the same slot. For 
good lookup performance (a) you size the hash table to be wide enough 
for the number of values stored so that usually there are very few 
entries per slot, given even distribution and (b) you want a hash 
function which does a nice even distribution from typical source values.

Note that for a dict or set, this implies that as it grows the hash 
table gets resized every so often to keep the number of slot entries 
low. This amounts to making a new table, rerunning the hash for each key 
modulo the new width, and putting the keys in new places. That also 
means the collisions move around - if the main hash function gave 47 and 
33 and they collided in a 5 slot table (47 % 7 == 5, 33 % 7 == 5), if 
the table grew and we made 17 slots they wouldn't collide. But other 
hash values would.

The collisions are why you want a hash function to have a nice even 
spread of values - that spreads the source values across slots evenly 
producing evenly shallow slows. Obviously you could invent pathological 
sets of source keys that all collide for a particular hash function and 
slot count. The objective is that typical use doesn't do that.

There is a thing called a "perfect hash function", which distributes 
every source value to distinct slots. You need to know all the source 
values to create one of them. But it has the property that (a) you have 
no collisions or unused slots and (b) you can compare the hash values 
instead of comparing the keys.

Because the slots are expected to have few members, you can put a very 
simple variable sized data structure in each slot, eg a linked list, 
because you expect it to be very short. Collisions make that bad, so we 
try to avoid them.

Back to cryptography and why collisions are very important there.

"Cryptographic" hash functions have a heap of desireable features:
- very resistant to reversal - you should not be able to predict the 
  source value (or any potential source value) from the hash value,
  meaning you can pass them around without fear of revealing source 
  values
- constant time and power (various crypto functions can be attacked when 
  their run time varies with their input data)
- very even distribution (see the bit flip feature above)
- a very wide result domain - even in heavy use you should not expect to 
  find two source values with the same hash value during the lifetime of 
  the universe

This last feature has a couple of implications:

We often use the hash as a proxy for the source key - you're already 
guarenteed that different hash values imply different source values; all 
hash functions have that property. But if collisions are astronomically 
improbable, then if the hash values are the same you can infer very 
reliably that the source values are the same (doesn't tell you the 
source value, note). That is also the basis for cryptographic 
signatures; if you hash a document and get the signature, you are very 
sure that the document has not been tampered with.

And this is the second implication: if you _can_ find a pair of source 
values with the same cryptographic hash value, the function is probably 
broken for crypto purposes. For example, such a collision would allow 
you to make a different document with the same signature, allowing 
forgery. Etc.

MD4, Md5 and SHA1 are all now too weak. Some because of a weakness in 
the algorithm (eg you can derive collisions, or at least make a targeted 
search) or because of compute - you can do a brute force search with 
enough resources.

The (statistical) isomorphic property of a cryptographic hash function 
is also used for content based storage. If you want to deduplicate input 
data to store only one copy of data blocks with the same content (eg in 
a domain where that is common, eg a bunch of virtual machine images with 
OSes stored in them, etc etc) you can make an index of blocks keyed by 
their hash value. If the hash value's in the index (a) you know you've 
got this block already and (b) you know where you put it. So you don't 
need another copy. Deduplication achieved.

Cheers,
Cameron Simpson <cs at cskk.id.au>