[Tutor] Clarification questions about how Python uses references.
Cameron Simpson
cs at cskk.id.au
Thu Jun 24 19:12:56 EDT 2021
On 24Jun2021 12:36, boB Stepp <robertvstepp at gmail.com> wrote:
>On Thu, Jun 24, 2021 at 3:10 AM Cameron Simpson <cs at cskk.id.au> wrote:
>> They're related terms.
>
>> A hash function takes a source value (eg a string such as you might use
>> to index a dict) or, really, anything else, and computes a value,
>> normally an integer. It is always the same for a given source value.
>
>Then I suppose "MD5" and similar terms refer to the algorithm being
>employed to come up with this mapping of values.
Yes. Hash functions are one way (see your collision stuff below) -
because multiple source values can have the same hash value the hash
value does not reverse to the source value - notionally it would reverse
to many potential source values.
A cryptographic hash (MD, SHA in their flavours) are very strongly of
this form - not reversible, and a single bit change in the source value
would usually flip half the bits in the hash value (which is the same as
_randomly_ flipping or not flipping all the bits, which means the values
are very widely spread over their range). I say 'randomly" - obviously
not, but to an observer of _only_ the hash values there's not supposed
to be a pattern from which to make deductions.
>> The usual objective is that the various source values are evenly
>> distributed over some range.
>>
>> In the case of a "hash table", this is what gives dicts and sets O(1)
>> lookup times. When you store things in a hash table, the table has an
>> array of "slots" where things get stored according to their key.
>> You compute a hash value which dicates the slot a particular item is
>> stored in. Then to see if an item is present, you only need to to look
>> at the items stores in that particular slot, not at every item in the
>> table as a whole.
>>
>> For maximum speed the number of slots is around the number of items
>> stored, so that the number of items in a slot is around 1. Obviously
>> that depends a bit on the source values, and this is why hash functions
>> are chosen to produce evenly distributes values - to get a fairly even
>> spread of slot usage regardless of the source values.
>
>So if it should happen that more than one value can get assigned to a
>slot, this is what "hash collision" is referring to? I vaguely recall
>reading concerns over potential hashing collision issues. I don't
>recall what hashing algorithm that reading was talking about, but it
>seems it was a potential serious security concern.
It is, in the cryptographic world.
Let's talk hash tables first. You've got a table of slots indexed by the
hash function (or the hash function modulo the width of the table, to
size it). Obviously multiple source values can map to the same slot. For
good lookup performance (a) you size the hash table to be wide enough
for the number of values stored so that usually there are very few
entries per slot, given even distribution and (b) you want a hash
function which does a nice even distribution from typical source values.
Note that for a dict or set, this implies that as it grows the hash
table gets resized every so often to keep the number of slot entries
low. This amounts to making a new table, rerunning the hash for each key
modulo the new width, and putting the keys in new places. That also
means the collisions move around - if the main hash function gave 47 and
33 and they collided in a 5 slot table (47 % 7 == 5, 33 % 7 == 5), if
the table grew and we made 17 slots they wouldn't collide. But other
hash values would.
The collisions are why you want a hash function to have a nice even
spread of values - that spreads the source values across slots evenly
producing evenly shallow slows. Obviously you could invent pathological
sets of source keys that all collide for a particular hash function and
slot count. The objective is that typical use doesn't do that.
There is a thing called a "perfect hash function", which distributes
every source value to distinct slots. You need to know all the source
values to create one of them. But it has the property that (a) you have
no collisions or unused slots and (b) you can compare the hash values
instead of comparing the keys.
Because the slots are expected to have few members, you can put a very
simple variable sized data structure in each slot, eg a linked list,
because you expect it to be very short. Collisions make that bad, so we
try to avoid them.
Back to cryptography and why collisions are very important there.
"Cryptographic" hash functions have a heap of desireable features:
- very resistant to reversal - you should not be able to predict the
source value (or any potential source value) from the hash value,
meaning you can pass them around without fear of revealing source
values
- constant time and power (various crypto functions can be attacked when
their run time varies with their input data)
- very even distribution (see the bit flip feature above)
- a very wide result domain - even in heavy use you should not expect to
find two source values with the same hash value during the lifetime of
the universe
This last feature has a couple of implications:
We often use the hash as a proxy for the source key - you're already
guarenteed that different hash values imply different source values; all
hash functions have that property. But if collisions are astronomically
improbable, then if the hash values are the same you can infer very
reliably that the source values are the same (doesn't tell you the
source value, note). That is also the basis for cryptographic
signatures; if you hash a document and get the signature, you are very
sure that the document has not been tampered with.
And this is the second implication: if you _can_ find a pair of source
values with the same cryptographic hash value, the function is probably
broken for crypto purposes. For example, such a collision would allow
you to make a different document with the same signature, allowing
forgery. Etc.
MD4, Md5 and SHA1 are all now too weak. Some because of a weakness in
the algorithm (eg you can derive collisions, or at least make a targeted
search) or because of compute - you can do a brute force search with
enough resources.
The (statistical) isomorphic property of a cryptographic hash function
is also used for content based storage. If you want to deduplicate input
data to store only one copy of data blocks with the same content (eg in
a domain where that is common, eg a bunch of virtual machine images with
OSes stored in them, etc etc) you can make an index of blocks keyed by
their hash value. If the hash value's in the index (a) you know you've
got this block already and (b) you know where you put it. So you don't
need another copy. Deduplication achieved.
Cheers,
Cameron Simpson <cs at cskk.id.au>
More information about the Tutor
mailing list