[issue20446] ipaddress: hash similarities for ipv4 and ipv6

Sat Jun 21 03:03:57 CEST 2014

Josh Rosenberg added the comment:

Correct me if I'm wrong, but wouldn't this only become a concern if:

1. You're storing both IPv4 and IPv6 addresses side-by-side
2. You're storing well over a billion IP addresses
3. Hash codes for the hex string of an IP address were predictably sequential (they're not)

On point #3 alone, you can check for yourself. In a quick test within a single process on Python 3.4, hash(hex(0x1)) == 7060637827927985012 while hash(hex(0x2)) == -4275917786525356978 (your numbers may vary thanks to per process string hash seeding, but they should be quite random). As such, you couldn't easily fill more than two sequential buckets reliably; you could guarantee collision chaining occurs at least once (since as you noted, you can create two IP addresses with the same hash reliably), but the chains will be evenly distributed; you can't build on that to get a second, third, ..., nth collision.

There wouldn't be a meaningful "imbalance" between low and high IP addresses either; below a billion or so IP addresses, random chance would dictate the occasional hash code would collide, and you could guarantee that collisions with the sub-32 bit values would collide one extra time before finding an empty bucket, but I seem to recall a typical dict insertion involves 1-5 collisions already; adding one extra to every single dict insertion/lookup costs something, but it's not that much, and the scenarios required to take advantage of it would be incredibly contrived.

----------
nosy: +josh.rosenberg

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20446>
_______________________________________