[Richard Higginbotham]
If you start including the time it takes to convert lists to sets, it's even more pronounced. Hash values can collide, and the bigger the data set, the more likely that is to happen.
That's not so. Python's hash tables dynamically resize so that the load factor never exceeds 2/3. The worst case then is a "large" table as full as it's allowed to get before Python grows its size. In that case, the average number of probes for a successful search is 1.65, and for a failing search 3.00. Those figures hold no matter how large the tables get. For very small tables, the average values are lower, but 1.65/3.00 is good to the full 3 digits already at tables with 4096 slots (which can hold at most 2730 items). Of course there are pathological cases, but they're exceedingly unlikely to occur when using strings as keys, because Python's string hashes are a good approximation to (pseudo)random.
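Since the expected probe count stays bounded, set lookups stay O(1) on average even counting a one-time list-to-set conversion, provided you do more than a handful of lookups. A quick sketch (the sizes and repetition counts here are arbitrary choices for illustration):

```python
import timeit

data = list(range(10_000))
needle = -1  # absent element: the worst case for a linear list scan

# Repeated membership tests against the list: O(n) each.
list_time = timeit.timeit(lambda: needle in data, number=1000)

# Pay the conversion cost once, then each lookup is O(1) on average.
as_set = set(data)
set_time = timeit.timeit(lambda: needle in as_set, number=1000)

print(set_time < list_time)  # the set wins by a wide margin at this size
```

For a single lookup the conversion can dominate, but the break-even point comes quickly as the number of lookups grows.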
Perfect hash functions don't exist.
Of course they do, but it's impractically expensive to construct a perfect hash function for dynamically changing data. In Python, it so happens that when dict keys (or sets) consist of a contiguous range of integers (range(J, J+N) for some J and N), the integer hash function usually IS "perfect": no collisions at all. That's intentional and unusual, and follows in part from that hash(J) == J for "not huge" positive integers J.

Something I haven't seen mentioned here, but I may have missed it: when timing algorithms with sets and dicts, people often end up merely measuring the bad effects of hardware data cache misses when the containers get large, and especially in contrived benchmarks. In those the base data tends to get built up in a contiguous range of memory, and storing the objects in a list leaves traversing the list fetching the objects' data in the same contiguous memory order. It's as cache-friendly as can be. Traversing via a set or dict instead reads the objects' data in pseudo-random memory order instead, which can give a very high rate of cache misses. In addition, the internal hash table probes also occur in pseudo-random order, adding yet another layer of cache misses.

Plain old lists are in fact pleasant in many ways :-)
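The hash(J) == J behavior is easy to check for yourself (this is a CPython implementation detail, not a language guarantee; for huge integers the hash is reduced modulo a Mersenne prime, and hash(-1) is special-cased):

```python
# CPython detail: small positive ints hash to themselves, so a
# contiguous range of int keys produces no hash collisions at all.
identity_hashing = all(hash(j) == j for j in range(1, 1000))
print(identity_hashing)  # True under CPython
```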