[Richard Higginbotham]
If you start including the time it takes to convert lists to sets, it's even more pronounced. Hash values can collide, and the bigger the data set, the more likely that is to happen.
That's not so. Python's hash tables dynamically resize so that the load factor never exceeds 2/3. The worst case then is a "large" table as full as it's allowed to get before Python grows its size. In that case, the average number of probes for a successful search is 1.65, and for a failing search 3.00. Those figures hold no matter how large the tables get. For very small tables, the average values are lower, but 1.65/3.00 is good to the full 3 digits already at tables with 4096 slots (which can hold at most 2730 items). Of course there are pathological cases, but they're exceedingly unlikely to occur when using strings as keys, because Python's string hashes are a good approximation to (pseudo)random.
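Since the expected probe count stays bounded, set lookups stay O(1) on average even counting a one-time list-to-set conversion, provided you do more than a handful of lookups. A quick sketch (the sizes and repetition counts here are arbitrary choices for illustration):

```python
import timeit

data = list(range(10_000))
needle = -1  # absent element: the worst case for a linear list scan

# Repeated membership tests against the list: O(n) each.
list_time = timeit.timeit(lambda: needle in data, number=1000)

# Pay the conversion cost once, then each lookup is O(1) on average.
as_set = set(data)
set_time = timeit.timeit(lambda: needle in as_set, number=1000)

print(set_time < list_time)  # the set wins by a wide margin at this size
```

For a single lookup the conversion can dominate, but the break-even point comes quickly as the number of lookups grows.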
Perfect hash functions don't exist.
Of course they do, but it's impractically expensive to construct a perfect hash function for dynamically changing data. In Python, it so happens that when dict keys (or sets) consist of a contiguous range of integers (range(J, J+N) for some J and N), the integer hash function usually IS "perfect": no collisions at all. That's intentional and unusual, and follows in part from that hash(J) == J for "not huge" positive integers J.

Something I haven't seen mentioned here, but I may have missed it: when timing algorithms with sets and dicts, people often end up merely measuring the bad effects of hardware data cache misses when the containers get large, and especially in contrived benchmarks. In those the base data tends to get built up in a contiguous range of memory, and storing the objects in a list leaves traversing the list fetching the objects' data in the same contiguous memory order. It's as cache-friendly as can be. Traversing via a set or dict instead reads the objects' data in pseudo-random memory order instead, which can give a very high rate of cache misses. In addition, the internal hash table probes also occur in pseudo-random order, adding yet another layer of cache misses.

Plain old lists are in fact pleasant in many ways :-)
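The hash(J) == J behavior is easy to check for yourself (this is a CPython implementation detail, not a language guarantee; for huge integers the hash is reduced modulo a Mersenne prime, and hash(-1) is special-cased):

```python
# CPython detail: small positive ints hash to themselves, so a
# contiguous range of int keys produces no hash collisions at all.
identity_hashing = all(hash(j) == j for j in range(1, 1000))
print(identity_hashing)  # True under CPython
```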