[Tim]
Something I haven't seen mentioned here, but I may have missed it: when timing algorithms with sets and dicts, people often end up merely measuring the bad effects of hardware data cache misses when the containers get large, especially in contrived benchmarks.
In those the base data tends to get built up in a contiguous range of memory, and storing the objects in a list leaves traversing the list fetching the objects' data in the same contiguous memory order. It's as cache-friendly as can be.
[Steven D'Aprano]
I'm not sure if this is something specific to timing tests, or if it will hold for real-world data in non-contrived code too.
Depends on the data access patterns of the app. There's really no simple way to know.

One possible clue: in recent CPythons, sys._debugmallocstats() can be called to get info about the state of Python's small object allocator. Compare the "# bytes in allocated blocks" output line to the total number of bytes across all arenas. If that ratio is close to 1.0, then at least we're _using_ most of the arena memory. The closer it gets to 0, the more the data is splattered around, with large gaps between used bytes.

Since CPython _never_ moves an object in memory, fragmentation can become nearly arbitrarily bad: as bad as using only 8 bytes out of each 256 KiB allocated (although I expect it would take effort to contrive a program in the same universe as being that bad).
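A sketch of how one might read that ratio programmatically. sys._debugmallocstats() writes to the C-level stderr stream rather than Python's sys.stderr, so capturing its output needs a file-descriptor-level redirect. The helper name mallocstats_text is mine, and the exact output format is a CPython implementation detail that can change between versions:

```python
import os
import sys
import tempfile

def mallocstats_text():
    # sys._debugmallocstats() prints via C stdio to fd 2, so redirect
    # the descriptor itself rather than swapping sys.stderr.
    with tempfile.TemporaryFile(mode="w+") as tmp:
        saved_fd = os.dup(2)
        try:
            os.dup2(tmp.fileno(), 2)
            sys._debugmallocstats()
        finally:
            os.dup2(saved_fd, 2)
            os.close(saved_fd)
        tmp.seek(0)
        return tmp.read()

text = mallocstats_text()
# Compare the "# bytes in allocated blocks" line against the total
# bytes across all arenas; a ratio near 1.0 means arena memory is
# mostly in use, while a ratio near 0 suggests heavy fragmentation.
for line in text.splitlines():
    if "allocated blocks" in line or "arenas" in line:
        print(line)
```

Parsing these lines into numbers is left as an exercise, since the layout is deliberately undocumented.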
If it holds for real data, then presumably it counts as an advantage of lists over sets. But if it is an artifact of naive/contrived timing tests, is there something we can do to make the tests less naive?
What's the first rule of benchmarking?

    RUN ON A MACHINE AS QUIET AS POSSIBLE

So it's deliberately and spectacularly naive. Benchmark results should really be viewed as lower bounds: guarantees that no related real-life code will ever run faster, and almost always slower ;-) They remain the easiest way to guess whether one approach "is likely to" run faster than another, though.

An easy example of major cache effects:

    xs = list(range(1000000))
    sum(xs)

Then do the same, but use random.shuffle(xs) before summing. Just timing the `sum(xs)` part, a quick timeit run on my non-quiet box just now showed the latter taking well over 4x longer. The original spelling tends to access RAM (for the guts of the int objects) in sequential order, but shuffle first and sum() leaps all over RAM to get from one int to the next. The memory layout is the same either way, so staring at obmalloc stats won't yield any clue in that case.
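Tim's experiment can be reproduced with the timeit module. The "well over 4x" figure is machine- and version-dependent, so this sketch just reports both timings rather than promising any particular ratio:

```python
import random
import timeit

xs = list(range(1_000_000))

# The ints from range() are allocated in roughly sequential memory
# order, and the fresh list traverses them in that same order.
t_seq = timeit.timeit("sum(xs)", globals={"xs": xs}, number=10)

# Shuffling changes only the traversal order: same objects, same
# memory layout (obmalloc stats are identical), but sum() now leaps
# all over RAM to get from one int to the next.
random.shuffle(xs)
t_shuf = timeit.timeit("sum(xs)", globals={"xs": xs}, number=10)

print(f"sequential: {t_seq:.3f}s  shuffled: {t_shuf:.3f}s")
```

On a quiet machine the shuffled traversal should come out noticeably slower, illustrating that the cost is in the data cache misses, not in any difference the allocator stats could reveal.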