On Thu, Sep 19, 2019 at 10:25:20PM -0500, Tim Peters wrote: [...]
> Something I haven't seen mentioned here, but I may have missed it:
> when timing algorithms with sets and dicts, people often end up merely
> measuring the bad effects of hardware data cache misses when the
> containers get large, and especially in contrived benchmarks.
>
> In those, the base data tends to get built up in a contiguous range of
> memory, and storing the objects in a list leaves traversing the list
> fetching the objects' data in the same contiguous memory order. It's
> as cache-friendly as can be.
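Tim's effect is easy to reproduce with a toy benchmark along these lines (my own rough sketch, not from Tim's post; absolute timings vary by machine and CPython version, and the gap only shows once the data outgrows the caches):

    import random
    import time

    N = 1_000_000

    # Strings created back-to-back tend to land at (roughly) ascending
    # heap addresses, so walking the list in creation order visits their
    # data in a cache-friendly sequence.
    data = [str(i) for i in range(N)]

    # The same objects and the same list machinery, but the pointers are
    # visited in a scrambled memory order.
    shuffled = data[:]
    random.shuffle(shuffled)

    def checksum(seq):
        # Touch each string object (len dereferences the pointer).
        return sum(len(s) for s in seq)

    for name, seq in (("allocation order", data), ("shuffled order", shuffled)):
        start = time.perf_counter()
        checksum(seq)
        print(name, time.perf_counter() - start)

The shuffled traversal does the same Python-level work, so any slowdown is purely the memory access pattern.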
Tim's point is an interesting one, and difficult to deduce without insight into the implementation. I would, for example, assume that there is a "list construct" which holds just the references to the actual objects (the strings in our example), and the same for the dict (a hash map), so that the actual objects could be anywhere in memory while the "list" or "dict" itself sits in one contiguous block. Did you mean that, or that the objects themselves are also in the "same place"?

I believe there are two factors which may affect the performance: a cache miss on the actual data objects (e.g. the strings), and a cache miss on the "construct" object itself (i.e. the list or dict). But I have no idea how big lists or dicts of the same length are, or even whether the assumptions above are correct.
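As for the container sizes, sys.getsizeof gives a first approximation: it reports the container object itself and does not follow the references it holds (a rough sketch; exact byte counts vary by CPython version and platform):

    import sys

    keys = [str(i) for i in range(1000)]

    lst = list(keys)
    dct = dict.fromkeys(keys)

    # getsizeof reports the container only: the array of pointers plus
    # the object header for the list, the hash table for the dict.
    print(sys.getsizeof(lst))  # roughly 8 KB on 64-bit CPython (1000 pointers)
    print(sys.getsizeof(dct))  # several times larger: the table keeps spare slots

    # The referenced string objects are not included and must be counted
    # separately.
    print(sum(sys.getsizeof(s) for s in keys))

So a dict of the same length is noticeably bigger than a list, mostly because the hash table is kept sparse to limit collisions.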
On Fri, Sep 20, 2019 at 10:43 AM Steven D'Aprano <steve@pearwood.info> wrote:

> I'm not sure if this is something specific to timing tests, or if it
> will hold for real-world data in non-contrived code too.
The only thing which could affect it on "real data" is the execution environment: for example, if the cores/threads are loaded and the operations get preempted by the scheduler (which would also invalidate the cache). Depending on how frequent those invalidations are, the "fits into cache" condition could matter less. But that would depend on the system scheduler and the platform configuration (which may differ between a server and a typical desktop machine). Technically, though, the difference should not be any bigger on real data.

Richard