On Thu, Sep 19, 2019 at 10:25:20PM -0500, Tim Peters wrote: [...]
> Something I haven't seen mentioned here, but I may have missed it:
> when timing algorithms with sets and dicts, people often end up merely
> measuring the bad effects of hardware data cache misses when the
> containers get large, and especially in contrived benchmarks.
>
> In those, the base data tends to get built up in a contiguous range of
> memory, and storing the objects in a list leaves traversing the list
> fetching the objects' data in the same contiguous memory order. It's
> as cache-friendly as can be.
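Tim's effect is easy to reproduce with a toy benchmark along these lines (my own rough sketch, not from Tim's post; absolute timings vary by machine and CPython version, and the gap only shows once the data outgrows the caches):

    import random
    import time

    N = 1_000_000

    # Strings created back-to-back tend to land at (roughly) ascending
    # heap addresses, so walking the list in creation order visits their
    # data in a cache-friendly sequence.
    data = [str(i) for i in range(N)]

    # The same objects and the same list machinery, but the pointers are
    # visited in a scrambled memory order.
    shuffled = data[:]
    random.shuffle(shuffled)

    def checksum(seq):
        # Touch each string object (len dereferences the pointer).
        return sum(len(s) for s in seq)

    for name, seq in (("allocation order", data), ("shuffled order", shuffled)):
        start = time.perf_counter()
        checksum(seq)
        print(name, time.perf_counter() - start)

The shuffled traversal does the same Python-level work, so any slowdown is purely the memory access pattern.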
Tim's point is an interesting one, and difficult to deduce without insight into the implementation. I would, for example, assume that there is a "list construct" which holds just the references to the actual objects (the strings in our example), and the same for the dict (a hash map), so that the actual objects could be anywhere in memory while the "list" or "dict" itself sits in one contiguous block. Did you mean that, or that the objects themselves are also in the "same place"?

I believe there are two factors which may affect the performance: a cache miss on the actual data objects (e.g. the strings), and a cache miss on the "construct" object itself (i.e. the list or dict). But I have no idea how big lists or dicts of the same length are, or even whether the assumptions above are correct.
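As for the container sizes, sys.getsizeof gives a first approximation: it reports the container object itself and does not follow the references it holds (a rough sketch; exact byte counts vary by CPython version and platform):

    import sys

    keys = [str(i) for i in range(1000)]

    lst = list(keys)
    dct = dict.fromkeys(keys)

    # getsizeof reports the container only: the array of pointers plus
    # the object header for the list, the hash table for the dict.
    print(sys.getsizeof(lst))  # roughly 8 KB on 64-bit CPython (1000 pointers)
    print(sys.getsizeof(dct))  # several times larger: the table keeps spare slots

    # The referenced string objects are not included and must be counted
    # separately.
    print(sum(sys.getsizeof(s) for s in keys))

So a dict of the same length is noticeably bigger than a list, mostly because the hash table is kept sparse to limit collisions.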
On Fri, Sep 20, 2019 at 10:43 AM Steven D'Aprano <steve@pearwood.info> wrote:

> I'm not sure if this is something specific to timing tests, or if it
> will hold for real-world data in non-contrived code too.
The only thing which could affect it on "real data" is the execution environment: for example, if the cores/threads are loaded and the operations get preempted by the scheduler (which would also invalidate the cache). Depending on how frequent those invalidations are, the "fits into cache" condition could matter less. But that would depend on the system scheduler and the platform configuration (which may differ between a server and a typical desktop machine). Technically, though, the difference should not be any bigger on real data.

Richard