On Mon, Jan 24, 2022 at 2:15 AM Warren Weckesser <warren.weckesser@gmail.com> wrote:
Thanks Sebastian for pointing out that the issue is (minor) page
faults, and thanks Francesc for providing the links and the analysis
of the situation.  After reading those links and experimenting with
malloc in a couple of C programs, it is now clear what is happening.  The
key factors to be aware of are:

* What the values of M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are and how
they are updated dynamically.
* Minor page faults are triggered by new allocations from the system,
whether via mmap() or via sbrk(). This means that avoiding mmap() is not
enough to avoid potential problems with minor page faults.
<snip>

Thanks for going down the rabbit hole.  It is now much clearer what's going on.

If we set M_MMAP_THRESHOLD to a value large enough that we never use
mmap, we get results like this (also noted by Francesc):

```
$ MALLOC_MMAP_THRESHOLD_=20000000 python timing_question.py
numpy version 1.20.3

 639.5976 microseconds
 641.1489 microseconds
 640.8228 microseconds

 385.0931 microseconds
 384.8406 microseconds
 384.0398 microseconds
```

In this case, setting the mmap threshold disables the dynamic updating
of all the thresholds, so M_TRIM_THRESHOLD remains 128*1024.  This
means that each time free() is called, there will almost certainly be
enough free memory at the top of the heap to exceed M_TRIM_THRESHOLD,
so each iteration in timeit results in memory being allocated from the
system and returned to the system via sbrk(), and that leads to minor
page faults when writing to the temporary arrays in each iteration.
That is, we still get thrashing at the top of the heap.  (Apparently
not all the calls of free() lead to trimming the heap, so the
performance is better with the second z array.)
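One way to check the role of M_TRIM_THRESHOLD is to raise it explicitly.  A glibc-specific sketch via ctypes (the constant -1 is M_TRIM_THRESHOLD from glibc's <malloc.h>; other libcs may not provide mallopt() at all, and, as with the environment variable, calling mallopt() also disables the dynamic adjustment of the thresholds):

```python
# glibc-specific: raise M_TRIM_THRESHOLD so that free() stops returning
# memory to the OS via sbrk(), avoiding the heap-top thrashing above.
import ctypes

M_TRIM_THRESHOLD = -1          # mallopt parameter number in glibc's <malloc.h>
libc = ctypes.CDLL(None)
try:
    # 256 MiB: effectively "never trim"; glibc returns 1 on success.
    ok = libc.mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024)
    print("mallopt returned", ok)
except AttributeError:
    print("mallopt not available in this libc")
```

This has to run before the allocations of interest, since it only affects subsequent malloc/free behavior.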

<snip>

This is the part I still don't get.  I suppose it is very minor, but it would be nice to see the reason for it.

At any rate, with what we know now, one can estimate the origin of the overhead for the first z array. Going back to the original timing:

 602.1964 microseconds
 601.0663 microseconds
 601.0728 microseconds

 139.5745 microseconds
 139.1559 microseconds
 138.9791 microseconds

The 460 microsecs (600 - 140) is the overhead of dealing with the temporaries.  As there are around 1000 page faults per iteration, each page fault accounts for around 460 nanosecs.  Taking into account that the memory latency for this machine is around 80 nanosecs, most of the overhead must come from some other source.  My (wild) guess is that it is the zeroing the OS performs on the pages of each anonymous mmap() backing the temporaries.
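For reference, the arithmetic behind that estimate (numbers taken from the timings above; the ~1000 faults per iteration is the approximate count from earlier in the thread):

```python
overhead_us = 600.0 - 140.0    # per-iteration cost of the temporaries, in microseconds
faults = 1000                  # approximate minor page faults per iteration
per_fault_ns = overhead_us * 1000.0 / faults
print(per_fault_ns)            # 460.0 ns per fault, vs ~80 ns memory latency
```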

All in all, hats off to Warren for this entertaining piece of investigation!
--
Francesc Alted