On Mon, Jan 24, 2022 at 11:01 AM Francesc Alted <faltet@gmail.com> wrote:
On Mon, Jan 24, 2022 at 2:15 AM Warren Weckesser <warren.weckesser@gmail.com> wrote:
Thanks Sebastian for pointing out that the issue is (minor) page
faults, and thanks Francesc for providing the links and the analysis
of the situation.  After reading those links and experimenting with
malloc in a couple C programs, it is clear now what is happening.  The
key factors to be aware of are:

* What the values of M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are and how
they are updated dynamically.
* Minor page faults are triggered by new allocations from the system,
whether via mmap() or via sbrk(). This means that avoiding mmap is not
enough to avoid potential problems with minor page faults (see the
sketch after this list).
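
As a quick way to see those faults directly, here is a small sketch (not
the original timing_question.py; the array size and iteration count are
made up) that reads the process's minor-fault counter from the resource
module around each creation of a fresh temporary:

```
import resource

import numpy as np

x = np.ones(500_000)   # ~4 MB each, i.e. on the order of 1000 4 KiB pages
y = np.ones(500_000)

def minor_faults():
    # Cumulative count of minor page faults for this process (Unix only).
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

for i in range(5):
    before = minor_faults()
    z = x + y          # fresh ~4 MB result; the previous z is freed when rebound
    print(f"iteration {i}: {minor_faults() - before} minor faults")
```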
<snip>

Thanks for going down the rabbit hole.  It is now much clearer what's going on.

If we set M_MMAP_THRESHOLD to a value large enough that we never use
mmap, we get results like this (also noted by Francesc):

```
$ MALLOC_MMAP_THRESHOLD_=20000000 python timing_question.py
numpy version 1.20.3

 639.5976 microseconds
 641.1489 microseconds
 640.8228 microseconds

 385.0931 microseconds
 384.8406 microseconds
 384.0398 microseconds
```

In this case, setting the mmap threshold disables the dynamic updating
of all the thresholds, so M_TRIM_THRESHOLD remains 128*1024.  This
means that each time free() is called, there will almost certainly be
enough free memory at the end of the heap to exceed M_TRIM_THRESHOLD,
so each iteration in timeit results in memory being allocated from the
system and returned to the system via sbrk(), and that leads to minor
page faults when writing to the temporary arrays in each iteration.
That is, we still get thrashing at the top of the heap.  (Apparently
not all calls to free() lead to trimming the heap, which is why the
performance is better with the second z array.)
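
One way to confirm that the trimming is the culprit is to pin both
thresholds, either via the MALLOC_MMAP_THRESHOLD_ and
MALLOC_TRIM_THRESHOLD_ environment variables or from inside the process
with mallopt().  A glibc-only sketch (the M_* values are the constants
from <malloc.h>, and the 1 GiB threshold is arbitrary):

```
import ctypes

libc = ctypes.CDLL("libc.so.6", use_errno=True)

# Tuning-parameter identifiers as defined in glibc's <malloc.h>.
M_TRIM_THRESHOLD = -1
M_MMAP_THRESHOLD = -3

# Raise both thresholds to 1 GiB: free() never trims the top of the heap
# back to the kernel, and large requests are never served by mmap().
# mallopt() returns 1 on success.
assert libc.mallopt(M_TRIM_THRESHOLD, 1 << 30) == 1
assert libc.mallopt(M_MMAP_THRESHOLD, 1 << 30) == 1
```

As with the environment variables, setting any of these parameters
explicitly also disables the dynamic adjustment of the thresholds.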

<snip>

This is still something that I don't get.  I suppose it is very minor, but it would be nice to know the reason for it.

FWIW, after some more reflection on this, I think that a possible explanation for the above is that in both cases the temporaries are served by mmap(), and hence both incur the zeroing overhead.  If that is correct, this offers a new opportunity to better pin down where the overhead comes from.
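
If that is the case, the first-touch cost of a fresh anonymous mapping
should be visible on its own: the first write to each page takes a minor
fault and receives a freshly zeroed page, while a second pass over the
same pages pays neither cost.  A rough sketch (sizes are illustrative,
4 KiB pages assumed):

```
import mmap
import time

import numpy as np

nbytes = 4_000_000                 # on the order of 1000 4 KiB pages
buf = mmap.mmap(-1, nbytes)        # anonymous mapping; nothing is faulted in yet
a = np.frombuffer(buf, dtype=np.uint8)

t0 = time.perf_counter()
a[::4096] = 1                      # first touch: one minor fault (and a zeroed page) per page
t1 = time.perf_counter()
a[::4096] = 2                      # second touch: the pages are already resident
t2 = time.perf_counter()

print(f"first touch : {(t1 - t0) * 1e6:8.1f} us")
print(f"second touch: {(t2 - t1) * 1e6:8.1f} us")
```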

Although I suspect that Warren's box and mine (AMD Ryzen 9 5950X 16-Core) are pretty similar, in order to compare apples with apples, I'm going to use my own measurements for this case:

```
$ MALLOC_MMAP_THRESHOLD_=20000000 python prova2.py
numpy version 1.21.2

 610.0853 microseconds
 610.7483 microseconds
 611.1118 microseconds

 369.2480 microseconds
 369.6340 microseconds
 369.8093 microseconds
```

So, in this case, the cost of a minor page fault is ~240 ns ((610us - 369us) / 1000 pages).  Now, the times for the original bench in my box:

```
$ python prova2.py
numpy version 1.21.2

 598.2740 microseconds
 609.1857 microseconds
 609.0713 microseconds

 137.5792 microseconds
 136.7661 microseconds
 136.8159 microseconds
```

Here the temporaries for the second z do not use mmap nor incur page faults, so they are ((369us - 137us) / 1000 pages) = 233ns/page faster, and this should be related to zeroing.  IIRC, for my system memset() can go up to 28 GB/s, so that accounts for 146ns of the 233ns (4096 bytes / 28 GB/s ~= 146ns).  That leaves an additional 87ns/page which, as it is very close to the memory latency of my box (around 78ns, see https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5), I think (wild guessing again) is somehow related to the memory management unit (MMU) operation (but not totally sure).
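
Spelling that arithmetic out (4 KiB pages and the 28 GB/s memset figure
above; the numbers round slightly differently from the ones quoted):

```
page = 4096                               # assumed page size in bytes

fault_cost = (610e-6 - 369e-6) / 1000     # ~241 ns/page: faulting vs. non-faulting loop, mmap disabled
saving     = (369e-6 - 137e-6) / 1000     # ~232 ns/page: the ~233 ns/page figure above
zeroing    = page / 28e9                  # ~146 ns to zero one page at 28 GB/s
remainder  = saving - zeroing             # ~86 ns/page left, close to the ~78 ns DRAM latency

for name, t in [("fault cost", fault_cost), ("saving", saving),
                ("zeroing", zeroing), ("remainder", remainder)]:
    print(f"{name:10s}: {t * 1e9:6.1f} ns/page")
```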

I am pretty sure we are not quite there yet in understanding everything completely (e.g. I left CPU caches out of the analysis), but hey, it is a lot of fun going down these rabbit holes indeed.

--
Francesc Alted