FWIW, after some more reflection on this, I think that a possible explanation for the above is that in both cases the temporaries are served via mmap(), and hence both incur the zeroing overhead. If that is correct, this offers a new opportunity to better pin down where the overhead comes from.
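For context, here is a simplified sketch of a prova2.py-style benchmark (the expression, array size and iteration count are illustrative guesses on my part, not necessarily what the actual script uses; the key point is that each timing allocates a fresh ~4 MB temporary):

import numpy as np
from timeit import default_timer as timer

print(f"numpy version {np.__version__}")

N = 500_000                  # ~4 MB of float64, i.e. ~1000 4 KiB pages
x = np.random.rand(N)
y = np.random.rand(N)

for _ in range(6):
    t0 = timer()
    z = x + y                # allocates a fresh ~4 MB temporary; note that
                             # the previous z is only released *after* the
                             # new result exists, so the allocator cannot
                             # simply hand the same block back
    print(f"{(timer() - t0) * 1e6:.4f} microseconds")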
Although I suspect that Warren's box and mine (AMD Ryzen 9 5950X 16-Core) are pretty similar, in order to compare apples with apples, I'm going to use my own measurements for this case:
$ MALLOC_MMAP_THRESHOLD_=20000000 python prova2.py
numpy version 1.21.2
610.0853 microseconds
610.7483 microseconds
611.1118 microseconds
369.2480 microseconds
369.6340 microseconds
369.8093 microseconds
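By the way, a quick way to check that the slow evaluations really are paying on the order of 1000 minor page faults each is to read the counter that the kernel keeps per process; a sketch (Linux-only; ru_minflt is the cumulative count):

import resource
import numpy as np

def minflt():
    # cumulative count of minor page faults for this process
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

x = np.random.rand(500_000)
y = np.random.rand(500_000)

before = minflt()
z = x + y                    # a fresh ~4 MB temporary -> ~1000 pages touched
print(f"minor page faults for one temporary: {minflt() - before}")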
So, in this case, the cost of a minor page fault is ~240 ns ((610us - 369us) / 1000 pages; the ~4 MB temporary spans about 1000 4 KiB pages). Now, the times for the original benchmark on my box:
$ python prova2.py
numpy version 1.21.2
598.2740 microseconds
609.1857 microseconds
609.0713 microseconds
137.5792 microseconds
136.7661 microseconds
136.8159 microseconds
Here the temporaries for the second z do not use mmap() nor incur page faults, so they are ((369us - 137us) / 1000 pages) = 233ns/page faster; this should be related to the zeroing. IIRC, for my system memset() can go up to 28 GB/s, and at that rate a 4 KiB page takes ~146ns, which accounts for 146ns of the 233ns. That leaves an additional 87ns/page which, as it is very close to the memory latency of my box (around 78ns, see
https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5), I think (wild guessing again) is somehow related to the memory management unit (MMU) operation (but I'm not totally sure).
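In case anyone wants to reproduce the memset figure on their own box, a quick-and-dirty way to estimate it from Python is to zero a buffer whose pages have already been faulted in, so that only the write bandwidth is measured (a rough sketch; numbers will vary per machine):

import numpy as np
from timeit import default_timer as timer

buf = np.empty(512 * 1024 * 1024 // 8)  # ~512 MB of float64
buf[:] = 1.0                            # fault all pages in first

t0 = timer()
buf[:] = 0.0                            # essentially one big memset
dt = timer() - t0
gbs = buf.nbytes / dt / 1e9
print(f"zeroing bandwidth: {gbs:.1f} GB/s "
      f"({4096 / (gbs * 1e9) * 1e9:.0f} ns per 4 KiB page)")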
I am pretty sure we are not quite there yet in terms of understanding everything completely (e.g. I have left CPU caches out of the analysis), but hey, it is a lot of fun going down these rabbit holes indeed.