Performance mystery

In the script below, the evaluation of the expression `z.real**2 + z.imag**2` is timed using the `timeit` module. `z` is a 1D array of random samples with dtype `np.complex128` and length 250000. The mystery is the change in performance of the calculation between the first array to which it is applied and the second. The output of the script is

```
numpy version 1.23.0.dev0+460.gc30876f64

 619.7096 microseconds
 625.3833 microseconds
 634.8389 microseconds

 137.0659 microseconds
 137.5231 microseconds
 137.5582 microseconds
```

Each batch of three timings corresponds to repeating the timeit operation three times on the same random array `z`; i.e. a new array `z` is generated for the second batch. The question is: why does it take so much longer to evaluate the expression with the first random array?

Some other details:

* If I change the expression to, say, `'z.real + z.imag'`, the huge disparity disappears.
* If I generate more random `z` arrays, the performance remains at the level of approximately 140 usec.
* I used the main branch of numpy for the above output, but the same thing happens with 1.20.3, so this is not the result of a recent change.
* So far, when I run the script, I always see output like that shown above: the time required for the first random array is typically four times that required for the second array. If I run similar commands in ipython, I have seen the slow case repeated several times (with newly generated random arrays), but eventually the time drops down to 140 usec (or so), and I don't see the slow case anymore.
* I'm using a 64 bit Linux computer:

```
$ uname -a
Linux pop-os 5.15.8-76051508-generic #202112141040~1639505278~21.10~0ede46a SMP Tue Dec 14 22:38:29 U x86_64 x86_64 x86_64 GNU/Linux
```

Any ideas?

Warren

Here's the script:

```
import timeit
import numpy as np


def generate_sample(n, rng):
    return rng.normal(scale=1000, size=2*n).view(np.complex128)


print(f'numpy version {np.__version__}')
print()

rng = np.random.default_rng()
n = 250000
timeit_reps = 10000

expr = 'z.real**2 + z.imag**2'

z = generate_sample(n, rng)
for _ in range(3):
    t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
    print(f"{1e6*t/timeit_reps:9.4f} microseconds")
print()

z = generate_sample(n, rng)
for _ in range(3):
    t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
    print(f"{1e6*t/timeit_reps:9.4f} microseconds")
print()
```

On Tue, Jan 18, 2022, at 21:55, Warren Weckesser wrote:
expr = 'z.real**2 + z.imag**2'
z = generate_sample(n, rng)
🤔 If I duplicate the `z = ...` line, I get the fast result throughout. If, however, I use `generate_sample(1, rng)` (or any value other than `n`), it does not improve matters.

Could this be a memory caching issue?

Stéfan

On Wed, Jan 19, 2022 at 7:33 AM Stefan van der Walt <stefanv@berkeley.edu> wrote:
On Tue, Jan 18, 2022, at 21:55, Warren Weckesser wrote:
expr = 'z.real**2 + z.imag**2'
z = generate_sample(n, rng)
🤔 If I duplicate the `z = ...` line, I get the fast result throughout. If, however, I use `generate_sample(1, rng)` (or any other value than `n`), it does not improve matters.
Could this be a memory caching issue?
I can also reproduce that, but only on my Linux boxes. My MacMini does not notice the difference.

Interestingly enough, you don't even need an additional call to `generate_sample(n, rng)`. If one uses `z = np.empty(...)` and then does an assignment, like:

    z = np.empty(n, dtype=np.complex128)
    z[:] = generate_sample(n, rng)

then everything runs at the same speed:

numpy version 1.20.3

 142.3667 microseconds
 142.3717 microseconds
 142.3781 microseconds

 142.7593 microseconds
 142.3579 microseconds
 142.3231 microseconds

As another data point, by doing the same operation but using numexpr I am not seeing any difference either, not even on Linux:

numpy version 1.20.3
numexpr version 2.8.1

  95.6513 microseconds
  88.1804 microseconds
  97.1322 microseconds

 105.0833 microseconds
 100.5555 microseconds
 100.5654 microseconds

[it is rather a bit the other way around: the second iteration seems a hair faster] See the numexpr script below.

I am totally puzzled here.

"""
import timeit
import numpy as np
import numexpr as ne


def generate_sample(n, rng):
    return rng.normal(scale=1000, size=2*n).view(np.complex128)


print(f'numpy version {np.__version__}')
print(f'numexpr version {ne.__version__}')
print()

rng = np.random.default_rng()
n = 250000
timeit_reps = 10000

expr = 'ne.evaluate("zreal**2 + zimag**2")'

z = generate_sample(n, rng)
zreal = z.real
zimag = z.imag
for _ in range(3):
    t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
    print(f"{1e6*t/timeit_reps:9.4f} microseconds")
print()

z = generate_sample(n, rng)
zreal = z.real
zimag = z.imag
for _ in range(3):
    t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
    print(f"{1e6*t/timeit_reps:9.4f} microseconds")
print()
"""

-- Francesc Alted
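A plausible reason numexpr does not show the effect is that it evaluates the expression block-wise over small internal buffers instead of materializing full-size temporaries, so no multi-megabyte allocation happens on every call. The sketch below (not from the thread) goes one step further and reuses a preallocated output buffer as well, via the existing `out=` parameter of `ne.evaluate`:

```
import numpy as np
import numexpr as ne

rng = np.random.default_rng()
z = rng.normal(scale=1000, size=2 * 250000).view(np.complex128)
zreal, zimag = z.real, z.imag

# Preallocate the result once; ne.evaluate writes into it instead of
# allocating a fresh ~2 MB output array on every call.
result = np.empty(z.shape, dtype=np.float64)
ne.evaluate("zreal**2 + zimag**2", out=result)
```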

On Wed, Jan 19, 2022 at 11:50 AM Francesc Alted <faltet@gmail.com> wrote:
On Wed, Jan 19, 2022 at 7:33 AM Stefan van der Walt <stefanv@berkeley.edu> wrote:
On Tue, Jan 18, 2022, at 21:55, Warren Weckesser wrote:
expr = 'z.real**2 + z.imag**2'
z = generate_sample(n, rng)
🤔 If I duplicate the `z = ...` line, I get the fast result throughout. If, however, I use `generate_sample(1, rng)` (or any other value than `n`), it does not improve matters.
Could this be a memory caching issue?
I can also reproduce that, but only on my Linux boxes. My MacMini does not notice the difference.
Interestingly enough, you don't even need an additional call to `generate_sample(n, rng)`. If one use `z = np.empty(...)` and then do an assignment, like:
z = np.empty(n, dtype=np.complex128) z[:] = generate_sample(n, rng)
then everything runs at the same speed:
Just turning the view into a copy inside `generate_sample()` will also make the difference go away, probably for the same reason as this version.

András
<snip>

On Wed, 2022-01-19 at 11:49 +0100, Francesc Alted wrote:
On Wed, Jan 19, 2022 at 7:33 AM Stefan van der Walt <stefanv@berkeley.edu> wrote:
On Tue, Jan 18, 2022, at 21:55, Warren Weckesser wrote:
expr = 'z.real**2 + z.imag**2'
z = generate_sample(n, rng)
🤔 If I duplicate the `z = ...` line, I get the fast result throughout. If, however, I use `generate_sample(1, rng)` (or any other value than `n`), it does not improve matters.
Could this be a memory caching issue?
Yes, it is a caching issue for sure. We have seen similar random fluctuations before. You can prove that it is a cache/page-fault issue by running it with `perf stat`. I did this twice, once with the second loop removed (page-faults only):

```
28333629      page-faults               #  936.234 K/sec
28362718      page-faults               #    1.147 M/sec
```

The number of page faults is low when running only the second loop (or rather, running the first one only once):

```
   15024      page-faults               #    1.837 K/sec
```

So that is the *reason*. I had tried before to figure out why the page faults differ so much, or whether we can do something about it, but I never had any reasonable lead on it.

In general, these fluctuations are pretty random, in the sense that unrelated code changes and recompilation can swap the behaviour easily. As Andras noted, he sees the opposite.

I would love to have an idea of how to figure out why the page faults are so imbalanced between the two.

(I have not looked at CPU cache misses this time, but since page faults happen, I assume that should not matter?)

Cheers,

Sebastian
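For anyone without `perf` at hand, the same counter can be read from Python via `resource.getrusage` (`ru_minflt` is the process's cumulative minor page-fault count). A sketch, not part of the original script:

```
import resource
import timeit
import numpy as np


def minor_faults():
    # Cumulative count of minor page faults for this process.
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


rng = np.random.default_rng()
z = rng.normal(scale=1000, size=2 * 250000).view(np.complex128)

before = minor_faults()
timeit.timeit('z.real**2 + z.imag**2', globals=globals(), number=10000)
print('minor page faults during the batch:', minor_faults() - before)
```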
<snip>

Given that this seems to be Linux-only, is this related to how glibc does large allocations (>128kB) using mmap()?

https://stackoverflow.com/a/33131385

On Wed, Jan 19, 2022 at 9:06 AM Sebastian Berg <sebastian@sipsolutions.net> wrote:
<snip>

On Wed, Jan 19, 2022 at 6:58 PM Stanley Seibert <sseibert@anaconda.com> wrote:
Given that this seems to be Linux only, is this related to how glibc does large allocations (>128kB) using mmap()?
That's a good point. As MMAP_THRESHOLD is 128 KB, and the size of `z` is almost 4 MB, the mmap machinery is probably getting involved here. Also, as pages acquired via anonymous mmap are not actually allocated until you access them for the first time, that would explain why the first access is slow.

What puzzles me is that the timeit loops access the `z` data 3*10000 times, which is plenty of time for doing the allocation (it should require just a single iteration).
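The "not allocated until first access" behavior is easy to observe directly with the standard-library `mmap` module. A sketch (not from the thread) that counts minor page faults before and after first touching an anonymous ~4 MB mapping:

```
import mmap
import resource


def minor_faults():
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


# Anonymous private mapping of ~4 MB: the virtual pages exist,
# but no physical pages are committed yet.
buf = mmap.mmap(-1, 4_000_000)

data1 = b'\x01' * 4_000_000
data2 = b'\x02' * 4_000_000

before = minor_faults()
buf[:] = data1   # first write touches every page of the mapping
print('faults on first touch :', minor_faults() - before)

before = minor_faults()
buf[:] = data2   # the mapping's pages are already resident
print('faults on second touch:', minor_faults() - before)
```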
<snip>
-- Francesc Alted

On Wed, 19 Jan 2022 19:48:32 +0100 Francesc Alted <faltet@gmail.com> wrote:
What puzzles me is that the timeit loops access `z` data 3*10000 times, which is plenty of time for doing the allocation (just should require just a single iteration).
Hi all,

Very interesting discussion ... IIRC, timeit does some clever tricks like disabling the garbage collector, so for the first run of timeit I suspect it does 10000 mallocs (without the associated frees, and thus without the ability to recycle any of those buffers). On the second run, those buffers were allocated previously and kept by Python for re-use, which could explain the faster run.

This does not really explain the difference observed between operating systems, though.

Cheers,

Jerome
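The recollection about the garbage collector is easy to check: `timeit.Timer.timeit` disables the cyclic GC around the timed loop and restores it afterwards. A quick verification sketch (not from the thread):

```
import gc
import timeit

flags = []
# Record whether the cyclic GC is enabled inside the timed statement.
timeit.timeit('flags.append(gc.isenabled())',
              globals={'gc': gc, 'flags': flags},
              number=3)
print(flags)           # [False, False, False]: GC is off during timing
print(gc.isenabled())  # True again after timeit returns
```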

On Wed, 2022-01-19 at 20:49 +0100, Jerome Kieffer wrote:
On Wed, 19 Jan 2022 19:48:32 +0100 Francesc Alted <faltet@gmail.com> wrote:
What puzzles me is that the timeit loops access `z` data 3*10000 times, which is plenty of time for doing the allocation (just should require just a single iteration).
Hi all,
Very interesting discussion ... IIRC, timeit does some clever tricks like disabling the garbage collector, so for the first run of timeit,
Yes, but this will have no influence on the observation. (The following explanation applies to CPython; things are different on other Python implementations.)

Timeit will disable the GC, but that does not mean that automatic memory management is disabled completely, and it has no effect on the code in question. Disabling the GC only disables the complicated part of memory management: cleaning up reference cycles, such as a list containing itself, a class instance holding an attribute that refers back to itself, etc. Everything else is still freed immediately via reference counting.

Further, I suspect timeit is smart enough to allow running the GC between timing runs (i.e. at the setup stage).

Cheers,

Sebastian
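To make the point concrete: with the cyclic GC disabled (as inside a timeit run), a temporary array is still deallocated the moment its reference count drops to zero. A small sketch (names are illustrative):

```
import gc
import weakref
import numpy as np

gc.disable()                 # same state as inside a timeit run

a = np.empty(250000, dtype=np.complex128)
ref = weakref.ref(a)         # ndarrays support weak references

del a                        # refcount hits zero: the buffer is freed
                             # immediately, no GC pass needed
print(ref() is None)         # True

gc.enable()
```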
I suspect it does 10000 malloc (without associated free thus without the ability of recycling any of those buffers). On the second run, those buffer were allocated previously and kept by Python for re-use which could explain the faster run.
This does not explain really the different between operating systems observed.
Cheers,
Jerome

On Wed, 2022-01-19 at 19:48 +0100, Francesc Alted wrote:
On Wed, Jan 19, 2022 at 6:58 PM Stanley Seibert <sseibert@anaconda.com> wrote:
Given that this seems to be Linux only, is this related to how glibc does large allocations (>128kB) using mmap()?
That's a good point. As MMAP_THRESHOLD is 128 KB, and the size of `z` is almost 4 MB, mmap machinery is probably getting involved here. Also, as pages acquired via anonymous mmap are not actually allocated until you access them the first time, that would explain that the first access is slow. What puzzles me is that the timeit loops access `z` data 3*10000 times, which is plenty of time for doing the allocation (just should require just a single iteration).
Indeed, and I have no clue how to start understanding why the kernel ends up doing very different things in seemingly identical situations (I assume it is the kernel). Some memory fragmentation? Does the first example just get "lucky" for some reason and reuse a `malloc` exactly? We may need an expert in kernel virtual memory/page management :).

Cheers,

Sebastian
<snip>

On Wed, Jan 19, 2022 at 7:48 PM Francesc Alted <faltet@gmail.com> wrote:
On Wed, Jan 19, 2022 at 6:58 PM Stanley Seibert <sseibert@anaconda.com> wrote:
Given that this seems to be Linux only, is this related to how glibc does large allocations (>128kB) using mmap()?
That's a good point. As MMAP_THRESHOLD is 128 KB, and the size of `z` is almost 4 MB, mmap machinery is probably getting involved here. Also, as pages acquired via anonymous mmap are not actually allocated until you access them the first time, that would explain that the first access is slow. What puzzles me is that the timeit loops access `z` data 3*10000 times, which is plenty of time for doing the allocation (just should require just a single iteration).
I think I have more evidence that what is happening here has to do with how the malloc mechanism works in Linux. I find the following explanation to be really good:

https://sourceware.org/glibc/wiki/MallocInternals

In addition, this excerpt of the mallopt manpage (https://man7.org/linux/man-pages/man3/mallopt.3.html) is very significant:

    Note: Nowadays, glibc uses a dynamic mmap threshold by default.
    The initial value of the threshold is 128*1024, but when blocks
    larger than the current threshold and less than or equal to
    DEFAULT_MMAP_THRESHOLD_MAX are freed, the threshold is adjusted
    upward to the size of the freed block.  When dynamic mmap
    thresholding is in effect, the threshold for trimming the heap
    is also dynamically adjusted to be twice the dynamic mmap
    threshold.  Dynamic adjustment of the mmap threshold is disabled
    if any of the M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or
    M_MMAP_MAX parameters is set.

This description matches closely what is happening here: after `z` is freed (replaced by another random array in the second part of the calculation), the dynamic mmap threshold kicks in and is adjusted upward to the size of the freed block (~4 MB in this case), with the trim threshold raised to twice that. So for the second part the program break (i.e. where the heap ends) is increased instead, which is faster because this memory does not need to be zeroed before use.

Interestingly, the M_MMAP_THRESHOLD for the system malloc can be set by using the MALLOC_MMAP_THRESHOLD_ environment variable. For example, the original times are:

```
$ python mmap-numpy.py
numpy version 1.20.3

 635.4752 microseconds
 635.8906 microseconds
 636.0661 microseconds

 144.7238 microseconds
 143.9147 microseconds
 144.0621 microseconds
```

but if we force it to always use mmap:

```
$ MALLOC_MMAP_THRESHOLD_=0 python mmap-numpy.py
numpy version 1.20.3

 628.8890 microseconds
 628.0965 microseconds
 628.7590 microseconds

 640.9369 microseconds
 641.5104 microseconds
 642.4027 microseconds
```

the first and second parts execute at the same (slow) speed. And if we set the threshold to be exactly 4 MB:

```
$ MALLOC_MMAP_THRESHOLD_=4194304 python mmap-numpy.py
numpy version 1.20.3

 630.7381 microseconds
 631.3634 microseconds
 632.2200 microseconds

 382.6925 microseconds
 380.1790 microseconds
 380.0340 microseconds
```

we see how performance is increased for the second part (although not as much as without specifying the threshold manually; probably this manual setting prevents other optimizations from kicking in).

As a final check, if we use other malloc systems, like the excellent mimalloc (https://github.com/microsoft/mimalloc), we can get really good performance for both parts:

```
$ LD_PRELOAD=/usr/local/lib/libmimalloc.so python mmap-numpy.py
numpy version 1.20.3

 147.5968 microseconds
 146.9028 microseconds
 147.1794 microseconds

 148.0905 microseconds
 147.7667 microseconds
 147.5180 microseconds
```

However, as this is avoiding the mmap() calls, this approach probably uses more memory, especially when large arrays need to be handled.

All in all, this is testimonial of how much memory handling can affect performance in modern computers. Perhaps it is time to test different memory allocation strategies in NumPy and come up with suggestions for users.

Francesc
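As an alternative to the environment variables, the same knobs can be poked at runtime through glibc's `mallopt()`. A sketch using `ctypes` (Linux/glibc only; the `M_MMAP_THRESHOLD = -3` and `M_TRIM_THRESHOLD = -1` constants come from glibc's `<malloc.h>`, and, as the manpage excerpt above notes, setting them disables the dynamic threshold adjustment):

```
import ctypes

# glibc's C library on Linux; this will not work with other libcs.
libc = ctypes.CDLL("libc.so.6", use_errno=True)

# Parameter codes from glibc's <malloc.h>.
M_TRIM_THRESHOLD = -1
M_MMAP_THRESHOLD = -3

# Keep allocations below 16 MB on the heap, and never trim the heap back.
libc.mallopt(M_MMAP_THRESHOLD, 16 * 1024 * 1024)
libc.mallopt(M_TRIM_THRESHOLD, -1)
```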
<snip>
-- Francesc Alted

On Thu, 2022-01-20 at 14:41 +0100, Francesc Alted wrote:
On Wed, Jan 19, 2022 at 7:48 PM Francesc Alted <faltet@gmail.com> wrote:
On Wed, Jan 19, 2022 at 6:58 PM Stanley Seibert <sseibert@anaconda.com> wrote:
Given that this seems to be Linux only, is this related to how glibc does large allocations (>128kB) using mmap()?
That's a good point. As MMAP_THRESHOLD is 128 KB, and the size of `z` is almost 4 MB, mmap machinery is probably getting involved here. Also, as pages acquired via anonymous mmap are not actually allocated until you access them the first time, that would explain that the first access is slow. What puzzles me is that the timeit loops access `z` data 3*10000 times, which is plenty of time for doing the allocation (just should require just a single iteration).
I think I have more evidence that what is happening here has to see of how the malloc mechanism works in Linux. I find the next explanation to be really good:
Thanks for figuring this out! It has been bugging me a lot before. So it rather depends on how `malloc` works, and not on the kernel. It is surprising how "random" this can look, but I suppose some examples just happen to sit almost exactly at the threshold.

It might be interesting if we could tweak `mallopt` parameters for typical NumPy usage. But unless it is very clear, maybe a standalone module is better?
In addition, this excerpt of the mallopt manpage ( https://man7.org/linux/man-pages/man3/mallopt.3.html) is very significant:
<snip>
All in all, this is testimonial of how much memory handling can affect performance in modern computers. Perhaps it is time for testing different memory allocation strategies in NumPy and come up with suggestions for users.
You are probably aware, but Matti and Elias have now added the ability to customize array data allocation in NumPy, so it should be straightforward to write a small package/module that tweaks the allocation strategy here.

Cheers,

Sebastian
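The machinery referred to is NEP 49 (`PyDataMem_SetHandler` on the C side); swapping in a custom allocator requires a small C extension. From Python one can at least inspect which handler is in use. The sketch below assumes NumPy >= 1.22, where `numpy.core.multiarray.get_handler_name` is available:

```
import numpy as np
from numpy.core.multiarray import get_handler_name

a = np.empty(250000, dtype=np.complex128)
print(get_handler_name())    # handler used for new allocations
print(get_handler_name(a))   # handler that allocated this array's data
```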
<snip>

On Thu, Jan 20, 2022 at 4:10 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
Thanks for figuring this out! It has been bugging me a lot before. So it rather depends on how `malloc` works, and not the kernel. It is surprising how "random" this can look, but I suppose some examples just happen to sit almost exactly at the threshold.
Also, one should be aware that temporaries can appear in many situations in NumPy; it is not exactly easy to predict when this is going to bite you. In Warren's example he already noticed that the difference in performance only showed up when a complex expression was used (`z.real**2 + z.imag**2`, so with temporaries), whereas a simple expression (`z.real + z.imag`, no temporaries) did not trigger the behavior.
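One way to take the allocator out of the picture entirely is to reuse preallocated work buffers instead of letting each evaluation create fresh multi-megabyte temporaries. A sketch (not from the thread; buffer and function names are illustrative):

```
import numpy as np

rng = np.random.default_rng()
z = rng.normal(scale=1000, size=2 * 250000).view(np.complex128)

# Work buffers allocated once and reused, so repeated evaluation does
# not malloc/free large temporaries (and cannot churn the heap).
t1 = np.empty(z.shape, dtype=np.float64)
t2 = np.empty(z.shape, dtype=np.float64)

def abs2(z, out=t1, tmp=t2):
    np.multiply(z.real, z.real, out=out)
    np.multiply(z.imag, z.imag, out=tmp)
    np.add(out, tmp, out=out)
    return out
```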
It might be interesting if we could tweak `mallopt` parameters for typical NumPy usage. But unless it is very clear, maybe a standalone module is better?
Well, I can see `mallopt` handling being useful just for NumPy purposes, but it is up to you to decide where to put this logic.
All in all, this is testimonial of how much memory handling can
affect performance in modern computers. Perhaps it is time for testing different memory allocation strategies in NumPy and come up with suggestions for users.
You are probably aware, but Matti and Elias now added the ability to customize array data allocation in NumPy, so it should be straight forward to write a small package/module that tweaks the allocation strategy here.
That's good to hear.

Cheers,

-- Francesc Alted

On 1/20/22, Francesc Alted <faltet@gmail.com> wrote:
<snip>
Thanks Sebastian for pointing out that the issue is (minor) page faults, and thanks Francesc for providing the links and the analysis of the situation. After reading those links and experimenting with malloc in a couple C programs, it is clear now what is happening. The key factors to be aware of are: * What the values of M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are and how they are updated dynamically. * Minor page faults are triggered by new allocations from the system, whether via mmap or via sbrk(). This means avoiding mmap is not enough to avoid potential problems with minor page faults. In my test program, the pattern of allocating and freeing the 4M array followed by repeatedly allocating and freeing 2M temporaries (in the first batch of timeit() calls) triggers, in effect, a thrashing of the top of the heap, in which memory is repeatedly allocated from and then returned to the system via sbrk(). Here's detailed timeline of a simplified version of what the test was doing, if anyone is interested in the gory details... First note that the default value of M_MMAP_THRESHOLD is 128*1024. For allocations above this value, mmap will be used. The simplified pseudocode of the test program (not necessarily including all the temporaries that might be generated in the actual Python code): ``` z = malloc(4000000) # More than M_MMAP_THRESHOLD so mmap is used. z[...] = ... # Triggers 976 minor page faults (+ or -). # The test program computes z.real**2 + z.imag**2 10000 times # in the timeit call. This requires temporary arrays; call them # t1 and t2. The first time through the loop, we have... t1 = malloc(2000000) # More than M_MMAP_THRESHOLD so mmap is used. t1[...] = ... # Triggers 488 minor page faults. t2 = malloc(2000000) # Uses mmap. t2[...] = ... # Another 488 minor page faults. # The dynamic updating of the malloc parameters occurs in the call # to free(). These allocations used mmap, so the call to free() # returns the memory to the system. free(t1) # Dynamic update of thresholds # Now M_MMAP_THRESHOLD=2000000, M_TRIM_THRESHOLD=4000000 free(t2) # The second iteration of the timeit function: t1 = malloc(2000000) # Less than M_MMAP_THRESHOLD; heap is used. t1[...] = ... # Triggers 488 minor page faults. t2 = malloc(2000000) # Uses heap. t2[...] = ... # Another 488 minor page faults. # The allocations of t1 and t2 having caused the heap to grow via # call(s) to sbrk(). free(t1) free(t2) # When free(t2) is called, there is a 4000000 byte block at the # end of the heap whose size equals M_TRIM_THRESHOLD, so it is # returned to the system via sbrk(). # The remaining 9998 iterations are the same... t1 = malloc(2000000) # Less than M_MMAP_THRESHOLD; heap is used. t1[...] = ... # Triggers 488 minor page faults. t2 = malloc(2000000) # Uses heap. t2[...] = ... # Another 488 minor page faults. # Once again, the allocations of t1 and t2 having caused the heap # to grow via call(s) to sbrk(). free(t1) free(t2) # And once again, there is now a 4000000 byte block at the end of # the heap whose size equals M_TRIM_THRESHOLD, so it is returned # to the system via sbrk(). # The important point here is that in each iteration after the # first, malloc() allocates memory from the system and returns # it to the system via sbrk(). Writing to that newly allocated # memory the first time triggers the minor page faults. # After running timeit with the first z, we free z and then # reallocate another 4000000 for z. 
free(z) # This free() triggers another dynamic update, so now # M_MMAP_THRESHOLD=4000000 and M_TRIM_THRESHOLD=8000000. z = malloc(4000000) # Uses heap. z[...] = ... # 976 minor page faults. # First timeit iteration with the new z: t1 = malloc(2000000) # Less than M_MMAP_THRESHOLD; heap is used. t1[...] = ... # Triggers 488 minor page faults. t2 = malloc(2000000) # Uses heap. t2[...] = ... # Another 488 minor page faults. free(t1) # Thresholds are not updated in this case. free(t2) # The above calls to free() do not result in the heap being # trimmed, because M_TRIM_THRESHOLD is now 8000000, and freeing # the two 2M blocks does not create a big enough block at the # top of the heap to be trimmed. # Remaing 9999 timeit iterations... # Because the heap was not trimmed, the following calls to # malloc() for t1 and t2 will not require the heap to be grown # via sbrk(). t1 = malloc(2000000) # Less than M_MMAP_THRESHOLD; heap is used. t1[...] = ... # No minor page faults (yay!). t2 = malloc(2000000) # Uses heap. t2[...] = ... # No minor page faults. free(t1) # Thresholds are not updated in this case. free(t2) # As before since M_TRIM_THRESHOLD is now 8000000, the calls to # free() above do not trigger a trim of the heap. ``` When the timeit loops are run with the second z array, minor page faults occur only when z is initialized, and in the first iteration of the timeit loop (the first time the temporary arrays are created and written to). In the subsequent 9999 iterations, the memory for the temporaries is available on the heap, so no more minor page faults are generated. This is why the timeit results with the second z array are much better. If we set the M_MMAP_THRESHOLD to 0 via the MALLOC_MMAP_THRESHOLD_ environment variable, we force all allocations to use mmap, and the corresponding free() returns the memory the system, so the timings are equally slow for the first and second z array. If we set M_MMAP_THRESHOLD to value large enough that we never use mmap, we get results like this (also noted by Francesc): ``` $ MALLOC_MMAP_THRESHOLD_=20000000 python timing_question.py numpy version 1.20.3 639.5976 microseconds 641.1489 microseconds 640.8228 microseconds 385.0931 microseconds 384.8406 microseconds 384.0398 microseconds ``` In this case, setting the mmap threshold disables the dynamic updating of all the thresholds, so M_TRIM_THRESHOLD remains 128*1024. This means each time free() is called, there will almost certainly be enough memory at the end of the heap that exceeds M_TRIM_THRESHOLD and so each iteration in timeit results memory being allocated from the system and returned to the system via sbrk(), and that leads to minor page faults when writing to the temporary arrays in each iterations. That is, we still get thrashing at the top of the heap. (Apparenly not all the calls of free() lead to trimming the heap, so the performance is better with the second z array.) 
To get good performance for both the first and second z arrays, we can manually set both the mmap threshold and the trim threshold:

```
$ MALLOC_MMAP_THRESHOLD_=20000000 MALLOC_TRIM_THRESHOLD_=20000000 python timing_question.py
numpy version 1.20.3

 140.9165 microseconds
 148.2753 microseconds
 150.8209 microseconds

 152.6369 microseconds
 152.5829 microseconds
 152.4377 microseconds
```

Or disable trimming completely:

```
$ MALLOC_MMAP_THRESHOLD_=20000000 MALLOC_TRIM_THRESHOLD_=-1 python timing_question.py
numpy version 1.20.3

 144.9151 microseconds
 147.7243 microseconds
 153.9987 microseconds

 155.9898 microseconds
 156.0203 microseconds
 155.9820 microseconds
```

We can also avoid the issue by using an alternative memory management library such as mimalloc (noted by Francesc) or jemalloc.

Warren
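Another option, sketched below rather than taken from the thread's benchmarks, is to remove the per-iteration temporaries altogether by preallocating the work buffers and using the ufunc out= argument, so the hot loop never calls malloc/free for large blocks. The function name `abs2` and the buffer names `out` and `tmp` are illustrative:

```
import numpy as np

rng = np.random.default_rng()
n = 250000
z = rng.normal(scale=1000, size=2*n).view(np.complex128)

# Allocate the output and one scratch buffer once, up front. Reusing them
# keeps malloc/free (and therefore sbrk/mmap and the minor page faults)
# out of the timed loop entirely.
out = np.empty(n)
tmp = np.empty(n)

def abs2(z, out, tmp):
    np.multiply(z.real, z.real, out=out)  # out = z.real**2
    np.multiply(z.imag, z.imag, out=tmp)  # tmp = z.imag**2
    out += tmp                            # out = z.real**2 + z.imag**2
    return out

w = abs2(z, out, tmp)
```

Whether this is actually faster than tuning the malloc thresholds would of course need the same timeit treatment, but it makes the timing independent of the allocator's state.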

On Mon, Jan 24, 2022 at 2:15 AM Warren Weckesser <warren.weckesser@gmail.com> wrote:
Thanks Sebastian for pointing out that the issue is (minor) page faults, and thanks Francesc for providing the links and the analysis of the situation. After reading those links and experimenting with malloc in a couple C programs, it is clear now what is happening. The key factors to be aware of are:
* What the values of M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are and how they are updated dynamically.
* Minor page faults are triggered by new allocations from the system, whether via mmap or via sbrk(). This means avoiding mmap is not enough to avoid potential problems with minor page faults.
<snip> Thanks for going down the rabbit hole. It is now much clearer what's going on. If we set M_MMAP_THRESHOLD to a value large enough that we never use
mmap, we get results like this (also noted by Francesc):
```
$ MALLOC_MMAP_THRESHOLD_=20000000 python timing_question.py
numpy version 1.20.3

 639.5976 microseconds
 641.1489 microseconds
 640.8228 microseconds

 385.0931 microseconds
 384.8406 microseconds
 384.0398 microseconds
```
In this case, setting the mmap threshold disables the dynamic updating of all the thresholds, so M_TRIM_THRESHOLD remains 128*1024. This means that each time free() is called, there will almost certainly be enough free memory at the end of the heap to exceed M_TRIM_THRESHOLD, so each iteration in timeit results in memory being allocated from the system and returned to the system via sbrk(), and that leads to minor page faults when writing to the temporary arrays in each iteration. That is, we still get thrashing at the top of the heap. (Apparently not all the calls to free() lead to trimming the heap, so the performance is better with the second z array.)
<snip> This is still something that I don't get. I suppose this is very minor, but it would be nice to see the reason for this.

At any rate, with what we know now, one can estimate the origin of the overhead for the first z array. Going back to the original timing:

```
 602.1964 microseconds
 601.0663 microseconds
 601.0728 microseconds

 139.5745 microseconds
 139.1559 microseconds
 138.9791 microseconds
```

The 460 microseconds (600 - 140) is the overhead of dealing with the temporaries. That means that, as there are around 1000 page faults, each page fault accounts for around 460 nanoseconds. Taking into account that the memory latency for this machine is around 80 nanoseconds, most of the overhead must come from some other source. My (wild) guess is that this is the zeroing of each temporary that the OS performs for the anonymous mmap() it has to make.

All in all, hats off to Warren for this entertaining piece of investigation!

-- Francesc Alted
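Spelling out the arithmetic behind that estimate, as a back-of-the-envelope check that assumes 4 KiB pages and the two ~2 MB float64 temporaries per evaluation:

```
page_size = 4096                   # assumed 4 KiB pages
temp_bytes = 2 * 2_000_000         # two ~2 MB float64 temporaries per evaluation
pages = temp_bytes / page_size     # ~976 pages written per evaluation
overhead_us = 600 - 140            # slow minus fast timing, in microseconds
print(overhead_us * 1000 / pages)  # ~471 ns per minor page fault
```

Rounding the page count to "around 1000" gives the ~460 ns figure quoted above.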

On Mon, Jan 24, 2022 at 10:02 AM Francesc Alted <faltet@gmail.com> wrote:
Thanks for going down the rabbit hole. It is now much clearer what's going on.
All in all, hats off to Warren for this entertaining piece of investigation!
I can't resist. You'd expect someone called https://en.wikipedia.org/wiki/Warren to have special skills regarding rabbit holes! -- Jonathan Fine (Yes, that's really my family name)

On Mon, Jan 24, 2022 at 11:01 AM Francesc Alted <faltet@gmail.com> wrote:
On Mon, Jan 24, 2022 at 2:15 AM Warren Weckesser < warren.weckesser@gmail.com> wrote:
Thanks Sebastian for pointing out that the issue is (minor) page faults, and thanks Francesc for providing the links and the analysis of the situation. After reading those links and experimenting with malloc in a couple C programs, it is clear now what is happening. The key factors to be aware of are:
* What the values of M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are and how they are updated dynamically.
* Minor page faults are triggered by new allocations from the system, whether via mmap or via sbrk(). This means avoiding mmap is not enough to avoid potential problems with minor page faults.
<snip>
Thanks for going down the rabbit hole. It is now much clearer what's going on.
If we set M_MMAP_THRESHOLD to a value large enough that we never use
mmap, we get results like this (also noted by Francesc):
```
$ MALLOC_MMAP_THRESHOLD_=20000000 python timing_question.py
numpy version 1.20.3

 639.5976 microseconds
 641.1489 microseconds
 640.8228 microseconds

 385.0931 microseconds
 384.8406 microseconds
 384.0398 microseconds
```
In this case, setting the mmap threshold disables the dynamic updating of all the thresholds, so M_TRIM_THRESHOLD remains 128*1024. This means that each time free() is called, there will almost certainly be enough free memory at the end of the heap to exceed M_TRIM_THRESHOLD, so each iteration in timeit results in memory being allocated from the system and returned to the system via sbrk(), and that leads to minor page faults when writing to the temporary arrays in each iteration. That is, we still get thrashing at the top of the heap. (Apparently not all the calls to free() lead to trimming the heap, so the performance is better with the second z array.)
<snip>
This is still something that I don't get. I suppose this is very minor, but it would be nice to see the reason for this.
FWIW, after some more reflection on this, I think that a possible explanation for the above is that in both cases the temporaries are returned using mmap(), and hence both incur the zeroing overhead. If that is correct, this offers a new opportunity to better calculate where the overhead comes from.

Although I suspect that Warren's box and mine (AMD Ryzen 9 5950X 16-Core) are pretty similar, in order to compare apples with apples, I'm going to use my own measurements for this case:

```
$ MALLOC_MMAP_THRESHOLD_=20000000 python prova2.py
numpy version 1.21.2

 610.0853 microseconds
 610.7483 microseconds
 611.1118 microseconds

 369.2480 microseconds
 369.6340 microseconds
 369.8093 microseconds
```

So, in this case, the cost of a minor page fault is ~240 ns ((610 us - 369 us) / 1000 pages). Now, the times for the original bench on my box:

```
$ python prova2.py
numpy version 1.21.2

 598.2740 microseconds
 609.1857 microseconds
 609.0713 microseconds

 137.5792 microseconds
 136.7661 microseconds
 136.8159 microseconds
```

Here the temporaries for the second z do not use mmap nor incur page faults, so they are ((369 us - 137 us) / 1000 pages) = 233 ns/page faster; this should be related to the zeroing. IIRC, memset() on my system can reach about 28 GB/s, so that accounts for 146 ns of the 233 ns. That leaves an additional 87 ns/page which, as it is very close to the memory latency of my box (around 78 ns, see https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-...), I think (wild guessing again) is somehow related to the memory management unit (MMU) operation (but I am not totally sure).

I am pretty sure we are not yet at the point of understanding everything completely (e.g. I left CPU caches out of the analysis), but hey, going down these rabbit holes is a lot of fun indeed.

-- Francesc Alted
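The 146 ns share attributed to zeroing can be reproduced with a one-line check; this is just a sketch of the arithmetic, assuming 4 KiB pages and taking the quoted ~28 GB/s memset bandwidth at face value:

```
page_size = 4096                       # assumed 4 KiB pages
memset_bw = 28e9                       # quoted memset bandwidth, in bytes/s
zero_ns = page_size / memset_bw * 1e9  # ~146 ns to zero one page
print(zero_ns, 233 - zero_ns)          # ~146 ns for zeroing, ~87 ns left over
```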

On Mon, Jan 24, 2022 at 1:42 PM Francesc Alted <faltet@gmail.com> wrote:
On Mon, Jan 24, 2022 at 11:01 AM Francesc Alted <faltet@gmail.com> wrote:
On Mon, Jan 24, 2022 at 2:15 AM Warren Weckesser < warren.weckesser@gmail.com> wrote:
Thanks Sebastian for pointing out that the issue is (minor) page faults, and thanks Francesc for providing the links and the analysis of the situation. After reading those links and experimenting with malloc in a couple C programs, it is clear now what is happening. The key factors to be aware of are:
* What the values of M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are and how they are updated dynamically.
* Minor page faults are triggered by new allocations from the system, whether via mmap or via sbrk(). This means avoiding mmap is not enough to avoid potential problems with minor page faults.
<snip>
Thanks for going down the rabbit hole. It is now much clearer what's going on.
If we set M_MMAP_THRESHOLD to a value large enough that we never use
mmap, we get results like this (also noted by Francesc):
```
$ MALLOC_MMAP_THRESHOLD_=20000000 python timing_question.py
numpy version 1.20.3

 639.5976 microseconds
 641.1489 microseconds
 640.8228 microseconds

 385.0931 microseconds
 384.8406 microseconds
 384.0398 microseconds
```
In this case, setting the mmap threshold disables the dynamic updating of all the thresholds, so M_TRIM_THRESHOLD remains 128*1024. This means that each time free() is called, there will almost certainly be enough free memory at the end of the heap to exceed M_TRIM_THRESHOLD, so each iteration in timeit results in memory being allocated from the system and returned to the system via sbrk(), and that leads to minor page faults when writing to the temporary arrays in each iteration. That is, we still get thrashing at the top of the heap. (Apparently not all the calls to free() lead to trimming the heap, so the performance is better with the second z array.)
<snip>
This is still something that I don't get. I suppose this is very minor, but it would be nice to see the reason for this.
FWIW, after some more reflection on this, I think that a possible explanation for the above is that in both cases the temporaries are returned using mmap(), and hence both incur the zeroing overhead. If that is correct, this offers a new opportunity to better calculate where the overhead comes from.
Although I suspect that Warren's box and mine (AMD Ryzen 9 5950X 16-Core) are pretty similar, in order to compare apples with apples, I'm going to use my own measurements for this case:
$ MALLOC_MMAP_THRESHOLD_=20000000 python prova2.py numpy version 1.21.2
610.0853 microseconds 610.7483 microseconds 611.1118 microseconds
369.2480 microseconds 369.6340 microseconds 369.8093 microseconds
So, in this case, the cost of a minor page fault is ~240 ns ((610us - 369us) / 1000 pages). Now, the times for the original bench in my box:
$ python prova2.py numpy version 1.21.2
598.2740 microseconds 609.1857 microseconds 609.0713 microseconds
137.5792 microseconds 136.7661 microseconds 136.8159 microseconds
Here the temporaries for the second z do not use mmap nor incur page faults, so they are ((369 us - 137 us) / 1000 pages) = 233 ns/page faster; this should be related to the zeroing. IIRC, memset() on my system can reach about 28 GB/s, so that accounts for 146 ns of the 233 ns. That leaves an additional 87 ns/page which, as it is very close to the memory latency of my box (around 78 ns, see https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-...), I think (wild guessing again) is somehow related to the memory management unit (MMU) operation (but I am not totally sure).
For the record, while lying in bed last night I could not get rid of the feeling that my analysis above was wrong. Indeed, after reviewing it this morning, I see that yesterday I had the two figures reversed: the ~240 ns/page figure is actually for memset() (and possibly other work), whereas the ~233 ns/page figure is for the page fault instead. Fortunately the two figures are close enough that the outcome of the analysis remains essentially the same.

At any rate, I am guilty of being too naive in trying to understand (and measure) where the latencies come from, because memory allocation, and in particular virtual memory, are quite complex beasts that cannot be completely understood without spending half of your life on them (just for Linux, see e.g. https://rigtorp.se/virtual-memory/).

-- Francesc Alted
participants (9)
- Andras Deak
- Andrew Nelson
- Francesc Alted
- Jerome Kieffer
- Jonathan Fine
- Sebastian Berg
- Stanley Seibert
- Stefan van der Walt
- Warren Weckesser