numexpr efficiency depends on the size of the computing kernel
Hi,

Now that I have my old AMD Duron machine at hand again, I ran some benchmarks intending to prove that numexpr's performance is not influenced by the size of the CPU cache, but I failed miserably (Tim was right: numexpr efficiency does depend on the CPU cache size).

Given that the PyTables instance of the numexpr computing kernel is quite a bit larger than the original (it supports more datatypes), comparing the performance of the two versions is a good way to check the influence of the CPU cache on computing efficiency. The attached benchmark is a small modification of the timing.py that comes with the numexpr package (the modification was needed to allow the PyTables version of numexpr to run all the cases). Basically, the expressions operate on arrays of one million elements, with a mix of contiguous and strided arrays (no unaligned arrays are involved here). See the benchmark code for the details.

The speed-ups of numexpr over plain numpy on an AMD Duron machine (64 + 64 KB L1 cache, 64 KB L2 cache) are:

For the original numexpr package: 2.14, 2.21, 2.21 (averages over 3 complete runs)
For the modified PyTables version (enlarged computing kernel): 1.32, 1.34, 1.37

So, on a CPU with a very small cache, the original numexpr kernel is about 1.6x faster than the PyTables one. However, on an AMD Opteron, which has a much bigger L2 cache (64 + 64 KB L1 cache, 1 MB L2 cache), the speed-ups are quite similar:

For the original numexpr package: 3.10, 3.35, 3.35
For the modified PyTables version (enlarged computing kernel): 3.37, 3.50, 3.45

So, there is indeed a dependency on the CPU cache size. It would be nice to run the benchmark on other CPUs with L2 caches in the range between 64 KB and 1 MB, so as to find the point where the performance becomes similar (that would be a good estimate of the size of the computing kernel). Meanwhile, the lesson learned is that Tim's worries were justified: one should be very careful about adding more opcodes (at least as long as CPUs with very small L2 caches are in use). Given this, we will perhaps have to reduce the opcodes in the numexpr version for PyTables to a bare minimum :-/

Cheers,

--
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works,
www.carabos.com   |  I haven't tested it.  -- Donald Knuth
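(For readers without the attachment, here is a minimal sketch of the kind of comparison described above. The expression, array layouts and repetition count are assumptions for illustration only, not the actual timing.py code.)

```python
# Rough sketch of a numexpr-vs-numpy timing run (illustrative only; the real
# benchmark is the modified timing.py attached to this message).
import timeit
import numpy as np
import numexpr as ne

N = 1000 * 1000
a = np.random.rand(2 * N)[::2]   # strided view: every other element
b = np.random.rand(N)            # contiguous array
c = np.random.rand(N)            # contiguous array

expr = "2*a + 3*b - c"           # hypothetical expression, just for illustration

t_np = timeit.timeit(lambda: 2*a + 3*b - c, number=10)
t_ne = timeit.timeit(
    lambda: ne.evaluate(expr, local_dict={"a": a, "b": b, "c": c}),
    number=10,
)
print("numpy  : %.3f s" % t_np)
print("numexpr: %.3f s (speed-up: %.2fx)" % (t_ne, t_np / t_ne))
```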
Just a quick follow-up on this issue: after a bit of investigation, I found that the difference in performance between the original numexpr and its PyTables counterpart (see the message below) was due *only* to the different compilation flags used (and not to overloading the CPU instruction cache). It turns out that the original numexpr always adds the '-O2 -funroll-all-loops' flags (GCC compiler), while I had compiled the PyTables instance with the Python default (-O3). After recompiling the latter with the same flags as the original numexpr, I get exactly the same results from either version of numexpr, even on a processor with a secondary cache as small as 64 KB (AMD Duron). In other words, the '-funroll-all-loops' flag seems to be *very* effective at optimizing the numexpr computing kernel, at least on CPUs with small caches. So, at least, this leads to the conclusion that numexpr's virtual machine is still far from getting overloaded, especially on today's processors with 512 KB of secondary cache or more.

Cheers,

On Wednesday 14 March 2007 at 22:05, Francesc Altet wrote:
--
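(As a minimal sketch of how such flags might be pinned when building an extension, assuming a setuptools-style setup.py; the module and source names below are placeholders, not the actual PyTables or numexpr build files.)

```python
# Hypothetical setup.py fragment: build the computing kernel with the same
# flags stock numexpr uses ('-O2 -funroll-all-loops') instead of relying on
# Python's default CFLAGS. Module and source names are placeholders.
from setuptools import setup, Extension

interpreter = Extension(
    "mypkg.numexpr.interpreter",                       # placeholder module path
    sources=["mypkg/numexpr/interpreter.c"],           # placeholder source file
    extra_compile_args=["-O2", "-funroll-all-loops"],  # GCC-specific flags
)

setup(name="mypkg-numexpr-kernel", ext_modules=[interpreter])
```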