Less is more? Smaller code and data to fit more into the CPU cache?
Hi

As you may have seen, AMD has recently announced CPUs that have much larger L3 caches. Does anyone know of any work that's been done to research or make critical Python code and data smaller, so that more of it fits in the CPU cache? I'm particularly interested in measured benefits.

This search https://www.google.com/search?q=python+performance+CPU+cache+size provides two relevant links

https://www.oreilly.com/library/view/high-performance-python/9781449361747/c...
https://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2016/slides/PyH...

but not much else I found relevant.

AnandTech writes about the chips with triple the L3 cache: https://www.anandtech.com/show/17323/amd-releases-milan-x-cpus-with-3d-vcach...

"As with other chips that incorporate larger caches, the greatest benefits are going to be found in workloads that spill out of contemporary-sized caches, but will neatly fit into the larger cache."

And also: https://www.anandtech.com/show/17313/ryzen-7-5800x3d-launches-april-20th-plu...

"As detailed by the company back at CES 2022 and reiterated in today's announcement, AMD has found that the chip is 15% faster at gaming than their Ryzen 9 5900X."

I already know that using Non-Uniform Memory Access (NUMA) raises the difficult problem of cache coherence.
https://en.wikipedia.org/wiki/Non-uniform_memory_access
https://en.wikipedia.org/wiki/Cache_coherence

-- Jonathan
On Wed, Mar 23, 2022 at 12:59 AM Jonathan Fine <jfine2358@gmail.com> wrote:
Does anyone know of any work that's been done to research or make critical Python code and data smaller so that more of it fits in the CPU cache? I'm particularly interested in measured benefits.
I reduced the size of the namespace dict in Python 3.11. This will increase cache efficiency. https://bugs.python.org/issue46845

And I deprecated the cached hash in bytes objects. It will be removed in Python 3.13 if there are no objections. Bytes objects are used for bytecode, so this will increase cache efficiency too.

Sadly, I cannot confirm the benefits. We have a macro benchmark suite (pyperformance), but it is still small. Most hot data fits into the L2 cache.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
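One quick way to observe such size changes is sys.getsizeof. A minimal sketch (the exact numbers vary by CPython version and build, so this only shows how to measure, not what to expect):

    import sys

    class C:
        def __init__(self):
            self.x = 1
            self.y = 2

    o = C()
    # Size of the instance itself, of its attribute dict, and of a
    # bytes object; compare the output across Python versions.
    print(sys.getsizeof(o))
    print(sys.getsizeof(o.__dict__))
    print(sys.getsizeof(b"example bytes"))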
Hi

Thank you Inada for your prompt and helpful reply. Here's a link for cached hash in bytes object: https://bugs.python.org/issue46864

What I have in mind is making selected objects smaller, for example by using smaller pointers. But how to know the performance benefit this will give?

I think it would be helpful to know how much SLOWER things are when we make Python objects say 8 or 16 bytes LARGER. This would give an estimate of the improvement from making all Python objects smaller.

I've not done much performance testing before. Anyone here interested in doing it, or helping me do it? (Warning - I've never built Python before.)

with best regards

Jonathan
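Without building CPython, one can approximate this experiment in pure Python by padding instances with extra slots (each slot adds one pointer-sized, i.e. 8-byte, field per instance). This is only a rough sketch of the idea - the real experiment would enlarge the object header in C, and micro-benchmarks like this are noisy:

    import timeit

    class Small:
        __slots__ = ("a",)
        def __init__(self):
            self.a = 1

    class Padded:
        # Two extra pointer-sized slots: each instance is 16 bytes larger.
        __slots__ = ("a", "_pad1", "_pad2")
        def __init__(self):
            self.a = 1
            self._pad1 = self._pad2 = 0

    def touch(objs):
        # Walk every object, pulling each one into the CPU cache.
        return sum(o.a for o in objs)

    for cls in (Small, Padded):
        objs = [cls() for _ in range(1_000_000)]
        best = min(timeit.repeat(lambda: touch(objs), number=10, repeat=5))
        print(cls.__name__, best)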
Look up Judy arrays - a specialized data structure for exactly this. It's got a simple API, but it's incredibly complex and architecture-specific under the hood.

People are always trying to optimize, but there are limits to how much you can do on generic data structures (and how much you can do in general), and even specialized data structures often only help in specialized workloads.
On 27 Mar 2022, at 18:16, Jonathan Fine <jfine2358@gmail.com> wrote:
Hi
Thank you Inada for your prompt and helpful reply. Here's a link for cached hash in bytes object: https://bugs.python.org/issue46864
What I have in mind is making selected objects smaller, for example by using smaller pointers. But how to know the performance benefit this will give?
That will limit Python to 2 GiB or maybe 4 GiB of memory - I routinely run beyond that size in production systems. There is a memory model that GCC supports that uses 32-bit pointers with 64-bit registers (the x32 ABI). I do not recall the performance comparisons, but it is not used very much.
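For reference, the pointer width of a given interpreter build can be checked from Python itself (a standard 64-bit build prints 8 here):

    import struct
    import sys

    # Size of a C pointer in the running interpreter, in bytes.
    print(struct.calcsize("P"))
    # sys.maxsize is 2**31 - 1 on 32-bit builds, 2**63 - 1 on 64-bit ones.
    print(sys.maxsize)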
I think it would be helpful to know how much SLOWER things are when we make Python objects say 8 or 16 bytes LARGER. This would give an estimate of the improvement from making all Python objects smaller.
I've not done much performance testing before. Anyone here interested in doing it, or helping me do it? (Warning - I've never built Python before.)
Performance testing is hard to get right. I do this as my day job for a big cloud app written mostly in Python. There is an excellent book on performance measurement by Brendan Gregg, "Systems Performance: Enterprise and the Cloud".

Barry
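For micro-benchmarks of the kind this thread needs, the pyperf package (the engine underneath the pyperformance suite mentioned earlier) takes care of spawning worker processes and calibrating the number of runs. A minimal sketch, assuming pyperf is installed; it must be run as a script because pyperf re-executes it in worker processes:

    import pyperf

    runner = pyperf.Runner()
    # Time a dict lookup; pyperf reports the mean and standard
    # deviation across multiple worker processes.
    runner.timeit(
        "dict lookup",
        stmt="d['key']",
        setup="d = {'key': 1}",
    )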
On 22 Mar 2022, at 15:57, Jonathan Fine <jfine2358@gmail.com> wrote:
Hi
As you may have seen, AMD has recently announced CPUs that have much larger L3 caches. Does anyone know of any work that's been done to research or make critical Python code and data smaller so that more of it fits in the CPU cache? I'm particularly interested in measured benefits.
A few years ago (5? 10?) there was a blog about making the Python eval loop fit into the L1 cache. The author gave up on the work, as he claimed it was too hard to contribute any changes to Python at the time. Sadly, I have not kept a link to the blog post.

What I recall is that the author found that GCC was producing far more code than was required to implement sections of ceval.c. His claim, as I recall, was that fixing that would shrink the ceval code by 50%. He had a PoC that showed the improvements.

There was also research on the opcodes and eliminating unnecessary code in their implementation, but I do not trust my memory of the details - it's been too long since I read the blog.

Barry
Barry Scott wrote on 27.03.22 at 22:23:
What I recall is that the author found that GCC was producing far more code than was required to implement sections of ceval.c. His claim, as I recall, was that fixing that would shrink the ceval code by 50%. He had a PoC that showed the improvements.
It might be worth trying out whether "gcc -Os" changes anything for ceval.c. It can also be enabled temporarily with a pragma (and MSVC has a similar option). We use it in Cython for the (run once) module init code to reduce the binary module size, but it might have an impact on cache usage as well.

Stefan
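Before experimenting, it can help to check which optimization flags the running interpreter was built with. sysconfig exposes the build configuration (these variables are populated on POSIX builds and may be None or empty on Windows):

    import sysconfig

    # Compiler flags used to build this interpreter; an -O2 or -Os
    # would show up in one of these.
    print(sysconfig.get_config_var("CFLAGS"))
    print(sysconfig.get_config_var("OPT"))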
participants (5)
- Barry Scott
- Inada Naoki
- John
- Jonathan Fine
- Stefan Behnel