Hello,

before posting to python-dev I thought it best to discuss this here. And I assume that someone else has had the same idea before me. Maybe you can point me to the relevant discussion/ticket.

I read about Intel's hybrid CPUs. It means there are multiple cores, e.g. 8 high-speed cores and 8 low-speed (but more energy-efficient) cores, combined in one CPU.

In my use cases I parallelize with Python's multiprocessing package to work on millions of rows in pandas.DataFrame objects. These are tasks that are not vectorizable. I simply cut the DataFrame horizontally into pieces (numbered by the available cores).

But when the cores differ in their "speed" I need to know that. E.g. with a 16-core CPU where half of the cores are slow and every core has 1 million rows to work on, the 8 high-speed cores finish earlier and just wait until the slow cores are finished. It would be more efficient if the 8 high-speed cores each worked on 1.3 million rows and the slow cores each on 0.7 million rows. It is not perfect, but better. I know they will not all finish at the same point in time, but their end times will be closer together.

But to do this I need to know the type of the cores.

Am I wrong?

Are there any plans in Python development to take this into account?

Kind regards,
Christian
So, it is out of scope of Python's multiprocessing, and, as I perceive it, of the stdlib as a whole, to allocate specific cores for each subprocess - that is automatically done by the O.S. (And of course, since the O.S. has an interface for it, one could write a specific Python library which would allow this granularity, and it could even check core capabilities.)

As it stands, however, you simply have to change your approach: instead of dividing your workload across the cores before starting, the common approach is to set up worker processes, one per core or per processor thread, and use those as a pool of resources to which you submit your processing work in chunks. That way, if a worker happens to be on a faster core, it will be done with its chunk earlier and accept more work while the slower cores are still busy.

If you use "concurrent.futures" or a similar approach, this pattern will happen naturally with no specific fiddling needed on your part. (A rough sketch is at the bottom of this message, below the quote.)

On Wed, 18 Aug 2021 at 09:19, <c.buhtz@posteo.jp> wrote:
Hello,
before posting to python-dev I thought it best to discuss this here. And I assume that someone else has had the same idea before me. Maybe you can point me to the relevant discussion/ticket.
I read about Intel's hybrid CPUs. It means there are multiple cores, e.g. 8 high-speed cores and 8 low-speed (but more energy-efficient) cores, combined in one CPU.
In my use cases I parallelize with Python's multiprocessing package to work on millions of rows in pandas.DataFrame objects. These are tasks that are not vectorizable. I simply cut the DataFrame horizontally into pieces (numbered by the available cores).
But when the cores differ in their "speed" I need to know that. E.g. with a 16-core CPU where half of the cores are slow and every core has 1 million rows to work on, the 8 high-speed cores finish earlier and just wait until the slow cores are finished. It would be more efficient if the 8 high-speed cores each worked on 1.3 million rows and the slow cores each on 0.7 million rows. It is not perfect, but better. I know they will not all finish at the same point in time, but their end times will be closer together.
But to do this I need to know the type of the cores.
Am I wrong?
Are there any plans in the Python development taking this into account?
Kind regards,
Christian
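A rough, untested sketch of what I mean (process_rows here is just a stand-in for your real, non-vectorizable per-row function, and the chunk count is a knob to tune):

    from concurrent.futures import ProcessPoolExecutor

    import numpy as np
    import pandas as pd


    def process_rows(chunk):
        # stand-in for the real, non-vectorizable per-row work
        return chunk


    def process_in_chunks(df, n_chunks=64):
        # many more chunks than cores, so fast cores simply grab more of them
        chunks = np.array_split(df, n_chunks)
        with ProcessPoolExecutor() as pool:   # one worker per core by default
            results = list(pool.map(process_rows, chunks))
        return pd.concat(results)


    if __name__ == "__main__":
        df = pd.DataFrame({"x": range(1_000_000)})
        print(len(process_in_chunks(df)))

With concurrent.futures the results come back in submission order, so reassembling the DataFrame is just a concat.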
Dear Joao,

On 18.08.2021 14:36, Joao S. O. Bueno wrote:

As it stands, however, you simply have to change your approach: instead of dividing your workload across the cores before starting, the common approach is to set up worker processes, one per core or per processor thread, and use those as a pool of resources to which you submit your processing work in chunks. That way, if a worker happens to be on a faster core, it will be done with its chunk earlier and accept more work while the slower cores are still busy.

Thanks for your feedback and your idea about the alternative solution with the process pool. I think I will use this for future projects. Thanks a lot.

Kind regards,
Christian
On Wed, Aug 18, 2021 at 10:37 PM Joao S. O. Bueno <jsbueno@python.org.br> wrote:
So, it is out of scope of Python's multiprocessing, and, as I perceive it, of the stdlib as a whole, to allocate specific cores for each subprocess - that is automatically done by the O.S. (And of course, since the O.S. has an interface for it, one could write a specific Python library which would allow this granularity, and it could even check core capabilities.)
Python does have a way to set processor affinity, so it's entirely possible that this could be done. Might need external tools though.
As it stands, however, you simply have to change your approach: instead of dividing your workload across the cores before starting, the common approach is to set up worker processes, one per core or per processor thread, and use those as a pool of resources to which you submit your processing work in chunks. That way, if a worker happens to be on a faster core, it will be done with its chunk earlier and accept more work while the slower cores are still busy.
But I agree with this. Easiest to just subdivide further. ChrisA
On 18.08.2021 15:58, Chris Angelico wrote:
On Wed, Aug 18, 2021 at 10:37 PM Joao S. O. Bueno <jsbueno@python.org.br> wrote:
So, it is out of scope of Python's multiprocessing, and, as I perceive it, of the stdlib as a whole, to allocate specific cores for each subprocess - that is automatically done by the O.S. (And of course, since the O.S. has an interface for it, one could write a specific Python library which would allow this granularity, and it could even check core capabilities.)
Python does have a way to set processor affinity, so it's entirely possible that this could be done. Might need external tools though.
There's os.sched_setaffinity(pid, mask) you could use from within a Python task scheduler, if it is managing child processes (you need the right permissions to set the affinity). Or you could use the taskset command available on Linux to fire up a process on a specific CPU core. lscpu gives you more insight into the installed set of available cores. (A minimal sketch of the affinity calls is at the end of this message.)

multiprocessing itself does not have functionality to define the affinity upfront or to select which payload goes to which worker. I suppose you could implement a Pool subclass to handle such cases, though.

Changing the calculation model is probably better, as already suggested. Having smaller chunks of work makes it easier to even out the workload across workers in a cluster of different CPUs. You then don't have to worry about the details of the CPUs - you just need to play with the chunk size parameter a bit.

--
Marc-Andre Lemburg, eGenix.com
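For illustration, a minimal, untested and Linux-only sketch of the affinity calls (the core ids 0-7 are hypothetical "fast" cores - check lscpu for the real layout on your machine; taskset can do the same thing from the shell):

    import os
    from multiprocessing import Process


    def work():
        ...  # stand-in for the real per-chunk function


    if __name__ == "__main__":
        print(os.sched_getaffinity(0))   # cores this process may currently run on

        p = Process(target=work)
        p.start()
        # pin the child to the (hypothetical) fast cores; needs the right permissions
        os.sched_setaffinity(p.pid, {0, 1, 2, 3, 4, 5, 6, 7})
        p.join()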
On Thu, Aug 19, 2021 at 12:52 AM Marc-Andre Lemburg <mal@egenix.com> wrote:
On 18.08.2021 15:58, Chris Angelico wrote:
On Wed, Aug 18, 2021 at 10:37 PM Joao S. O. Bueno <jsbueno@python.org.br> wrote:
So, it is out of scope of Python's multiprocessing, and, as I perceive it, of the stdlib as a whole, to allocate specific cores for each subprocess - that is automatically done by the O.S. (And of course, since the O.S. has an interface for it, one could write a specific Python library which would allow this granularity, and it could even check core capabilities.)
Python does have a way to set processor affinity, so it's entirely possible that this could be done. Might need external tools though.
There's os.sched_setaffinity(pid, mask) you could use from within a Python task scheduler, if this is managing child processes (you need the right permissions to set the affinity).
Right; I meant that it might require external tools to find out which processors you want to align with.
Or you could use the taskset command available on Linux to fire up a process on a specific CPU core. lscpu gives you more insight into the installed set of available cores.
Yes, those sorts of external tools.

It MAY be possible to learn about processors by reading /proc/cpuinfo (a rough sketch is below), but that'd still be OS-specific (no idea which Unix-like operating systems have that, and certainly Windows doesn't).

All in all, far easier to just divide the job into far more pieces than you have processors, and then run a pool.

ChrisA
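Rough, untested, Linux-only sketch of what poking at /proc/cpuinfo could look like; whether any of its fields actually distinguish the fast cores from the slow ones depends on the kernel and the CPU:

    def cpuinfo_fields(field):
        # collect one field (e.g. "model name" or "cpu MHz") for every logical CPU
        values = []
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith(field):
                    values.append(line.split(":", 1)[1].strip())
        return values


    print(len(cpuinfo_fields("processor")), "logical CPUs")
    print(set(cpuinfo_fields("model name")))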
On 18 Aug 2021, at 16:03, Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Aug 19, 2021 at 12:52 AM Marc-Andre Lemburg <mal@egenix.com> wrote:
On 18.08.2021 15:58, Chris Angelico wrote:
On Wed, Aug 18, 2021 at 10:37 PM Joao S. O. Bueno <jsbueno@python.org.br> wrote:
So, it is out of scope of Python's multiprocessing, and, as I perceive it, of the stdlib as a whole, to allocate specific cores for each subprocess - that is automatically done by the O.S. (And of course, since the O.S. has an interface for it, one could write a specific Python library which would allow this granularity, and it could even check core capabilities.)
Python does have a way to set processor affinity, so it's entirely possible that this could be done. Might need external tools though.
There's os.sched_setaffinity(pid, mask) you could use from within a Python task scheduler, if this is managing child processes (you need the right permissions to set the affinity).
Right; I meant that it might require external tools to find out which processors you want to align with.
Or you could use the taskset command available on Linux to fire up a process on a specific CPU core. lscpu gives you more insight into the installed set of available cores.
Yes, those sorts of external tools.
It MAY be possible to learn about processors by reading /proc/cpuinfo, but that'd still be OS-specific (no idea which Unix-like operating systems have that, and certainly Windows doesn't).
And next you find out that you have to understand the NUMA details of your system because the memory attached to the CPUs is not the same speed.
All in all, far easier to just divide the job into far more pieces than you have processors, and then run a pool.
As others already stated, using a worker pool solves this problem for you. All you have to do is break your big job into suitably small pieces.

Barry
ChrisA
The worker pool approach is probably the way to go, but there is a fair bit of overhead to creating a multiprocessing job, so fewer, larger jobs are faster than many small jobs. You therefore want to make the jobs as large as you can without wasting CPU time.

-CHB

On Wed, Aug 18, 2021 at 9:09 AM Barry <barry@barrys-emacs.org> wrote:
On 18 Aug 2021, at 16:03, Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Aug 19, 2021 at 12:52 AM Marc-Andre Lemburg <mal@egenix.com> wrote:
On 18.08.2021 15:58, Chris Angelico wrote:
On Wed, Aug 18, 2021 at 10:37 PM Joao S. O. Bueno <jsbueno@python.org.br> wrote:
So, it is out of scope of Python's multiprocessing, and, as I perceive it, of the stdlib as a whole, to allocate specific cores for each subprocess - that is automatically done by the O.S. (And of course, since the O.S. has an interface for it, one could write a specific Python library which would allow this granularity, and it could even check core capabilities.)
Python does have a way to set processor affinity, so it's entirely possible that this could be done. Might need external tools though.
There's os.sched_setaffinity(pid, mask) you could use from within a Python task scheduler, if this is managing child processes (you need the right permissions to set the affinity).
Right; I meant that it might require external tools to find out which processors you want to align with.
Or you could use the taskset command available on Linux to fire up a process on a specific CPU core. lscpu gives you more insight into the installed set of available cores.
Yes, those sorts of external tools.
It MAY be possible to learn about processors by reading /proc/cpuinfo, but that'd still be OS-specific (no idea which Unix-like operating systems have that, and certainly Windows doesn't).
And next you find out that you have to understand the NUMA details of your system because the memory attached to the CPUs is not the same speed.
All in all, far easier to just divide the job into far more pieces than you have processors, and then run a pool.
As others already stated, using a worker pool solves this problem for you. All you have to do is break your big job into suitably small pieces.
Barry
ChrisA
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
Christopher Barker writes:
The worker pool approach is probably the way to go, but there is a fair bit of overhead to creating a multiprocessing job. So fewer, larger jobs are faster than many small jobs.
True, but processing those rows would have to be awfully fast for the increase in overhead from 16 chunks x 10^6 rows/chunk to 64 chunks x 250,000 rows/chunk to matter, and that would be plenty granular to give a good approximation to his nominal goal of 2 chunks per fast core : 1 chunk per slow core with a single-queue, multiple-workers approach. (Of course, it will almost certainly do a lot better, since 2 : 1 was itself a very rough approximation, but the single-queue approach adjusts to speed differences automatically.)

And if it's that fast, he could do it on a single core, and still be done by the time he's finished savoring a sip of coffee. ;-)

Steve
Would a work stealing approach work better for you here? Then the only signalling overhead would be when a core runs out of work.

On Thu, 19 Aug 2021, 05:36 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Christopher Barker writes:
The worker pool approach is probably the way to go, but there is a fair bit of overhead to creating a multiprocessing job. So fewer, larger jobs are faster than many small jobs.
True, but processing those rows would have to be awfully fast for the increase in overhead from 16 chunks x 10^6 rows/chunk to 64 chunks x 250,000 rows/chunk to matter, and that would be plenty granular to give a good approximation to his nominal goal of 2 chunks per fast core : 1 chunk per slow core with a single-queue, multiple-workers approach. (Of course, it will almost certainly do a lot better, since 2 : 1 was itself a very rough approximation, but the single-queue approach adjusts to speed differences automatically.)
And if it's that fast, he could do it on a single core, and still be done by the time he's finished savoring a sip of coffee. ;-)
Steve
Thomas Grainger writes:
Would a work stealing approach work better for you here? Then the only signalling overhead would be when a core runs out of work
Not sure what you're talking about with "work stealing". It sounds conceptually more complex than the queue + worker pool approach, which is already implemented in both the threading and multiprocessing modules.

The overhead of creating hundreds of multiprocessing tasks is going to be barely human-perceptible. The other "overhead" is the programmer effort in assembling the finished product (assuming order matters, or there are interdependencies between chunks that require keeping per-chunk state). But I don't see how such programmer effort would be much greater for the "many chunks in a queue" approach vs. the chunk-per-core approach. So it seems to me that multiprocessing with a worker pool is a low-programmer-effort, very-high-efficiency-gain approach to this problem.

The remaining question is "how many chunks?" If that's relevant, ISTM a few simple experiments will show where the sweet spot is. Try a queue of 64 chunks, then 128 chunks, and refine guesses from there. (A quick timing sketch is below.) I may be missing something, but that's the thinking that led to my previous post.

Steve
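Something like this untested sketch, where work() is just a stand-in for the real per-chunk function and the data is fake:

    import time
    from multiprocessing import Pool


    def work(chunk):
        return sum(x * x for x in chunk)   # stand-in for the real per-row work


    def run(n_chunks, data):
        size = max(1, len(data) // n_chunks)
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        with Pool() as pool:               # one worker per core by default
            return list(pool.imap_unordered(work, chunks))


    if __name__ == "__main__":
        data = list(range(2_000_000))
        for n_chunks in (16, 64, 128, 256):
            start = time.perf_counter()
            run(n_chunks, data)
            print(n_chunks, "chunks:", round(time.perf_counter() - start, 2), "s")

Whatever chunk count wins on the real workload is the one to keep.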
I have to say thank you very much for all your experience and hints. It helps me a lot just talking with real professionals! ;) On 2021-08-19 23:32 "Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
The remaining question is "how many chunks?" If that's relevant, ISTM a few simple experiments will show where the sweet spot is. Try a queue of 64 chunks, then 128 chunks, and refine guesses from there.
That brings me to another side question, which leads me to an unanswered bug report: https://bugs.python.org/issue44901 I reported this as a bug because I see it as a problem with the documentation.

I simply tried to understand how the processes transfer data between each other. I know they pickle. But how exactly? Which pickle protocol do they use by default? Do they decide the protocol depending on the type/kind/structure of the data? Do they compress the pickled data? E.g. I read a PEP about pickle version 5, which is relevant for large data like pandas.DataFrames.

Knowing more about this would help me to understand the resources needed for creating a process and transferring data to and from it. RAM itself is not a concern here; we assume I have enough RAM.
On Fri, Aug 20, 2021 at 5:22 PM <c.buhtz@posteo.jp> wrote:
I simply tried to understand how the processes transfer data between each other. I know they pickle. But how exactly? Which pickle protocol do they use by default? Do they decide the protocol depending on the type/kind/structure of the data? Do they compress the pickled data? E.g. I read a PEP about pickle version 5, which is relevant for large data like pandas.DataFrames.
pickle.DEFAULT_PROTOCOL would be my first guess :)

When you're curious about this sort of thing, I strongly recommend browsing the CPython source code. Sometimes you'll end up with a follow-up question "is this a language guarantee?", but at the very least, you'll know how the most-used Python implementation does things. Don't be put off by the "C" in CPython; a lot of the standard library is implemented in Python, including the entire multiprocessing module:
https://github.com/python/cpython/tree/main/Lib/multiprocessing

A quick search for the word "pickle" shows this as a promising start:
https://github.com/python/cpython/blob/main/Lib/multiprocessing/reduction.py
(A two-line check is at the end of this message.)

ChrisA
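A quick way to poke at it from the REPL (these are CPython implementation details, not language guarantees):

    import pickle
    import multiprocessing.reduction as reduction

    print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)   # e.g. 4 and 5 on 3.9
    print(reduction.ForkingPickler.__bases__)                 # built on the stdlib Pickler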
On 20.08.2021 09:30, Chris Angelico wrote:
On Fri, Aug 20, 2021 at 5:22 PM <c.buhtz@posteo.jp> wrote:
I simply tried to understand how the processes transfer data between each other. I know they pickle. But how exactly? Which pickle protocol do they use by default? Do they decide the protocol depending on the type/kind/structure of the data? Do they compress the pickled data? E.g. I read a PEP about pickle version 5, which is relevant for large data like pandas.DataFrames.
pickle.DEFAULT_PROTOCOL would be my first guess :)
When you're curious about this sort of thing, I strongly recommend browsing the CPython source code. Sometimes, you'll end up with a follow-up question "is this a language guarantee?", but at very least, you'll know how the most-used Python implementation does things.
Don't be put off by the "C" in CPython; a lot of the standard library is implemented in Python, including the entire multiprocessing module:
https://github.com/python/cpython/tree/main/Lib/multiprocessing
A quick search for the word "pickle" shows this as a promising start:
https://github.com/python/cpython/blob/main/Lib/multiprocessing/reduction.py
Chris is pointing to the right resources. In Python 3.9, pickle writes format 4.0 by default, and the reduction mechanism in multiprocessing always uses that default: even though it subclasses the Pickler class, the protocol variable is not touched. See https://github.com/python/cpython/blob/3.9/Lib/pickle.py for details.

Aside: If you're dealing with data frames, there are a few alternative tools to consider apart from multiprocessing:
- Prefect: https://www.prefect.io/
- Dask: https://dask.org/
- MPI: https://mpi4py.readthedocs.io/en/stable/

If you have a GPU available, you can also try these frameworks:
- RAPIDS: https://rapids.ai/
- HeAT: https://heat.readthedocs.io/en/latest/

Those tools will do a lot more than multiprocessing and require extra effort to get up and running, but on the plus side, you don't have to worry about things like pickling protocols anymore :-)

If you want to explore the other direction and create an optimized multiprocessing library, replacing pickle with e.g. Arrow would give you some advantages:
- pyarrow: https://pypi.org/project/pyarrow/

Alternatively, don't even pass the data chunks around between processes, but instead have your workers read them from a (RAM) disk by converting them to one of the more efficient formats for this, e.g.
- Parquet: https://github.com/dask/fastparquet

or place the data into shared memory using one of those formats. Reading Parquet files is much faster than reading CSV or pickle files. (A small shared-memory sketch is at the end of this message.)

--
Marc-Andre Lemburg, eGenix.com
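For the shared-memory route, a minimal untested sketch with a plain NumPy array (e.g. a DataFrame's underlying values); the worker attaches by name instead of receiving a pickled copy:

    import numpy as np
    from multiprocessing import Process, shared_memory


    def worker(name, shape, dtype):
        shm = shared_memory.SharedMemory(name=name)
        data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        print(data.sum())        # stand-in for the real (read-only) work
        shm.close()


    if __name__ == "__main__":
        src = np.arange(1_000_000, dtype=np.float64)
        shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
        view = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
        view[:] = src            # copy into shared memory once
        p = Process(target=worker, args=(shm.name, src.shape, src.dtype))
        p.start()
        p.join()
        shm.close()
        shm.unlink()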
participants (8)
- Barry
- c.buhtz@posteo.jp
- Chris Angelico
- Christopher Barker
- Joao S. O. Bueno
- Marc-Andre Lemburg
- Stephen J. Turnbull
- Thomas Grainger