On 20.08.2021 09:30, Chris Angelico wrote:
On Fri, Aug 20, 2021 at 5:22 PM
wrote: I simply tried to understand how processes transfer data between each other. I know they pickle. But how exactly? Which pickle protocol do they use by default? Do they decide the protocol depending on the type/kind/structure of the data? Do they compress the pickled data? E.g., I read a PEP about pickle protocol 5, which is relevant for large data like pandas.DataFrames.
pickle.DEFAULT_PROTOCOL would be my first guess :)
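You can check these constants directly in the interpreter; the exact values depend on your Python version:

```python
import pickle

# Module-level constants describe what this interpreter uses.
# On Python 3.8+ the default is protocol 4, while protocol 5
# (PEP 574, out-of-band buffers) is the highest available.
print("default:", pickle.DEFAULT_PROTOCOL)
print("highest:", pickle.HIGHEST_PROTOCOL)
```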
When you're curious about this sort of thing, I strongly recommend browsing the CPython source code. Sometimes you'll end up with a follow-up question, "is this a language guarantee?", but at the very least you'll know how the most-used Python implementation does things.
Don't be put off by the "C" in CPython; a lot of the standard library is implemented in Python, including the entire multiprocessing module:
https://github.com/python/cpython/tree/main/Lib/multiprocessing
A quick search for the word "pickle" shows this as a promising start:
https://github.com/python/cpython/blob/main/Lib/multiprocessing/reduction.py
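That file defines ForkingPickler, the Pickler subclass multiprocessing serializes with. A quick sketch of what it does (the sample data here is made up):

```python
import pickle
from multiprocessing.reduction import ForkingPickler

# ForkingPickler registers extra reducers (sockets, file handles, ...)
# on top of the regular Pickler, but it does not override the protocol,
# so pickle.DEFAULT_PROTOCOL applies.
data = {"rows": [1, 2, 3], "label": "example"}
buf = ForkingPickler.dumps(data)   # classmethod; returns a bytes-like buffer
restored = pickle.loads(buf)
assert restored == data
```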
Chris is pointing to the right resources. In Python 3.9, pickle writes protocol 4 by default, and the reduction mechanism in multiprocessing always uses that default: even though it subclasses the Pickler class, it never touches the protocol variable. See https://github.com/python/cpython/blob/3.9/Lib/pickle.py for details.

Aside: If you're dealing with data frames, there are a few alternative tools to consider apart from multiprocessing:

- Prefect: https://www.prefect.io/
- Dask: https://dask.org/
- MPI: https://mpi4py.readthedocs.io/en/stable/

If you have a GPU available, you can also try these frameworks:

- RAPIDS: https://rapids.ai/
- HeAT: https://heat.readthedocs.io/en/latest/

Those tools will do a lot more than multiprocessing and require extra effort to get up and running, but on the plus side, you don't have to worry about things like pickle protocols anymore :-)

If you want to explore the other direction and create an optimized multiprocessing library, replacing pickle with e.g. Arrow would give you some advantages:

- pyarrow: https://pypi.org/project/pyarrow/

Alternatively, don't pass data chunks around via in-process memory at all; instead, have your workers read them from a (RAM) disk after converting them to one of the more efficient formats for this, e.g.

- Parquet: https://github.com/dask/fastparquet

or place the data into shared memory using one of those formats. Reading Parquet files is much faster than reading CSV or pickle files.

--
Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Experts (#1, Aug 20 2021)
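The shared-memory idea mentioned above can be sketched with the standard-library multiprocessing.shared_memory module (Python 3.8+). The payload here is a made-up placeholder; in practice it would be an Arrow or Parquet buffer:

```python
from multiprocessing import shared_memory

# Hypothetical sketch: stage a serialized payload in shared memory so
# workers can attach to it by name instead of receiving a pickled copy
# over a pipe.
payload = b"pretend this is a Parquet or Arrow buffer"

shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# A worker process would attach to the same segment by name:
worker = shared_memory.SharedMemory(name=shm.name)
data = bytes(worker.buf[:len(payload)])
assert data == payload

worker.close()
shm.close()
shm.unlink()   # release the segment once all readers are done
```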