On 20.08.2021 09:30, Chris Angelico wrote:
On Fri, Aug 20, 2021 at 5:22 PM
wrote: I simply tried to understand how processes transfer data between each other. I know they pickle. But how exactly? Which pickle protocol do they use by default? Do they decide the protocol depending on the type/kind/structure of the data? Do they compress the pickled data? E.g., I read a PEP about pickle protocol 5, which is relevant for large data like pandas.DataFrames.
pickle.DEFAULT_PROTOCOL would be my first guess :)
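You can check these constants directly in the interpreter; the exact values depend on your Python version:

```python
import pickle

# Module-level constants describe what this interpreter uses.
# On Python 3.8+ the default is protocol 4, while protocol 5
# (PEP 574, out-of-band buffers) is the highest available.
print("default:", pickle.DEFAULT_PROTOCOL)
print("highest:", pickle.HIGHEST_PROTOCOL)
```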
When you're curious about this sort of thing, I strongly recommend browsing the CPython source code. Sometimes you'll end up with a follow-up question, "is this a language guarantee?", but at the very least you'll know how the most-used Python implementation does things.
Don't be put off by the "C" in CPython; a lot of the standard library is implemented in Python, including the entire multiprocessing module:
https://github.com/python/cpython/tree/main/Lib/multiprocessing
A quick search for the word "pickle" shows this as a promising start:
https://github.com/python/cpython/blob/main/Lib/multiprocessing/reduction.py
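That file defines ForkingPickler, the Pickler subclass multiprocessing serializes with. A quick sketch of what it does (the sample data here is made up):

```python
import pickle
from multiprocessing.reduction import ForkingPickler

# ForkingPickler registers extra reducers (sockets, file handles, ...)
# on top of the regular Pickler, but it does not override the protocol,
# so pickle.DEFAULT_PROTOCOL applies.
data = {"rows": [1, 2, 3], "label": "example"}
buf = ForkingPickler.dumps(data)   # classmethod; returns a bytes-like buffer
restored = pickle.loads(buf)
assert restored == data
```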
Chris is pointing to the right resources. In Python 3.9, pickle writes protocol 4 by default, and the reduction mechanism in multiprocessing always uses that default: even though it subclasses the Pickler class, it never touches the protocol variable. See https://github.com/python/cpython/blob/3.9/Lib/pickle.py for details.

Aside: If you're dealing with data frames, there are a few alternative tools to consider apart from multiprocessing:

- Prefect: https://www.prefect.io/
- Dask: https://dask.org/
- MPI: https://mpi4py.readthedocs.io/en/stable/

If you have a GPU available, you can also try these frameworks:

- RAPIDS: https://rapids.ai/
- HeAT: https://heat.readthedocs.io/en/latest/

Those tools will do a lot more than multiprocessing and require extra effort to get up and running, but on the plus side, you don't have to worry about things like pickle protocols anymore :-)

If you want to explore the other direction and create an optimized multiprocessing library, replacing pickle with e.g. Arrow would give you some advantages:

- pyarrow: https://pypi.org/project/pyarrow/

Alternatively, don't pass data chunks around via in-process memory at all; instead, have your workers read them from a (RAM) disk after converting them to one of the more efficient formats for this, e.g.

- Parquet: https://github.com/dask/fastparquet

or place the data into shared memory using one of those formats. Reading Parquet files is much faster than reading CSV or pickle files.

--
Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Experts (#1, Aug 20 2021)
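The shared-memory idea mentioned above can be sketched with the standard-library multiprocessing.shared_memory module (Python 3.8+). The payload here is a made-up placeholder; in practice it would be an Arrow or Parquet buffer:

```python
from multiprocessing import shared_memory

# Hypothetical sketch: stage a serialized payload in shared memory so
# workers can attach to it by name instead of receiving a pickled copy
# over a pipe.
payload = b"pretend this is a Parquet or Arrow buffer"

shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# A worker process would attach to the same segment by name:
worker = shared_memory.SharedMemory(name=shm.name)
data = bytes(worker.buf[:len(payload)])
assert data == payload

worker.close()
shm.close()
shm.unlink()   # release the segment once all readers are done
```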