PEP 574 -- Pickle protocol 5 with out-of-band data

Hi,

I'd like to submit this PEP for discussion. It is quite specialized and the main target audience of the proposed changes is users and authors of applications/libraries transferring large amounts of data (read: the scientific computing & data science ecosystems).

https://www.python.org/dev/peps/pep-0574/

The PEP text is also inlined below.

Regards

Antoine.


PEP: 574
Title: Pickle protocol 5 with out-of-band data
Version: $Revision$
Last-Modified: $Date$
Author: Antoine Pitrou <solipsis@pitrou.net>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 23-Mar-2018
Post-History:
Resolution:


Abstract
========

This PEP proposes to standardize a new pickle protocol version, and accompanying APIs to take full advantage of it:

1. A new pickle protocol version (5) to cover the extra metadata needed for out-of-band data buffers.
2. A new ``PickleBuffer`` type for ``__reduce_ex__`` implementations to return out-of-band data buffers.
3. A new ``buffer_callback`` parameter when pickling, to handle out-of-band data buffers.
4. A new ``buffers`` parameter when unpickling to provide out-of-band data buffers.

The PEP guarantees unchanged behaviour for anyone not using the new APIs.


Rationale
=========

The pickle protocol was originally designed in 1995 for on-disk persistency of arbitrary Python objects. The performance of a 1995-era storage medium probably made it irrelevant to focus on performance metrics such as use of RAM bandwidth when copying temporary data before writing it to disk.

Nowadays the pickle protocol sees a growing use in applications where most of the data isn't ever persisted to disk (or, when it is, it uses a portable format instead of Python-specific). Instead, pickle is being used to transmit data and commands from one process to another, either on the same machine or on multiple machines. Those applications will sometimes deal with very large data (such as Numpy arrays or Pandas dataframes) that need to be transferred around.
For those applications, pickle is currently wasteful as it imposes spurious memory copies of the data being serialized.

As a matter of fact, the standard ``multiprocessing`` module uses pickle for serialization, and therefore also suffers from this problem when sending large data to another process.

Third-party Python libraries, such as Dask [#dask]_, PyArrow [#pyarrow]_ and IPyParallel [#ipyparallel]_, have started implementing alternative serialization schemes with the explicit goal of avoiding copies on large data. Implementing a new serialization scheme is difficult and often leads to reduced generality (since many Python objects support pickle but not the new serialization scheme). Falling back on pickle for unsupported types is an option, but then you get back the spurious memory copies you wanted to avoid in the first place. For example, ``dask`` is able to avoid memory copies for Numpy arrays and built-in containers thereof (such as lists or dicts containing Numpy arrays), but if a large Numpy array is an attribute of a user-defined object, ``dask`` will serialize the user-defined object as a pickle stream, leading to memory copies.

The common theme of these third-party serialization efforts is to generate a stream of object metadata (which contains pickle-like information about the objects being serialized) and a separate stream of zero-copy buffer objects for the payloads of large objects. Note that, in this scheme, small objects such as ints, etc. can be dumped together with the metadata stream. Refinements can include opportunistic compression of large data depending on its type and layout, like ``dask`` does.

This PEP aims to make ``pickle`` usable in a way where large data is handled as a separate stream of zero-copy buffers, letting the application handle those buffers optimally.

Example
=======

To keep the example simple and avoid requiring knowledge of third-party libraries, we will focus here on a bytearray object (but the issue is conceptually the same with more sophisticated objects such as Numpy arrays). Like most objects, the bytearray object isn't immediately understood by the pickle module and must therefore specify its decomposition scheme.

Here is how a bytearray object currently decomposes for pickling::

    >>> b = bytearray(b'abc')
    >>> b.__reduce_ex__(4)
    (<class 'bytearray'>, (b'abc',), None)

This is because the ``bytearray.__reduce_ex__`` implementation reads morally as follows::

    class bytearray:

        def __reduce_ex__(self, protocol):
            if protocol == 4:
                # The second item is the tuple of constructor arguments
                return type(self), (bytes(self),), None
            # Legacy code for earlier protocols omitted

In turn it produces the following pickle code::

    >>> pickletools.dis(pickletools.optimize(pickle.dumps(b, protocol=4)))
        0: \x80 PROTO      4
        2: \x95 FRAME      30
       11: \x8c SHORT_BINUNICODE 'builtins'
       21: \x8c SHORT_BINUNICODE 'bytearray'
       32: \x93 STACK_GLOBAL
       33: C    SHORT_BINBYTES b'abc'
       38: \x85 TUPLE1
       39: R    REDUCE
       40: .    STOP

(the call to ``pickletools.optimize`` above is only meant to make the pickle stream more readable by removing the MEMOIZE opcodes)

We can notice several things about the bytearray's payload (the sequence of bytes ``b'abc'``):

* ``bytearray.__reduce_ex__`` produces a first copy by instantiating a new bytes object from the bytearray's data.
* ``pickle.dumps`` produces a second copy when inserting the contents of that bytes object into the pickle stream, after the SHORT_BINBYTES opcode.
* Furthermore, when deserializing the pickle stream, a temporary bytes object is created when the SHORT_BINBYTES opcode is encountered (inducing a data copy).

What we really want is something like the following:

* ``bytearray.__reduce_ex__`` produces a *view* of the bytearray's data.
* ``pickle.dumps`` doesn't try to copy that data into the pickle stream but instead passes the buffer view to its caller (which can decide on the most efficient handling of that buffer).
* When deserializing, ``pickle.loads`` takes the pickle stream and the buffer view separately, and passes the buffer view directly to the bytearray constructor.

We see that several conditions are required for the above to work:

* ``__reduce__`` or ``__reduce_ex__`` must be able to return *something* that indicates a serializable no-copy buffer view.
* The pickle protocol must be able to represent references to such buffer views, instructing the unpickler that it may have to get the actual buffer out of band.
* The ``pickle.Pickler`` API must provide its caller with a way to receive such buffer views while serializing.
* The ``pickle.Unpickler`` API must similarly allow its caller to provide the buffer views required for deserialization.
* For compatibility, the pickle protocol must also be able to contain direct serializations of such buffer views, such that current uses of the ``pickle`` API don't have to be modified if they are not concerned with memory copies.

Producer API
============

We are introducing a new type ``pickle.PickleBuffer`` which can be instantiated from any buffer-supporting object, and is specifically meant to be returned from ``__reduce__`` implementations::

    class bytearray:

        def __reduce_ex__(self, protocol):
            if protocol == 5:
                # The second item is the tuple of constructor arguments
                return type(self), (PickleBuffer(self),), None
            # Legacy code for earlier protocols omitted

``PickleBuffer`` is a simple wrapper that doesn't have all the memoryview semantics and functionality, but is specifically recognized by the ``pickle`` module if protocol 5 or higher is enabled. It is an error to try to serialize a ``PickleBuffer`` with pickle protocol version 4 or earlier.

Only the raw *data* of the ``PickleBuffer`` will be considered by the ``pickle`` module. Any type-specific *metadata* (such as shapes or datatype) must be returned separately by the type's ``__reduce__`` implementation, as is already the case.

PickleBuffer objects
--------------------

The ``PickleBuffer`` class supports a very simple Python API. Its constructor takes a single PEP 3118-compatible object [#pep-3118]_. ``PickleBuffer`` objects themselves support the buffer protocol, so consumers can call ``memoryview(...)`` on them to get additional information about the underlying buffer (such as the original type, shape, etc.).

On the C side, a simple API will be provided to create and inspect PickleBuffer objects:

``PyObject *PyPickleBuffer_FromObject(PyObject *obj)``

    Create a ``PickleBuffer`` object holding a view over the PEP 3118-compatible *obj*.

``PyPickleBuffer_Check(PyObject *obj)``

    Return whether *obj* is a ``PickleBuffer`` instance.

``const Py_buffer *PyPickleBuffer_GetBuffer(PyObject *picklebuf)``

    Return a pointer to the internal ``Py_buffer`` owned by the ``PickleBuffer`` instance.

``PickleBuffer`` can wrap any kind of buffer, including non-contiguous buffers.
It's up to consumers to decide how best to handle different kinds of buffers (for example, some consumers may find it acceptable to make a contiguous copy of non-contiguous buffers).

Consumer API
============

``pickle.Pickler.__init__`` and ``pickle.dumps`` are augmented with an additional ``buffer_callback`` parameter::

    class Pickler:
        def __init__(self, file, protocol=None, ..., buffer_callback=None):
            """
            If *buffer_callback* is not None, then it is called with a list
            of out-of-band buffer views when deemed necessary (this could be
            once every buffer, or only after a certain size is reached,
            or once at the end, depending on implementation details). The
            callback should arrange to store or transmit those buffers
            without changing their order.

            If *buffer_callback* is None (the default), buffer views are
            serialized into *file* as part of the pickle stream.

            It is an error if *buffer_callback* is not None and *protocol* is
            None or smaller than 5.
            """

    def pickle.dumps(obj, protocol=None, *, ..., buffer_callback=None):
        """
        See above for *buffer_callback*.
        """

``pickle.Unpickler.__init__`` and ``pickle.loads`` are augmented with an additional ``buffers`` parameter::

    class Unpickler:
        def __init__(self, file, *, ..., buffers=None):
            """
            If *buffers* is not None, it should be an iterable of
            buffer-enabled objects that is consumed each time the pickle
            stream references an out-of-band buffer view. Such buffers have
            been given in order to the *buffer_callback* of a Pickler object.

            If *buffers* is None (the default), then the buffers are taken
            from the pickle stream, assuming they are serialized there.
            It is an error for *buffers* to be None if the pickle stream
            was produced with a non-None *buffer_callback*.
            """

    def pickle.loads(data, *, ..., buffers=None):
        """
        See above for *buffers*.
        """

Protocol changes
================

Three new opcodes are introduced:

* ``BYTEARRAY`` creates a bytearray from the data following it in the pickle stream and pushes it on the stack (just like ``BINBYTES8`` does for bytes objects);
* ``NEXT_BUFFER`` fetches a buffer from the ``buffers`` iterable and pushes it on the stack;
* ``READONLY_BUFFER`` makes a readonly view of the top of the stack.

When pickling encounters a ``PickleBuffer``, there can be four cases:

* If a ``buffer_callback`` is given and the ``PickleBuffer`` is writable, the ``PickleBuffer`` is given to the callback and a ``NEXT_BUFFER`` opcode is appended to the pickle stream.
* If a ``buffer_callback`` is given and the ``PickleBuffer`` is readonly, the ``PickleBuffer`` is given to the callback and a ``NEXT_BUFFER`` opcode is appended to the pickle stream, followed by a ``READONLY_BUFFER`` opcode.
* If no ``buffer_callback`` is given and the ``PickleBuffer`` is writable, it is serialized into the pickle stream as if it were a ``bytearray`` object.
* If no ``buffer_callback`` is given and the ``PickleBuffer`` is readonly, it is serialized into the pickle stream as if it were a ``bytes`` object.

The distinction between readonly and writable buffers is explained below (see "Mutability").

Caveats
=======

Mutability
----------

PEP 3118 buffers [#pep-3118]_ can be readonly or writable. Some objects, such as Numpy arrays, need to be backed by a mutable buffer for full operation. Pickle consumers that use the ``buffer_callback`` and ``buffers`` arguments will have to be careful to recreate mutable buffers. When doing I/O, this implies using buffer-passing API variants such as ``readinto`` (which are also often preferable for performance).

Data sharing
------------

If you pickle and then unpickle an object in the same process, passing out-of-band buffer views, then the unpickled object may be backed by the same buffer as the original pickled object.
For example, it might be reasonable to implement reduction of a Numpy array as follows (crucial metadata such as shapes is omitted for simplicity)::

    class ndarray:

        def __reduce_ex__(self, protocol):
            if protocol == 5:
                return numpy.frombuffer, (PickleBuffer(self), self.dtype)
            # Legacy code for earlier protocols omitted

Then simply passing the PickleBuffer around from ``dumps`` to ``loads`` will produce a new Numpy array sharing the same underlying memory as the original Numpy object (and, incidentally, keeping it alive)::

    >>> import pickle
    >>> import numpy as np
    >>> a = np.zeros(10)
    >>> a[0]
    0.0
    >>> buffers = []
    >>> data = pickle.dumps(a, protocol=5, buffer_callback=buffers.extend)
    >>> b = pickle.loads(data, buffers=buffers)
    >>> b[0] = 42
    >>> a[0]
    42.0

This won't happen with the traditional ``pickle`` API (i.e. without passing ``buffers`` and ``buffer_callback`` parameters), because then the buffer view is serialized inside the pickle stream with a copy.

Alternatives
============

The ``pickle`` persistence interface is a way of storing references to designated objects in the pickle stream while handling their actual serialization out of band. For example, one might consider the following for zero-copy serialization of bytearrays::

    class MyPickle(pickle.Pickler):

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.buffers = []

        def persistent_id(self, obj):
            if type(obj) is not bytearray:
                return None
            else:
                index = len(self.buffers)
                self.buffers.append(obj)
                return ('bytearray', index)

    class MyUnpickle(pickle.Unpickler):

        def __init__(self, *args, buffers, **kwargs):
            super().__init__(*args, **kwargs)
            self.buffers = buffers

        def persistent_load(self, pid):
            type_tag, index = pid
            if type_tag == 'bytearray':
                return self.buffers[index]
            else:
                assert 0  # unexpected type

This mechanism has two drawbacks:

* Each ``pickle`` consumer must reimplement ``Pickler`` and ``Unpickler`` subclasses, with custom code for each type of interest. Essentially, N pickle consumers end up each implementing custom code for M producers. This is difficult (especially for sophisticated types such as Numpy arrays) and poorly scalable.
* Each object encountered by the pickle module (even simple built-in objects such as ints and strings) triggers a call to the user's ``persistent_id()`` method, leading to a possible performance drop compared to nominal.

Open questions
==============

Should ``buffer_callback`` take a single buffer or a sequence of buffers?

* Taking a single buffer would allow returning a boolean indicating whether the given buffer is serialized in-band or out-of-band.
* Taking a sequence of buffers is potentially more efficient by reducing function call overhead.
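For readers trying this today, the following is a minimal sketch using the API as it later shipped in the ``pickle`` module of Python 3.8, which differs from the draft text above in one respect: the callback is called with one ``PickleBuffer`` at a time rather than a list, and a false return value means "handle this buffer out-of-band". The variable names are illustrative only.

```python
import pickle
from pickle import PickleBuffer

payload = bytearray(b"abc" * 1000)

# In-band: without a callback the data is copied into the pickle stream.
inband = pickle.dumps(PickleBuffer(payload), protocol=5)
inband_copy = pickle.loads(inband)

# Out-of-band: the callback is invoked once per PickleBuffer. Returning a
# false value (list.append returns None) keeps the data out of the stream.
buffers = []
stream = pickle.dumps(PickleBuffer(payload), protocol=5,
                      buffer_callback=buffers.append)

# The consumer hands the same buffers back, in order, when loading.
restored = pickle.loads(stream, buffers=buffers)

# Writing through the restored buffer changes the original bytearray,
# while the object rebuilt from the in-band pickle is unaffected.
memoryview(restored)[0:3] = b"xyz"
print(len(inband), len(stream), bytes(payload[:3]), bytes(inband_copy[:3]))
```

The out-of-band stream only carries a reference to the buffer, so its size no longer depends on the payload; this is the same data-sharing behaviour described under "Data sharing" above.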

Related work
============

Dask.distributed implements a custom zero-copy serialization with fallback to pickle [#dask-serialization]_.

PyArrow implements zero-copy component-based serialization for a few selected types [#pyarrow-serialization]_.

PEP 554 proposes hosting multiple interpreters in a single process, with provisions for transferring buffers between interpreters as a communication scheme [#pep-554]_.

Acknowledgements
================

Thanks to the following people for early feedback: Nick Coghlan, Olivier Grisel, Stefan Krah, MinRK, Matt Rocklin, Eric Snow.

References
==========

.. [#dask] Dask.distributed -- A lightweight library for distributed computing in Python
   https://distributed.readthedocs.io/

.. [#dask-serialization] Dask.distributed custom serialization
   https://distributed.readthedocs.io/en/latest/serialization.html

.. [#ipyparallel] IPyParallel -- Using IPython for parallel computing
   https://ipyparallel.readthedocs.io/

.. [#pyarrow] PyArrow -- A cross-language development platform for in-memory data
   https://arrow.apache.org/docs/python/

.. [#pyarrow-serialization] PyArrow IPC and component-based serialization
   https://arrow.apache.org/docs/python/ipc.html#component-based-serialization

.. [#pep-3118] PEP 3118 -- Revising the buffer protocol
   https://www.python.org/dev/peps/pep-3118/

.. [#pep-554] PEP 554 -- Multiple Interpreters in the Stdlib
   https://www.python.org/dev/peps/pep-0554/

Copyright
=========

This document has been placed into the public domain.

28.03.18 21:39, Antoine Pitrou wrote:
I'd like to submit this PEP for discussion. It is quite specialized and the main target audience of the proposed changes is users and authors of applications/libraries transferring large amounts of data (read: the scientific computing & data science ecosystems).
Currently I'm working on porting some features from cloudpickle to the stdlib. For those that can't or shouldn't be implemented in a general-purpose library (like serializing local functions by serializing their code objects, because it is not portable), I want to add hooks that would allow implementing them in cloudpickle using an official API. This would allow cloudpickle to use the C implementation of the pickler and unpickler.
There is a private module, _compat_pickle, for supporting compatibility of moved stdlib classes with Python 2. I'm going to provide a public API that would allow third-party libraries to support compatibility for moved classes and functions. This could also help to support classes and functions moved in the stdlib after 3.0.
It is well known that pickle is unsafe: unpickling untrusted data can execute arbitrary code. It is less well known that unpickling can be made safe by controlling the resolution of global names in a custom Unpickler.find_class(). I want to provide helpers that would make it easy to implement safe unpickling by specifying just whitelists of globals and attributes.
This work is still not finished, but I think it is worth including in protocol 5 if some features need a protocol version bump.
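The find_class() restriction mentioned above can already be sketched with the existing API. This is only an illustration of the whitelist idea, not the helper proposed in the message; the names `SAFE_BUILTINS`, `RestrictedUnpickler` and `restricted_loads` are made up for the example, and a real whitelist needs careful review.

```python
import builtins
import io
import pickle

# Hypothetical whitelist: only these builtins may be resolved by name.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the pickle stream refers to.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name))

def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the whitelist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

print(restricted_loads(pickle.dumps([1, 2, range(15)])))
```

A stream that tries to resolve something like os.system is rejected in find_class() before anything is called.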

On Wed, 28 Mar 2018 23:03:08 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
28.03.18 21:39, Antoine Pitrou wrote:
I'd like to submit this PEP for discussion. It is quite specialized and the main target audience of the proposed changes is users and authors of applications/libraries transferring large amounts of data (read: the scientific computing & data science ecosystems).
Currently I'm working on porting some features from cloudpickle to the stdlib. For those that can't or shouldn't be implemented in a general-purpose library (like serializing local functions by serializing their code objects, because it is not portable), I want to add hooks that would allow implementing them in cloudpickle using an official API. This would allow cloudpickle to use the C implementation of the pickler and unpickler.
Yes, that's something that would benefit a lot of people. For the record, here are my notes on the topic: https://github.com/cloudpipe/cloudpickle/issues/58#issuecomment-339751408
It is well known that pickle is unsafe: unpickling untrusted data can execute arbitrary code. It is less well known that unpickling can be made safe by controlling the resolution of global names in a custom Unpickler.find_class(). I want to provide helpers that would make it easy to implement safe unpickling by specifying just whitelists of globals and attributes.
I'm not sure how safe that would be, because 1) there may be other attack vectors, and 2) it's difficult to predict which functions are entirely safe for calling. I think the best way to make pickles safe is to cryptographically sign them so that they cannot be forged by an attacker.
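As a rough sketch of the authentication idea, the snippet below prepends an HMAC tag to a pickle and checks it before unpickling. Note the assumption: HMAC is a shared-secret message authentication code rather than a public-key signature, and the helper names `sign_pickle` and `load_signed` are hypothetical.

```python
import hashlib
import hmac
import pickle

TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256

def sign_pickle(obj, key):
    """Return tag + pickle, where tag authenticates the pickle bytes."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload

def load_signed(blob, key):
    """Verify the tag first; only unpickle if it matches."""
    tag, payload = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad signature; refusing to unpickle")
    return pickle.loads(payload)

key = b"shared secret known to both peers"
blob = sign_pickle({"answer": 42}, key)
print(load_signed(blob, key))
```

The check happens before `pickle.loads` is ever called, so a forged or modified stream is never interpreted.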
This work is still not finished, but I think it is worth including in protocol 5 if some features need a protocol version bump.
Agreed. Do you know by which timeframe you'll know which opcodes you want to add? Regards Antoine.

28.03.18 23:19, Antoine Pitrou wrote:
Agreed. Do you know by which timeframe you'll know which opcodes you want to add?
I'm currently in the middle of the first part, trying to implement pickling of local classes with static and class methods without creating loops. The other parts exist only as general ideas; I haven't written code for them yet. I'm trying to do this with the existing protocols, but maybe some new opcodes will be needed for efficiency. We are at an early stage of 3.8 development, and I think we have a lot of time.
It wouldn't justify bumping the pickle version on its own, but if we are bumping it anyway, it would be worth adding shorter versions of FRAME. Currently it uses a 64-bit size, and 9 bytes is a large overhead for short pickles. An 8-bit size would reduce the overhead for short pickles, and a 32-bit size would be enough for any practical use (larger data is not wrapped in a frame).
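The FRAME overhead described above can be observed directly on a short protocol 4 pickle. This is a small sketch that assumes the layout of a simple pickle: a 2-byte PROTO header, followed by the 1-byte FRAME opcode and its 8-byte little-endian length.

```python
import pickle
import struct

data = pickle.dumps("hello", protocol=4)

proto_header = data[:2]                           # PROTO opcode + version
frame_opcode = data[2:3]                          # FRAME opcode
(frame_len,) = struct.unpack("<Q", data[3:11])    # 64-bit frame size
frame_header_size = 1 + 8                         # opcode + size field

print("total bytes:     ", len(data))
print("frame header:    ", frame_header_size)
print("framed payload:  ", frame_len)
```

For such a short pickle the 9-byte frame header is a sizeable share of the whole stream, which is the motivation for the shorter FRAME variants suggested in the message.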

Hi Serhiy, Do you have any bug / issue to track the work you want to do to add native pickling support for locally defined functions and classes by serializing the code objects like cloudpickle does? Is this work public on some git branch on GitHub or somewhere else? Cheers, -- Olivier

On Wed, Mar 28, 2018 at 1:03 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
28.03.18 21:39, Antoine Pitrou wrote:
I'd like to submit this PEP for discussion. It is quite specialized and the main target audience of the proposed changes is users and authors of applications/libraries transferring large amounts of data (read: the scientific computing & data science ecosystems).
Currently I'm working on porting some features from cloudpickle to the stdlib. For those that can't or shouldn't be implemented in a general-purpose library (like serializing local functions by serializing their code objects, because it is not portable), I want to add hooks that would allow implementing them in cloudpickle using an official API. This would allow cloudpickle to use the C implementation of the pickler and unpickler.
There's obviously some tension here between pickle's use as a persistent storage format, and its use as a transient wire format. For the former, you definitely can't store code objects because there's no forwards- or backwards-compatibility guarantee for bytecode. But for the latter, transmitting bytecode is totally fine, because all you care about is whether it can be decoded once, right now, by some peer process whose python version you can control -- that's why cloudpickle exists. Would it make sense to have a special pickle version that the transient wire format users could opt into, that only promises compatibility within a given 3.X release cycle? Like version=-2 or version=pickle.NONPORTABLE or something? (This is orthogonal to Antoine's PEP.) -n -- Nathaniel J. Smith -- https://vorpus.org
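One way to picture the opt-in, non-portable mode floated above is a thin wrapper that tags each payload with the interpreter's bytecode magic number and refuses to load anything produced under a different one. This is purely a sketch: `dumps_nonportable` and `loads_nonportable` are hypothetical names, and nothing like `pickle.NONPORTABLE` exists in the stdlib.

```python
import importlib.util
import pickle

# Changes whenever CPython's bytecode format changes.
MAGIC = importlib.util.MAGIC_NUMBER

def dumps_nonportable(obj):
    """Pickle obj and prefix it with this interpreter's magic number."""
    return MAGIC + pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

def loads_nonportable(data):
    """Only accept payloads written under the same bytecode version."""
    if data[:len(MAGIC)] != MAGIC:
        raise pickle.UnpicklingError(
            "payload was produced under a different bytecode version")
    return pickle.loads(data[len(MAGIC):])

blob = dumps_nonportable({"task": "square", "arg": 7})
print(loads_nonportable(blob))
```

A mismatch then surfaces as an ordinary exception rather than as undefined behaviour when incompatible bytecode is executed.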

On 3/28/2018 9:15 PM, Nathaniel Smith wrote:
There's obviously some tension here between pickle's use as a persistent storage format, and its use as a transient wire format. For the former, you definitely can't store code objects because there's no forwards- or backwards-compatibility guarantee for bytecode. But for the latter, transmitting bytecode is totally fine, because all you care about is whether it can be decoded once, right now, by some peer process whose python version you can control -- that's why cloudpickle exists.
An interesting observation. IDLE compiles user code in the user process to check for syntax errors. idlelib.rpc subclasses Pickler to pickle the resulting code objects via marshal.dumps so it can send them to the user code execution subprocess.
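The approach described here can be sketched roughly as follows. It is modelled on the idea of a Pickler subclass with a custom dispatch table, not copied from idlelib; the names `reduce_code`, `CodePickler` and `dumps_with_code` are illustrative.

```python
import copyreg
import io
import marshal
import pickle
import types

def reduce_code(code):
    # Serialize the code object with marshal; marshal.loads rebuilds it.
    return marshal.loads, (marshal.dumps(code),)

class CodePickler(pickle.Pickler):
    # Per-class dispatch table: teaches this pickler about code objects.
    dispatch_table = {**copyreg.dispatch_table, types.CodeType: reduce_code}

def dumps_with_code(obj, protocol=None):
    buf = io.BytesIO()
    CodePickler(buf, protocol).dump(obj)
    return buf.getvalue()

code = compile("result = 6 * 7", "<user code>", "exec")
restored = pickle.loads(dumps_with_code(code))

namespace = {}
exec(restored, namespace)
print(namespace["result"])
```

Because marshal's format is tied to the interpreter version, this only makes sense when both ends run the same Python, which is the situation discussed in this thread.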
Would it make sense to have a special pickle version that the transient wire format users could opt into, that only promises compatibility within a given 3.X release cycle? Like version=-2 or version=pickle.NONPORTABLE or something?
(This is orthogonal to Antoine's PEP.)
-- Terry Jan Reedy

On Wed, Mar 28, 2018 at 6:15 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Mar 28, 2018 at 1:03 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
28.03.18 21:39, Antoine Pitrou wrote:
I'd like to submit this PEP for discussion. It is quite specialized and the main target audience of the proposed changes is users and authors of applications/libraries transferring large amounts of data (read: the scientific computing & data science ecosystems).
Currently I'm working on porting some features from cloudpickle to the stdlib. For those that can't or shouldn't be implemented in a general-purpose library (like serializing local functions by serializing their code objects, because it is not portable), I want to add hooks that would allow implementing them in cloudpickle using an official API. This would allow cloudpickle to use the C implementation of the pickler and unpickler.
There's obviously some tension here between pickle's use as a persistent storage format, and its use as a transient wire format. For the former, you definitely can't store code objects because there's no forwards- or backwards-compatibility guarantee for bytecode. But for the latter, transmitting bytecode is totally fine, because all you care about is whether it can be decoded once, right now, by some peer process whose python version you can control -- that's why cloudpickle exists.
Is it really true you'll always be able to control the Python version on the other side? Even if they're internal services, it seems like there could be times / reasons preventing you from upgrading the environment of all of your services at the same rate. Or did you mean to say "often" all you care about ...? --Chris
Would it make sense to have a special pickle version that the transient wire format users could opt into, that only promises compatibility within a given 3.X release cycle? Like version=-2 or version=pickle.NONPORTABLE or something?
(This is orthogonal to Antoine's PEP.)
-n
-- Nathaniel J. Smith -- https://vorpus.org

On Thu, Mar 29, 2018 at 12:56 AM, Chris Jerdonek <chris.jerdonek@gmail.com> wrote:
On Wed, Mar 28, 2018 at 6:15 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Mar 28, 2018 at 1:03 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
28.03.18 21:39, Antoine Pitrou wrote:
I'd like to submit this PEP for discussion. It is quite specialized and the main target audience of the proposed changes is users and authors of applications/libraries transferring large amounts of data (read: the scientific computing & data science ecosystems).
Currently I'm working on porting some features from cloudpickle to the stdlib. For those that can't or shouldn't be implemented in a general-purpose library (like serializing local functions by serializing their code objects, because it is not portable), I want to add hooks that would allow implementing them in cloudpickle using an official API. This would allow cloudpickle to use the C implementation of the pickler and unpickler.
There's obviously some tension here between pickle's use as a persistent storage format, and its use as a transient wire format. For the former, you definitely can't store code objects because there's no forwards- or backwards-compatibility guarantee for bytecode. But for the latter, transmitting bytecode is totally fine, because all you care about is whether it can be decoded once, right now, by some peer process whose python version you can control -- that's why cloudpickle exists.
Is it really true you'll always be able to control the Python version on the other side? Even if they're internal services, it seems like there could be times / reasons preventing you from upgrading the environment of all of your services at the same rate. Or did you mean to say "often" all you care about ...?
Yeah, maybe I spoke a little sloppily -- I'm sure there are cases where you're using pickle as a wire format between heterogeneous interpreters, in which case you wouldn't use version=NONPORTABLE. But projects like dask, and everyone else who uses cloudpickle/dill, are already assuming homogeneous interpreters. A typical way of using these kinds of systems is: you start your script, it spins up some cloud VMs or local cluster nodes (maybe sending them all a conda environment you made), they all chat for a while doing your computation, and then they spin down again and your script reports the results. So versioning and coordinated upgrades really aren't a thing you need to worry about :-). Another example is the multiprocessing module: it's very safe to assume that the parent and the child are using the same interpreter :-). There's no fundamental reason you shouldn't be able to send bytecode between them. Pickle's not really the ideal wire format for persistent services anyway, given the arbitrary code execution and tricky versioning -- even if you aren't playing games with bytecode, pickle still assumes that if two classes in two different interpreters have the same name, then their internal implementation details are all the same. You can make it work, but usually there are better options. It's perfect though for multi-core and multi-machine parallelism. -n -- Nathaniel J. Smith -- https://vorpus.org

On Thu, Mar 29, 2018 at 7:18 PM, Nathaniel Smith <njs@pobox.com> wrote:
Another example is the multiprocessing module: it's very safe to assume that the parent and the child are using the same interpreter :-). There's no fundamental reason you shouldn't be able to send bytecode between them.
You put a smiley on it, but is this actually guaranteed on all platforms? On Unix-like systems, presumably it's using fork() and thus will actually use the exact same binary, but what about on Windows, where a new process has to be spawned? Can you say "spawn me another of this exact binary blob", or do you have to identify it by a file name? It wouldn't be a problem for the nonportable mode to toss out an exception in weird cases like this, but it _would_ be a problem if that causes a segfault or something. ChrisA

On 29 March 2018 at 09:49, Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Mar 29, 2018 at 7:18 PM, Nathaniel Smith <njs@pobox.com> wrote:
Another example is the multiprocessing module: it's very safe to assume that the parent and the child are using the same interpreter :-). There's no fundamental reason you shouldn't be able to send bytecode between them.
You put a smiley on it, but is this actually guaranteed on all platforms? On Unix-like systems, presumably it's using fork() and thus will actually use the exact same binary, but what about on Windows, where a new process has to be spawned? Can you say "spawn me another of this exact binary blob", or do you have to identify it by a file name?
It wouldn't be a problem for the nonportable mode to toss out an exception in weird cases like this, but it _would_ be a problem if that causes a segfault or something.
If you're embedding, you need multiprocessing.set_executable() (https://docs.python.org/3.6/library/multiprocessing.html#multiprocessing.set...), so in that case you definitely *won't* have the same binary... Paul

On Thu, Mar 29, 2018 at 7:56 PM, Paul Moore <p.f.moore@gmail.com> wrote:
On 29 March 2018 at 09:49, Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Mar 29, 2018 at 7:18 PM, Nathaniel Smith <njs@pobox.com> wrote:
Another example is the multiprocessing module: it's very safe to assume that the parent and the child are using the same interpreter :-). There's no fundamental reason you shouldn't be able to send bytecode between them.
You put a smiley on it, but is this actually guaranteed on all platforms? On Unix-like systems, presumably it's using fork() and thus will actually use the exact same binary, but what about on Windows, where a new process has to be spawned? Can you say "spawn me another of this exact binary blob", or do you have to identify it by a file name?
It wouldn't be a problem for the nonportable mode to toss out an exception in weird cases like this, but it _would_ be a problem if that causes a segfault or something.
If you're embedding, you need multiprocessing.set_executable() (https://docs.python.org/3.6/library/multiprocessing.html#multiprocessing.set...), so in that case you definitely *won't* have the same binary...
Ah, and that also showed me that forking isn't mandatory on Unix either. So yeah, there's no assuming that they use the same binary. I doubt it'll be a problem to pickle though as it'll use some form of versioning even in NONPORTABLE mode right? ChrisA

On Thu, Mar 29, 2018, 02:02 Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Mar 29, 2018 at 7:56 PM, Paul Moore <p.f.moore@gmail.com> wrote:
On 29 March 2018 at 09:49, Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Mar 29, 2018 at 7:18 PM, Nathaniel Smith <njs@pobox.com> wrote:
Another example is the multiprocessing module: it's very safe to assume that the parent and the child are using the same interpreter :-). There's no fundamental reason you shouldn't be able to send bytecode between them.
You put a smiley on it, but is this actually guaranteed on all platforms? On Unix-like systems, presumably it's using fork() and thus will actually use the exact same binary, but what about on Windows, where a new process has to be spawned? Can you say "spawn me another of this exact binary blob", or do you have to identify it by a file name?
It wouldn't be a problem for the nonportable mode to toss out an exception in weird cases like this, but it _would_ be a problem if that causes a segfault or something.
If you're embedding, you need multiprocessing.set_executable() ( https://docs.python.org/3.6/library/multiprocessing.html#multiprocessing.set... ), so in that case you definitely *won't* have the same binary...
Ah, and that also showed me that forking isn't mandatory on Unix either. So yeah, there's no assuming that they use the same binary.
Normally it spawns children using `sys.executable`, which I think on Windows in particular is guaranteed to be the same binary that started the main process, because the OS locks the file while it's executing. But yeah, I didn't think about the embedding case, and apparently there's also a little-known set of features for using multiprocessing between arbitrary python processes: https://docs.python.org/3/library/multiprocessing.html#multiprocessing-liste...
I doubt it'll be a problem to pickle though as it'll use some form of versioning even in NONPORTABLE mode right?
I guess the (merged, but undocumented?) changes in https://bugs.python.org/issue28053 should make it possible to set the pickle version, and yeah, if we did add a NONPORTABLE mode then presumably it would have some kind of header saying which version of python it was created with, so version mismatches could give a sensible error message. -n
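As a rough illustration of the header idea, the sketch below prefixes marshalled bytecode with the interpreter's bytecode magic number and refuses payloads whose header does not match. This is not how pickle works today and no NONPORTABLE mode exists; `dump_code` and `load_code` are hypothetical names, and using `importlib.util.MAGIC_NUMBER` as the version marker is an assumption made for the example.

```python
import importlib.util
import marshal
import types

# Magic number that changes whenever the bytecode format changes.
MAGIC = importlib.util.MAGIC_NUMBER


def dump_code(func):
    """Serialize a function's bytecode, prefixed with the magic number."""
    return MAGIC + marshal.dumps(func.__code__)


def load_code(payload):
    """Rebuild a function, rejecting payloads from an incompatible interpreter."""
    header, body = payload[:len(MAGIC)], payload[len(MAGIC):]
    if header != MAGIC:
        raise ValueError(
            f"bytecode header {header!r} does not match this interpreter ({MAGIC!r})"
        )
    # The rebuilt function gets an empty globals dict (hypothetical choice).
    return types.FunctionType(marshal.loads(body), {})


payload = dump_code(lambda x: x + 1)
restored = load_code(payload)
print(restored(41))
```

With such a header, a version mismatch surfaces as a regular exception instead of the interpreter trying to run bytecode it does not understand.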

On Thu, 29 Mar 2018 11:25:13 +0000 Nathaniel Smith <njs@pobox.com> wrote:
I doubt it'll be a problem to pickle though as it'll use some form of versioning even in NONPORTABLE mode right?
I guess the (merged, but undocumented?) changes in https://bugs.python.org/issue28053 should make it possible to set the pickle version [...]
Not only undocumented, but untested, and they actually look plain wrong when looking at that diff. Notice how "reduction" is imported using `from .context import reduction` and then changed inside the "context" module using `globals()['reduction'] = reduction`. That seems unlikely to produce any effect. (not to mention the abstract base class that doesn't seem to define any abstract methods or properties) To be frank, such an unfinished patch should never have been committed. I may consider undoing it if I find some spare cycles. Regards Antoine.
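The name-binding problem can be reproduced with a small stand-alone sketch. The two module objects below are simplified stand-ins built with `types.ModuleType`, not the actual multiprocessing code, and the string values are placeholders.

```python
import types

# Stand-in for the "context" module, with a default "reduction" object.
context = types.ModuleType("context")
context.reduction = "default reduction"

# Stand-in for a module that ran "from .context import reduction":
# the import copies the current binding into the importer's namespace.
consumer = types.ModuleType("consumer")
consumer.reduction = context.reduction

# Equivalent of globals()['reduction'] = ... executed inside "context".
vars(context)["reduction"] = "custom reduction"

print(context.reduction)   # the rebound name inside "context"
print(consumer.reduction)  # the importer still holds the old object
```

Rebinding the global in one module leaves every earlier `from ... import` copy pointing at the original object.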

Hi All, As part of our scikit-learn development and our effort to provide better parallelism for python, we rely heavily on dynamic classes and functions pickling. For this usage we use cloudpickle, but it suffers from performance issues due to its pure python implementation. After long discussions with Olivier Grisel and Thomas Moreau, we ended up agreeing on the fact that the best solution for this problem would be to add those functionalities to the _pickle.c module. I am already quite familiar with the C/Python API, and can dedicate a lot of my time in the next couple months to make this happen. Serhiy, from this thread (https://mail.python.org/pipermail/python-dev/2018-March/152509.html), it seems that you already started implementing local classes pickling. I would be happy to use this work as a starting point and build from it. What do you think? Regards, Pierre
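For context, a minimal sketch of the limitation that motivates cloudpickle: the standard pickler serializes functions by reference, so a function created inside another function cannot be pickled. The helper names `make_adder` and `try_pickle` are only illustrative.

```python
import pickle


def make_adder(n):
    """Return a dynamically created function, as parallel code often does."""
    def add(x):
        return x + n
    return add


def try_pickle(obj):
    """Return None on success, or the error raised by the standard pickler."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError) as exc:
        # The exact exception type depends on the Python version.
        return exc
    return None


error = try_pickle(make_adder(1))
print(type(error).__name__, error)
```

cloudpickle works around this by serializing such objects by value, which is the part that is slow in pure Python.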


One question.. On Thu., 29 Mar. 2018, 07:42 Antoine Pitrou, <solipsis@pitrou.net> wrote:
...
=======
Mutability
----------
PEP 3118 buffers [#pep-3118]_ can be readonly or writable. Some objects, such as Numpy arrays, need to be backed by a mutable buffer for full operation. Pickle consumers that use the ``buffer_callback`` and ``buffers`` arguments will have to be careful to recreate mutable buffers. When doing I/O, this implies using buffer-passing API variants such as ``readinto`` (which are also often preferable for performance).
Data sharing
------------
If you pickle and then unpickle an object in the same process, passing out-of-band buffer views, then the unpickled object may be backed by the same buffer as the original pickled object.
For example, it might be reasonable to implement reduction of a Numpy array as follows (crucial metadata such as shapes is omitted for simplicity)::
    class ndarray:

        def __reduce_ex__(self, protocol):
            if protocol == 5:
                return numpy.frombuffer, (PickleBuffer(self), self.dtype)
            # Legacy code for earlier protocols omitted
Then simply passing the PickleBuffer around from ``dumps`` to ``loads`` will produce a new Numpy array sharing the same underlying memory as the original Numpy object (and, incidentally, keeping it alive)::
This seems incompatible with v4 semantics. There, a loads plus dumps combination is approximately a deep copy. This isn't. Sometimes. Sometimes it is. Other than that way, I like it. Rob

On Thu, 29 Mar 2018 01:40:17 +0000 Robert Collins <robertc@robertcollins.net> wrote:
Data sharing
------------
If you pickle and then unpickle an object in the same process, passing out-of-band buffer views, then the unpickled object may be backed by the same buffer as the original pickled object.
For example, it might be reasonable to implement reduction of a Numpy array as follows (crucial metadata such as shapes is omitted for simplicity)::
    class ndarray:

        def __reduce_ex__(self, protocol):
            if protocol == 5:
                return numpy.frombuffer, (PickleBuffer(self), self.dtype)
            # Legacy code for earlier protocols omitted
Then simply passing the PickleBuffer around from ``dumps`` to ``loads`` will produce a new Numpy array sharing the same underlying memory as the original Numpy object (and, incidentally, keeping it alive)::
This seems incompatible with v4 semantics. There, a loads plus dumps combination is approximately a deep copy. This isn't. Sometimes. Sometimes it is.
True. But it's only incompatible if you pass the new ``buffer_callback`` and ``buffers`` arguments. If you don't, then you always get a copy. This is something that consumers should keep in mind. Note there's a movement towards immutable data. For example, Dask arrays and Arrow arrays are designed as immutable. Regards Antoine.
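The difference can be seen in a small sketch. It assumes the pickle API as found in Python 3.8 and later (`pickle.PickleBuffer`, `buffer_callback`, `buffers`), which may differ in detail from the draft discussed here, and it uses a plain bytearray rather than a Numpy array to stay self-contained.

```python
import pickle

original = bytearray(b"abc")

# Without buffer_callback the buffer is serialized in-band, so the
# round trip yields an independent copy, as with earlier protocols.
copied = pickle.loads(pickle.dumps(pickle.PickleBuffer(original), protocol=5))
copied[0] = ord("X")
unchanged_after_copy = bytes(original)

# With buffer_callback the buffer travels out-of-band; passing the
# collected buffers back to loads() keeps the same underlying memory.
buffers = []
payload = pickle.dumps(
    pickle.PickleBuffer(original), protocol=5, buffer_callback=buffers.append
)
restored = pickle.loads(payload, buffers=buffers)
memoryview(restored)[0] = ord("Z")

print(unchanged_after_copy, bytes(original))
```

Writing through the restored object changes the original bytearray only in the out-of-band case, which is the behaviour consumers need to keep in mind.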

On 29 March 2018 at 04:39, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hi,
I'd like to submit this PEP for discussion. It is quite specialized and the main target audience of the proposed changes is users and authors of applications/libraries transferring large amounts of data (read: the scientific computing & data science ecosystems).
https://www.python.org/dev/peps/pep-0574/
The PEP text is also inlined below.
+1 from me, which you already knew :) For folks that haven't read Eric Snow's PEP 554 about exposing multiple interpreter support as a Python level API, Antoine's proposed zero-copy-data-management enhancements for pickle complement that nicely, since they allow the three initial communication primitives in PEP 554 (passing None, bytes, memory views) to be more efficiently expanded to handling arbitrary objects by sending first the pickle data, then the out-of-band memory views, and finally None as an end-of-message marker. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
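A rough sketch of that message framing follows, with a `queue.Queue` standing in for a PEP 554 channel. The helper names `send_object` and `recv_object` are hypothetical, and the pickle calls assume the API as found in Python 3.8 and later.

```python
import pickle
import queue


def send_object(channel, obj):
    """Send the pickle payload, then each out-of-band buffer, then None."""
    buffers = []
    payload = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)
    channel.put(payload)
    for buf in buffers:
        # raw() exposes the buffer as a flat memoryview of bytes.
        channel.put(buf.raw())
    channel.put(None)  # end-of-message marker


def recv_object(channel):
    """Read one framed message and rebuild the object."""
    payload = channel.get()
    buffers = []
    while True:
        item = channel.get()
        if item is None:
            break
        buffers.append(item)
    return pickle.loads(payload, buffers=buffers)


channel = queue.Queue()
message = {"name": "x", "data": pickle.PickleBuffer(bytearray(b"hello"))}
send_object(channel, message)
received = recv_object(channel)
print(received["name"], bytes(received["data"]))
```

Only the three primitive kinds of item (bytes, memory views, None) ever cross the channel, while the receiver still gets back an arbitrary object.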
participants (11): Antoine Pitrou, Chris Angelico, Chris Jerdonek, Nathaniel Smith, Nick Coghlan, Olivier Grisel, Paul Moore, Pierre Glaser, Robert Collins, Serhiy Storchaka, Terry Reedy