PEP 574 (pickle 5) implementation and backport available

Hi,

While PEP 574 (pickle protocol 5 with out-of-band data) is still in draft status, I've made available an implementation in branch "pickle5" in my GitHub fork of CPython: https://github.com/pitrou/cpython/tree/pickle5

Also I've published an experimental backport on PyPI, for Python 3.6 and 3.7. This should help people play with the new API and features without having to compile Python: https://pypi.org/project/pickle5/

Any feedback is welcome.

Regards

Antoine.
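
For readers who want to try it, here is a minimal sketch of the out-of-band API as described in the PEP. It uses the stdlib `pickle` names as they later landed in Python 3.8; on 3.6/3.7 the backport would be used instead (e.g. `import pickle5 as pickle`):

```python
import pickle  # on 3.6/3.7: import pickle5 as pickle

payload = pickle.PickleBuffer(b"some large binary payload")

# Collect out-of-band buffers instead of embedding them in the stream.
buffers = []
data = pickle.dumps(payload, protocol=5, buffer_callback=buffers.append)

# The same buffers must be supplied, in order, when unpickling.
restored = pickle.loads(data, buffers=buffers)
```

The pickle stream `data` then contains only a placeholder for the buffer; the actual bytes travel separately in `buffers`, which is the whole point of the protocol.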

Link to the PEP: "PEP 574 -- Pickle protocol 5 with out-of-band data" https://www.python.org/dev/peps/pep-0574/

Victor

2018-05-24 19:57 GMT+02:00 Antoine Pitrou <solipsis@pitrou.net>:
Hi,
While PEP 574 (pickle protocol 5 with out-of-band data) is still in draft status, I've made available an implementation in branch "pickle5" in my GitHub fork of CPython: https://github.com/pitrou/cpython/tree/pickle5
Also I've published an experimental backport on PyPI, for Python 3.6 and 3.7. This should help people play with the new API and features without having to compile Python: https://pypi.org/project/pickle5/
Any feedback is welcome.
Regards
Antoine.

I tried this implementation to add no-copy pickling for large numpy arrays, and it seems to work as expected (for a simple contiguous array). I took some notes on the numpy tracker to advertise this PEP to the numpy developers: https://github.com/numpy/numpy/issues/11161

-- Olivier
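
For anyone curious what the opt-in looks like at the Python level, here is the pattern PEP 574 itself gives for a buffer-backed type (numpy does the equivalent for its arrays); this is a slightly simplified sketch adapted from the PEP's `ZeroCopyByteArray` example:

```python
import pickle
from pickle import PickleBuffer  # on 3.6/3.7: from pickle5 import PickleBuffer

class ZeroCopyByteArray(bytearray):
    """Buffer-backed type that opts in to out-of-band pickling,
    adapted from the example in PEP 574."""

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Hand the raw buffer to the pickler; with a buffer_callback
            # it is transmitted out-of-band, without copying the data.
            return type(self)._reconstruct, (PickleBuffer(self),), None
        # PickleBuffer is forbidden with protocols <= 4, so copy instead.
        return type(self)._reconstruct, (bytearray(self),)

    @classmethod
    def _reconstruct(cls, obj):
        return cls(obj)

arr = ZeroCopyByteArray(b"\x00" * 1024)
bufs = []
data = pickle.dumps(arr, protocol=5, buffer_callback=bufs.append)
restored = pickle.loads(data, buffers=bufs)
```

The protocol-4 branch keeps the type picklable with older protocols, just with a copy.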

On May 24, 2018, at 10:57 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
While PEP 574 (pickle protocol 5 with out-of-band data) is still in draft status, I've made available an implementation in branch "pickle5" in my GitHub fork of CPython: https://github.com/pitrou/cpython/tree/pickle5
Also I've published an experimental backport on PyPI, for Python 3.6 and 3.7. This should help people play with the new API and features without having to compile Python: https://pypi.org/project/pickle5/
Any feedback is welcome.
Thanks for doing this.

Hope it isn't too late, but I would like to suggest that protocol 5 support fast compression by default. We normally pickle objects so that they can be transported (saved to a file or sent over a socket). Transport costs (reading and writing a file or socket) are generally proportional to size, so compression is likely to be a net win (much as it was for header compression in HTTP/2).

The PEP lists compression as a possible refinement only for large objects, but I expect it will be a win for most pickles to compress them in their entirety.

Raymond

On Fri, 25 May 2018 10:36:08 -0700 Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
On May 24, 2018, at 10:57 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
While PEP 574 (pickle protocol 5 with out-of-band data) is still in draft status, I've made available an implementation in branch "pickle5" in my GitHub fork of CPython: https://github.com/pitrou/cpython/tree/pickle5
Also I've published an experimental backport on PyPI, for Python 3.6 and 3.7. This should help people play with the new API and features without having to compile Python: https://pypi.org/project/pickle5/
Any feedback is welcome.
Thanks for doing this.
Hope it isn't too late, but I would like to suggest that protocol 5 support fast compression by default. We normally pickle objects so that they can be transported (saved to a file or sent over a socket). Transport costs (reading and writing a file or socket) are generally proportional to size, so compression is likely to be a net win (much as it was for header compression in HTTP/2).
The PEP lists compression as a possible refinement only for large objects, but I expect it will be a win for most pickles to compress them in their entirety.
It's not too late (the PEP is still a draft, and there's a lot of time before 3.8), but I wonder what would be the benefit of making it a part of the pickle specification, rather than compressing independently.

Whether and how to compress is generally a compromise between transmission (or storage) speed and computation speed. Also, there are specialized compressors for higher efficiency (for example, Blosc has datatype-specific compression for Numpy arrays). Such knowledge can be embodied in domain-specific libraries such as Dask/distributed, but it cannot really be incorporated in pickle itself.

Do you have something specific in mind?

Regards

Antoine.
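
For reference, the "compressing independently" approach is roughly a one-liner on each side; zlib here is just one choice among many (Blosc, lz4, zstd), which is the point:

```python
import pickle
import zlib

obj = {"payload": bytes(10000), "meta": "example"}

# Compress after pickling, decompress before unpickling: the choice of
# codec stays independent of the pickle protocol itself.
blob = zlib.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
restored = pickle.loads(zlib.decompress(blob))
```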

On 2018-05-25, Antoine Pitrou wrote:
Do you have something specific in mind?
I think compressed by default is a good idea. My quick proposal:

- Use fast compression like lz4 or zlib with Z_BEST_SPEED
- Add a 'compress' keyword argument with a default of None. For protocol 5, None means to compress. Providing 'compress' != None for older protocols will raise an error.

The compression overhead will be small compared to the pickle/unpickle costs. If someone wants to apply their own (e.g. better) compression, they can set compress=False.

An alternative idea is to have two different protocol formats, e.g. 5 and 6. One is "pickle 5" with compression, one without compression. I don't like that as much since it breaks the idea that higher protocol numbers are "better".

Regards, Neil
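
As an illustration only, the proposed semantics could be emulated today with a thin wrapper. The `compress` keyword here is hypothetical (it is not a real `pickle` parameter), and zlib at Z_BEST_SPEED merely stands in for whichever fast codec would actually be chosen:

```python
import pickle
import zlib

def dumps(obj, protocol=None, compress=None):
    # Hypothetical wrapper mirroring the proposal: 'compress' defaults
    # to on for protocol 5 and is an error for older protocols.
    proto = pickle.DEFAULT_PROTOCOL if protocol is None else protocol
    if compress is not None and proto < 5:
        raise ValueError("'compress' requires protocol 5")
    data = pickle.dumps(obj, protocol=proto)
    if compress or (compress is None and proto >= 5):
        data = zlib.compress(data, zlib.Z_BEST_SPEED)
    return data
```

A matching `loads` would need to know, or sniff, whether the stream is compressed, which is part of what the rest of the thread debates.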

On Fri, 25 May 2018 14:50:57 -0600 Neil Schemenauer <nas-python@arctrix.com> wrote:
On 2018-05-25, Antoine Pitrou wrote:
Do you have something specific in mind?
I think compressed by default is a good idea. My quick proposal:
- Use fast compression like lz4 or zlib with Z_BEST_SPEED
- Add a 'compress' keyword argument with a default of None. For protocol 5, None means to compress. Providing 'compress' != None for older protocols will raise an error.
The question is what purpose does it serve for pickle to do it rather than for the user to compress the pickle themselves. You're basically saving one line of code. Am I missing some other advantage?

(Also note that it requires us to ship the lz4 library with Python, or another modern compression library such as zstd; zlib's performance characteristics are outdated.)

Regards

Antoine.

On 2018-05-25, Antoine Pitrou wrote:
The question is what purpose does it serve for pickle to do it rather than for the user to compress the pickle themselves. You're basically saving one line of code.
It's one line of code everywhere pickling or unpickling happens. And you probably need to import a compression module, so at least two lines. Then maybe you need to figure out if the pickle is compressed and what kind of compression is used. So, add a few more lines.

It seems logical to me that users of pickle want it to be fast and produce small pickles. Compressing by default seems the right choice, even though it complicates the implementation.

Ivan brings up a valid point that compressed pickles are harder to debug. However, I think that's much less important than being small.
it requires us to ship the lz4 library with Python
Yeah, that's not so great. I think zlib with Z_BEST_SPEED would be fine. However, some people might worry it is too slow or doesn't compress enough. Having lz4 as a battery included seems like a good idea anyhow; I understand that it is pretty well established as a useful compression method. Obviously, requiring a new C library to be included expands the effort of implementation a lot.

This discussion can easily lead into bikeshedding (e.g. relative merits of different compression schemes). Since I'm not volunteering to implement anything, I will stop responding at this point. ;-)

Regards, Neil

On Fri, May 25, 2018 at 3:35 PM, Neil Schemenauer <nas-python@arctrix.com> wrote:
This discussion can easily lead into bikeshedding (e.g. relative merits of different compression schemes). Since I'm not volunteering to implement anything, I will stop responding at this point. ;-)
I think the bikeshedding -- or more to the point, the fact that there's a wide variety of options for compressing pickles, and none of them are appropriate in all circumstances -- means that this is something that should remain a separate layer.

Even super-fast algorithms like lz4 are inefficient when you're transmitting pickles between two processes on the same system: they still add extra memory copies. And that's a very common use case.

-n

-- Nathaniel J. Smith -- https://vorpus.org

Antoine Pitrou schrieb am 25.05.2018 um 23:11:
On Fri, 25 May 2018 14:50:57 -0600 Neil Schemenauer wrote:
On 2018-05-25, Antoine Pitrou wrote:
Do you have something specific in mind?
I think compressed by default is a good idea. My quick proposal:
- Use fast compression like lz4 or zlib with Z_BEST_SPEED
- Add a 'compress' keyword argument with a default of None. For protocol 5, None means to compress. Providing 'compress' != None for older protocols will raise an error.
The question is what purpose does it serve for pickle to do it rather than for the user to compress the pickle themselves. You're basically saving one line of code. Am I missing some other advantage?
Regarding the pickling side, if the pickle is large, then it can save memory to compress while pickling, rather than compressing after pickling. But that can also be done with file-like objects, so the advantage is small here.

I think a major advantage is on the unpickling side rather than the pickling side. Sure, users can compress a pickle after the fact, but if there's a (set of) standard algorithms that unpickle can handle automatically, then it's enough to pass "something pickled" into unpickle, rather than having to know (or figure out) if and how that pickle was originally compressed, and build up the decompression pipeline for it to get everything uncompressed efficiently without accidentally wasting memory or processing time.

Obviously, auto-decompression opens up a gate for compression bombs, but then, unpickling data from untrusted sources is discouraged anyway, so...

Stefan
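
The "compress while pickling via a file-like object" approach mentioned above can be sketched with a gzip stream (an in-memory BytesIO stands in for a real file or socket):

```python
import gzip
import io
import pickle

obj = {"values": list(range(1000))}

# Compress while pickling: pickle writes into the gzip stream chunk by
# chunk, so the full uncompressed pickle never has to exist in memory.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

# Decompress while unpickling, symmetrically.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    restored = pickle.load(f)
```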

On 25.05.2018 20:36, Raymond Hettinger wrote:
On May 24, 2018, at 10:57 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
While PEP 574 (pickle protocol 5 with out-of-band data) is still in draft status, I've made available an implementation in branch "pickle5" in my GitHub fork of CPython: https://github.com/pitrou/cpython/tree/pickle5
Also I've published an experimental backport on PyPI, for Python 3.6 and 3.7. This should help people play with the new API and features without having to compile Python: https://pypi.org/project/pickle5/
Any feedback is welcome. Thanks for doing this.
Hope it isn't too late, but I would like to suggest that protocol 5 support fast compression by default. We normally pickle objects so that they can be transported (saved to a file or sent over a socket). Transport costs (reading and writing a file or socket) are generally proportional to size, so compression is likely to be a net win (much as it was for header compression in HTTP/2).
The PEP lists compression as a possible a refinement only for large objects, but I expect is will be a win for most pickles to compress them in their entirety.
I would advise against that. The pickle format is unreadable as it is; compression would make it literally impossible to diagnose problems. Python supports transparent compression, e.g. with the 'zlib' codec.
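
A sketch of the transparent approach mentioned here, using the bytes-to-bytes 'zlib_codec' through the codecs module:

```python
import codecs
import pickle

data = pickle.dumps({"a": 1, "b": list(range(50))})

# Transparent bytes-to-bytes compression via the codecs machinery.
compressed = codecs.encode(data, "zlib_codec")
original = codecs.decode(compressed, "zlib_codec")
```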
Raymond
-- Regards, Ivan
participants (8)

- Antoine Pitrou
- Ivan Pozdeev
- Nathaniel Smith
- Neil Schemenauer
- Olivier Grisel
- Raymond Hettinger
- Stefan Behnel
- Victor Stinner