<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jul 2, 2018 at 5:16 PM, Charles R Harris <span dir="ltr"><<a href="mailto:charlesr.harris@gmail.com" target="_blank">charlesr.harris@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On Mon, Jul 2, 2018 at 3:03 PM, Antoine Pitrou <span dir="ltr"><<a href="mailto:antoine@python.org" target="_blank">antoine@python.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

Hello,<br>

<br>

Some of you might know that I've been working on a PEP in order to<br>

improve pickling performance of large (or huge) data.  The PEP,<br>

numbered 574 and titled "Pickle protocol 5 with out-of-band data",<br>

allows participating data types to be pickled without any memory copy.<br>

<a href="https://www.python.org/dev/peps/pep-0574/" rel="noreferrer" target="_blank">https://www.python.org/dev/pep<wbr>s/pep-0574/</a><br>

<br>
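As a rough illustration of what "without any memory copy" means in practice, here is a minimal sketch. It assumes the standard-library pickle module of Python 3.8 or later, where protocol 5 is available; a bytearray stands in for a large array, and the variable names are purely illustrative.<br>

```python
import pickle

# Stand-in for the memory of a large array; a bytearray keeps the sketch stdlib-only.
big = bytearray(b"\x01" * 1_000_000)

# Buffers handed out-of-band by the pickler are collected here.
buffers = []

# With protocol 5 and a buffer_callback, a PickleBuffer is passed to the
# callback by reference instead of being copied into the pickle stream.
payload = pickle.dumps(
    pickle.PickleBuffer(big),
    protocol=5,
    buffer_callback=buffers.append,
)

# The consumer supplies the same buffers, in the same order, when loading.
restored = pickle.loads(payload, buffers=buffers)

# The restored object still exposes the original memory.
view = memoryview(restored)
print(len(payload), len(buffers), view.obj is big)
```

The pickle stream itself stays tiny because the bulk data travels separately in the buffers list, and how those buffers reach the consumer (shared memory, a socket, a file) is left to the application.<br>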

The PEP already has an implementation, which is backported as an<br>

independent PyPI package under the name "pickle5".<br>

<a href="https://pypi.org/project/pickle5/" rel="noreferrer" target="_blank">https://pypi.org/project/pickl<wbr>e5/</a><br>

<br>

I also have a working patch updating PyArrow to use the PEP-defined<br>

extensions to allow for zero-copy pickling of Arrow arrays - without<br>

breaking compatibility with existing usage:<br>

<a href="https://github.com/apache/arrow/pull/2161" rel="noreferrer" target="_blank">https://github.com/apache/arro<wbr>w/pull/2161</a><br>

<br>

Still, it is obvious that one of the primary targets of PEP 574 is Numpy<br>

arrays, as the most prevalent datatype in the Python scientific<br>

ecosystem.  I'm personally satisfied with the current state of the PEP,<br>

but I'd like to have feedback from Numpy core maintainers.  I haven't<br>

tried (yet?) to draft a Numpy patch to add PEP 574 support, since that's<br>

likely to be more involved due to the complexity of Numpy and due to<br>

the core being written in C.  Therefore I would like some help<br>

evaluating whether the PEP is likely to be a good fit for Numpy.<br>

<br></blockquote><div><br></div></span><div>Maybe somewhat off topic, but we have had trouble with a 2 GiB limit on file writes on OS X. See <a href="https://github.com/numpy/numpy/issues/3858" target="_blank">https://github.com/numpy/<wbr>numpy/issues/3858</a>. Does your implementation work around that?</div></div></div></div></blockquote><div><br></div><div>ISTR that some parallel processing applications sent pickled arrays around to different processes. I don't know if that is still the case, but if so, avoiding the copy might be a big gain for them.</div><div><br></div><div>Chuck</div></div></div></div>