Preserving NumPy views when pickling

With a custom wrapper class, it's possible to preserve NumPy views when pickling: https://stackoverflow.com/questions/13746601/preserving-numpy-view-when-pick... This can result in significant time/space savings when pickling views along with their base arrays, and it brings the behavior of NumPy more in line with Python proper. Is this something that we can/should port into NumPy itself?

On Tue, Oct 25, 2016 at 12:38 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
Concretely, what would you suggest should happen with:

    base = np.zeros(100000000)
    view = base[:10]

    # case 1
    pickle.dump(view, file)

    # case 2
    pickle.dump(base, file)
    pickle.dump(view, file)

    # case 3
    pickle.dump(view, file)
    pickle.dump(base, file)

?

-- Nathaniel J. Smith -- https://vorpus.org

On Tue, Oct 25, 2016 at 1:07 PM, Nathaniel Smith <njs@pobox.com> wrote:
I see what you're getting at here. We would need a rule for when to include the base in the pickle and when not to. Otherwise, pickle.dump(view, file) always contains data from the base pickle, even when view is much smaller than base. The safe answer is "only use views in the pickle when base is already being pickled", but that isn't possible to check unless all the arrays are together in a custom container. So, this isn't really feasible for NumPy.

On Tue, Oct 25, 2016 at 3:07 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
We would need a rule for when to include the base in the pickle and when not to. Otherwise, pickle.dump(view, file) always contains data from the base pickle, even when view is much smaller than base. The safe answer is "only use views in the pickle when base is already being pickled", but that isn't possible to check unless all the arrays are together in a custom container. So, this isn't really feasible for NumPy.

It would be possible with a custom Pickler/Unpickler since they already keep track of objects previously (un)pickled. That would handle [base, view] okay but not [view, base], so it's probably not going to be all that useful outside of special situations. It would make a neat recipe, but I probably would not provide it in numpy itself.

-- Robert Kern
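A rough sketch of what such a recipe might look like, using pickle's persistent-ID hooks. The names ViewPickler, ViewUnpickler, dumps, loads and the "ndarray-view" tag are all made up for illustration, and the sketch assumes plain ndarrays whose base is a C-contiguous ndarray. A view is stored by reference only if its base was already written by the same pickler; otherwise it falls back to NumPy's ordinary copy-the-data pickling.

```python
import io
import pickle

import numpy as np


def _addr(arr):
    """Integer address of the first element of `arr`."""
    return arr.__array_interface__["data"][0]


class ViewPickler(pickle.Pickler):
    """Hypothetical recipe: store a view as (base, offset, shape, strides)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._seen = {}  # id(array) -> array, for arrays pickled in full

    def persistent_id(self, obj):
        if not isinstance(obj, np.ndarray):
            return None
        if id(obj) in self._seen:
            return None  # already pickled in full; pickle's memo handles it
        base = obj.base
        if (
            isinstance(base, np.ndarray)
            and id(base) in self._seen
            and base.flags.c_contiguous
            and obj.size > 0
        ):
            # The base is already in this pickle: record only how to re-slice it.
            offset = _addr(obj) - _addr(base)
            return ("ndarray-view", base, offset, obj.shape, obj.strides, obj.dtype)
        # Otherwise pickle the data normally and remember that we did.
        self._seen[id(obj)] = obj
        return None


class ViewUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        tag, base, offset, shape, strides, dtype = pid
        if tag != "ndarray-view":
            raise pickle.UnpicklingError(f"unsupported persistent id: {tag!r}")
        # Rebuild the view on top of the already-unpickled base buffer.
        return np.ndarray(shape, dtype=dtype, buffer=base, offset=offset, strides=strides)


def dumps(obj):
    buf = io.BytesIO()
    ViewPickler(buf).dump(obj)
    return buf.getvalue()


def loads(data):
    return ViewUnpickler(io.BytesIO(data)).load()


base = np.arange(10)
view = base[2:8:2]

# [base, view]: the view is restored on top of the restored base.
new_base, new_view = loads(dumps([base, view]))
new_base[2] = 99  # visible through new_view because the buffer is shared

# [view, base]: the view is written first, so it is stored as an independent copy.
lone_view, later_base = loads(dumps([view, base]))
```

This only covers the [base, view] ordering, as noted above. Persistent objects are also not memoized by pickle, so a view that appears twice in the same pickle would come back as two separate view objects over the same base.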

It seems pickle keeps track of references for basic Python types.

    x = [1]
    y = [x]
    x, y = pickle.loads(pickle.dumps((x, y)))
    x.append(2)
    print(y)
    [[1, 2]]

NumPy arrays are different: references are forgotten after pickle/unpickle. Shared objects do not remain shared. Based on the quote below, it could be considered a bug with numpy/pickle.

    Object sharing (references to the same object in different places): This is similar to self-referencing objects; pickle stores the object once, and ensures that all other references point to the master copy. Shared objects remain shared, which can be very important for mutable objects. link <https://docs.python.org/2.0/lib/module-pickle.html>

Another example with ndarrays:

    x = np.arange(5)
    y = x[::-1]
    x, y = pickle.loads(pickle.dumps((x, y)))
    x[0] = 9
    print(y)
    [4 3 2 1 0]

In this case the two original arrays share the exact same object for the data buffer (although object might not be the right word here).

(In reply to Robert Kern <robert.kern@gmail.com>, Tue, Oct 25, 2016 at 7:28 PM.)
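The difference can be checked directly with np.shares_memory. This is a small sketch of the ndarray example above; the variable names x2, y2, shared_before and shared_after are only for illustration.

```python
import pickle

import numpy as np

x = np.arange(5)
y = x[::-1]                      # a reversed view onto x's buffer
shared_before = np.shares_memory(x, y)

# Round-trip both arrays together in a single pickle.
x2, y2 = pickle.loads(pickle.dumps((x, y)))
shared_after = np.shares_memory(x2, y2)

x2[0] = 9                        # does not propagate to y2
print(shared_before, shared_after)
print(y2)
```

Each array is serialized with its own copy of the data, so after the round trip the two results no longer alias each other.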

On Tue, Oct 25, 2016 at 5:09 PM, Matthew Harrigan <harrigan.matthew@gmail.com> wrote:
Numpy arrays are different but references are forgotten after pickle/unpickle. Shared objects do not remain shared. Based on the quote below it could be considered a bug with numpy/pickle.

Not a bug, but an explicit design decision on numpy's part.

-- Robert Kern

Hi,

Just another perspective. `base` and `data` in PyArrayObject are two separate variables. `base` can point to any PyObject, but it is `data` that defines where data is accessed in memory.

1. There is no clear way to pickle a pointer (`data`) in a meaningful way. In order for the `data` member to make sense, we still need to read out the values stored at the `data` pointer in the pickle.
2. By definition, `base` is not necessarily a numpy array; it is just some other object for managing the memory.
3. One can surely pickle the `base` object as a reference, but it is useless if the data memory has been reconstructed independently during unpickling.
4. Unless there is a clear way to notify the referencing numpy array of the new data pointer. There probably isn't.

BTW, is the stride information lost during pickling, too? The behavior should probably be documented if it is not yet.

Yu

(In reply to Robert Kern <robert.kern@gmail.com>, Tue, Oct 25, 2016 at 5:29 PM.)
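From Python, the two fields can be inspected roughly as follows. This is a sketch only: the names parent, child, raw and from_buf are illustrative, and the integer data address is read from `__array_interface__["data"]`.

```python
import numpy as np

parent = np.arange(10, dtype=np.int64)
child = parent[2:6]

# `base` is a reference to whatever object keeps the memory alive.
print(child.base is parent)

# The data pointer is exposed as an integer address.
parent_addr = parent.__array_interface__["data"][0]
child_addr = child.__array_interface__["data"][0]
offset = child_addr - parent_addr   # 16 bytes: two int64 elements

# `base` need not be an ndarray: here the memory belongs to a bytearray.
raw = bytearray(16)
from_buf = np.frombuffer(raw, dtype=np.uint8)
print(type(from_buf.base))
```

When the base is another ndarray, the byte offset plus the child's shape and strides is enough to describe the child as a slice of the parent. When the base is some other buffer-owning object, there may be nothing array-like to refer back to.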

On Tue, Oct 25, 2016 at 7:05 PM, Feng Yu <rainwoodman@gmail.com> wrote:
In general, yes, but most often it's another ndarray, and the child is related to the parent by a slice operation that could be computed by comparing the `data` tuples. The exercise here isn't to always represent the general case in this way, but to see what can be done opportunistically and if that actually helps solve a practical problem.
The stride information may be lost, yes. We reserve the right to retain it, though (for example, if .T is contiguous then we might well serialize the transposed data linearly and return a view on that data upon deserialization). I don't believe that we guarantee that the unpickled result is contiguous. -- Robert Kern
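A quick way to see what happens to strides on a given NumPy version is to round-trip a non-contiguous view and compare. This is only a sketch (the names a, strided and restored are illustrative), and, as noted above, the layout of the result is not something to rely on.

```python
import pickle

import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)
strided = a[:, ::2]              # every other column: a non-contiguous view

restored = pickle.loads(pickle.dumps(strided))

# Values and shape survive; the memory layout may differ.
print(strided.strides, restored.strides)
print(strided.flags.c_contiguous, restored.flags.c_contiguous)
print(np.array_equal(strided, restored))
```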

On Tue, Oct 25, 2016 at 5:09 PM, Matthew Harrigan <harrigan.matthew@gmail.com> wrote:
Yes, but the problem is: suppose I have a 10 gigabyte array, and then take a 20 byte slice of it, and then pickle that slice. Do you expect the pickle file to be 20 bytes, or 10 gigabytes? Both options are possible, but you have to pick one, and numpy picks 20 bytes. The advantage is obviously that you don't have mysterious 10 gigabyte pickle files; the disadvantage is that you can't reconstruct the view relationships afterwards. (You might think: oh, but we can be clever, and only record the view relationships if the user pickles both objects together. But while pickle might know whether the user is pickling both objects together, it unfortunately doesn't tell numpy, so we can't really do anything clever or different in this case.) -n -- Nathaniel J. Smith -- https://vorpus.org
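The size trade-off is easy to observe on a smaller scale. The sketch below uses a base of about 8 MB instead of 10 gigabytes to keep memory use modest; view_bytes, base_bytes and restored are illustrative names.

```python
import pickle

import numpy as np

base = np.zeros(1_000_000)       # about 8 MB of float64 data
view = base[:10]                 # 80 bytes of data, sharing base's buffer

view_bytes = len(pickle.dumps(view))
base_bytes = len(pickle.dumps(base))
print(view_bytes, base_bytes)

# The round-tripped view is an independent array, not a window onto `base`.
restored = pickle.loads(pickle.dumps(view))
print(np.shares_memory(restored, base))
```

The pickle of the view carries only the view's own elements plus a small header, which is the "20 bytes" choice described above.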

participants (5): Feng Yu, Matthew Harrigan, Nathaniel Smith, Robert Kern, Stephan Hoyer