[Numpy-discussion] Behaviour of copy for structured dtypes with gaps

Francesc Alted faltet at gmail.com
Fri Apr 12 05:20:58 EDT 2019

I recently gave this issue some thought because a user complained about
PyTables inadvertently removing the padding when doing a copy.
Incidentally, h5py does respect padding when doing copies, so I took
this seriously and released a new PyTables version mainly to fix this.
You can see the use case and my reflections here:

So, my take on this is that the padding is an integral part of the dtype
and should be respected during copies too (principle of minimal surprise).
With this, I am definitely aligned (pun intended) with contract (1).
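
To make the two contracts concrete, here is a minimal sketch of the situation under discussion. The three-field layout and the names `table`, `view`, `copied`, and `packed` are assumptions made up for illustration; they are not taken from the issue or from PyTables. It assumes NumPy 1.16 or later and uses `numpy.lib.recfunctions.repack_fields` as one way to ask for the compact layout explicitly.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical record layout: 4 + 8 + 4 = 16 bytes per item, no alignment padding.
full_dtype = np.dtype([("a", "<i4"), ("b", "<f8"), ("z", "<i4")])
table = np.zeros(4, dtype=full_dtype)
table["a"] = np.arange(4)
table["z"] = np.arange(4) * 10

# Since NumPy 1.16, multi-field indexing returns a view. Its dtype keeps the
# original offsets and itemsize, so the bytes of "b" show up as a gap.
view = table[["a", "z"]]
print("view:  ", view.dtype.itemsize, view.dtype)

# Copy the view and inspect the dtype to see which contract your NumPy follows.
copied = view.copy()
print("copy:  ", copied.dtype.itemsize, copied.dtype)

# repack_fields returns an array whose dtype has the gaps removed.
packed = rfn.repack_fields(view)
print("packed:", packed.dtype.itemsize, packed.dtype)
```

Comparing the printed itemsize of `copied` with that of `view` and `packed` shows whether the copy kept the gap (contract 1) or dropped it (contract 2).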


Message from Nathaniel Smith <njs at pobox.com> on Fri, 12 Apr 2019 at
4:08:

> My concern would be that to implement (2), I think .copy() has to
> either special-case certain dtypes, or else we have to add some kind
> of "simplify for copy" operation to the dtype protocol. These both add
> architectural complexity, so maybe it's better to avoid it unless we
> have a compelling reason?
> On Thu, Apr 11, 2019 at 6:51 AM Marten van Kerkwijk
> <m.h.vankerkwijk at gmail.com> wrote:
> >
> > Hi All,
> >
> > An issue [1] about the copying of arrays with structured dtype raised a
> question about what the expected behaviour is: does copy always preserve
> the dtype as is, or should it remove padding?
> >
> > Specifically, consider an array with a structure with many fields, say
> 'a' to 'z'. Since numpy 1.16, if one does `a[['a', 'z']]`, a view will be
> returned. In this case, its dtype will include a large offset. Now, if we
> copy this view, should the result have exactly the same dtype, including
> the large offset (i.e., the copy takes as much memory as the original full
> array), or should the padding be removed? From the discussion so far, it
> seems the logic has boiled down to a choice between:
> >
> > (1) Copy is a contract that the dtype will not vary (e.g., we also do
> not change endianness);
> >
> > (2) Copy is a contract that any access to the data in the array will
> return exactly the same result, without wasting memory and possibly
> optimized for access with different strides. E.g., `array[::10].copy()` also
> compacts the result.
> >
> > An argument in favour of (2) is that, before numpy 1.16, `a[['a',
> 'z']].copy()` did return an array without padding. Of course, this relied
> on `a[['a', 'z']]` already returning a copy without padding, but still this
> is a regression.
> >
> > More generally, there should at least be a clear way to get the compact
> copy. Also, it would make sense for things like `np.save` to remove any
> padding (it currently does not).
> >
> > What do people think? All the best,
> >
> > Marten
> >
> > [1] https://github.com/numpy/numpy/issues/13299
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> --
> Nathaniel J. Smith -- https://vorpus.org

Francesc Alted