[Numpy-discussion] Behaviour of copy for structured dtypes with gaps
Allan Haldane
allanhaldane at gmail.com
Fri Apr 12 12:13:18 EDT 2019
I would be much more in favor of `copy` eliminating padding in the
dtype if dtypes with different paddings were considered equivalent.
But they are not.
NumPy has always treated dtypes with different padding bytes as
unequal, and it prints them very differently:
>>> import numpy as np
>>> a = np.array([(1,)], dtype={'names': ['f'],
...                             'formats': ['i4'],
...                             'offsets': [0]})
>>> b = np.array([(1,)], dtype={'names': ['f'],
...                             'formats': ['i4'],
...                             'offsets': [4]})
>>> a.dtype == b.dtype
False
>>> a.dtype
dtype([('f', '<i4')])
>>> b.dtype
dtype({'names':['f'], 'formats':['<i4'], 'offsets':[4], 'itemsize':8})
That is unlike strides, which are hidden from the user.
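To make the contrast concrete, here is a small sketch (my own
illustration, not from the original discussion): `copy()` silently
normalizes strides, but keeps whatever padding the dtype carries.

```python
import numpy as np

# copy() normalizes strides without the user ever seeing them...
a = np.zeros(10, dtype='i4')
v = a[::2]                       # strided view: element step is 8 bytes
print(v.strides)                 # (8,)
print(v.copy().strides)          # (4,) -- strides are compacted

# ...but padding lives in the dtype itself, which copy() keeps as-is.
padded = np.zeros(3, dtype={'names': ['f'], 'formats': ['i4'],
                            'offsets': [4], 'itemsize': 8})
print(padded.copy().dtype.itemsize)   # 8 -- padding preserved
```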
If we do a "dtype overhaul", as has been discussed at length before,
there are many things we might change about structured dtypes, and
making padding irrelevant in most operations could be one of them.
On the other hand, one of the main purposes of structured arrays is
interpreting binary blobs and interfacing with C code that uses
structs, where padding can matter a great deal. E.g., if someone is
reading a binary file, they might want to do
>>> np.fromfile('myfile', a.dtype, count=10)
and then it matters a great deal to them whether the dtype has padding
or not.
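As a small sketch (my own illustration, not from the original mail) of
why that padding is load-bearing, here is a dtype built to mirror a
padded C struct; stripping the padding would change the layout and
break the correspondence with the file or struct:

```python
import numpy as np

# Mirrors the C struct:  struct rec { char tag; int32_t val; };
# With align=True, NumPy inserts the same 3 padding bytes a typical C
# compiler would, so itemsize matches sizeof(struct rec) == 8.
rec = np.dtype([('tag', 'S1'), ('val', '<i4')], align=True)
print(rec.itemsize)              # 8
print(rec.fields['val'][1])      # 4 -- 'val' starts after the padding

# The packed variant has a different, incompatible layout.
packed = np.dtype([('tag', 'S1'), ('val', '<i4')])
print(packed.itemsize)           # 5
```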
Best,
Allan
PS. It is unfinished, but I would like to advertise an 'ArrayCollection'
ndarray ducktype I have worked a bit on. It behaves very much like a
structured array for indexing and assignment, but avoids all these
padding issues and is in other ways better suited to "pandas-like"
usage. See the "ArrayCollection" and "MaskedArrayCollection" classes
at
https://github.com/ahaldane/ndarray_ducktypes
See the tests and doc folders for some brief example usage.
On 4/11/19 10:07 PM, Nathaniel Smith wrote:
> My concern would be that to implement (2), I think .copy() has to
> either special-case certain dtypes, or else we have to add some kind
> of "simplify for copy" operation to the dtype protocol. These both add
> architectural complexity, so maybe it's better to avoid it unless we
> have a compelling reason?
>
> On Thu, Apr 11, 2019 at 6:51 AM Marten van Kerkwijk
> <m.h.vankerkwijk at gmail.com> wrote:
>>
>> Hi All,
>>
>> An issue [1] about the copying of arrays with structured dtype raised a question about what the expected behaviour is: does copy always preserve the dtype as is, or should it remove padding?
>>
>> Specifically, consider an array with a structure with many fields, say 'a' to 'z'. Since numpy 1.16, if one does a[['a', 'z']]`, a view will be returned. In this case, its dtype will include a large offset. Now, if we copy this view, should the result have exactly the same dtype, including the large offset (i.e., the copy takes as much memory as the original full array), or should the padding be removed? From the discussion so far, it seems the logic has boiled down to a choice between:
>>
>> (1) Copy is a contract that the dtype will not vary (e.g., we also do not change endianness);
>>
>> (2) Copy is a contract that any access to the data in the array will return exactly the same result, without wasting memory and possibly optimized for access with different strides. E.g., `array[::10].copy()` also compacts the result.
>>
>> An argument in favour of (2) is that, before numpy 1.16, `a[['a', 'z']].copy()` did return an array without padding. Of course, this relied on `a[['a', 'z']]` already returning a copy without padding, but still this is a regression.
>>
>> More generally, there should at least be a clear way to get the compact copy. Also, it would make sense for things like `np.save` to remove any padding (it currently does not).
>>
>> What do people think? All the best,
>>
>> Marten
>>
>> [1] https://github.com/numpy/numpy/issues/13299
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
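For what it's worth, a "clear way to get the compact copy" already
exists: `numpy.lib.recfunctions.repack_fields`, added alongside the
1.16 view behaviour, returns a copy with the padding removed. A
minimal sketch:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.zeros(4, dtype=[('a', 'i4'), ('b', 'f8'), ('z', 'i2')])
view = a[['a', 'z']]                  # since 1.16: a view with gaps
print(view.dtype.itemsize)            # 14 -- same as a.dtype.itemsize

packed = rfn.repack_fields(view)      # copy with the padding removed
print(packed.dtype.itemsize)          # 6 -- 4 (i4) + 2 (i2)
```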