Mailman 3 dtype subarray comparison - NumPy-Discussion

newer
numpy.polynomial.Polynomial

dtype subarray comparison

older
test_creation Errors with np...

Mark Wiebe

Oct. 20, 2010

11:59 p.m.

It turns out that when comparing dtypes, the subarray is currently ignored. This means you can get things like this:

...

...
...
import numpy as np np.dtype(('f4',(6))) == np.dtype(('f4',(2,3))) True np.dtype(([('a','i4')],2)) == np.dtype(([('b','u2'),('c','i2')],2)) True

which pretty clearly should be False in both cases. I've implemented a patch to fix this, for which a pull request is here: http://github.com/numpy/numpy/pull/6 The main points included in this patch are: * If there's a subarray shape, it has to match exactly. * Currently, sometimes the subarray shape was an integer, other times a tuple. The output formatting code checks for this and prints a tuple in both cases. I changed the code to turn the integer into a tuple on construction instead. This didn't cause any tests to fail, so I think it's a safe change. * The current code converts (type, 1) and (type, tuple()) to just type, so this means (type, 1) != (type, (1,)) but (type, 2) == (type, (2,)) I put in some tests of these edge cases. http://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/descript... Cheers, Mark

Attachments:

attachment.htm (text/html — 1.7 KB)

Show replies by date

Charles R Harris

October 2010

5:03 p.m.

On Wed, Oct 20, 2010 at 5:59 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...

It turns out that when comparing dtypes, the subarray is currently ignored. This means you can get things like this:

...
...
...
import numpy as np np.dtype(('f4',(6))) == np.dtype(('f4',(2,3))) True np.dtype(([('a','i4')],2)) == np.dtype(([('b','u2'),('c','i2')],2)) True

which pretty clearly should be False in both cases. I've implemented a patch to fix this, for which a pull request is here:

http://github.com/numpy/numpy/pull/6

The main points included in this patch are:

* If there's a subarray shape, it has to match exactly. * Currently, sometimes the subarray shape was an integer, other times a tuple. The output formatting code checks for this and prints a tuple in both cases. I changed the code to turn the integer into a tuple on construction instead. This didn't cause any tests to fail, so I think it's a safe change. * The current code converts (type, 1) and (type, tuple()) to just type, so this means (type, 1) != (type, (1,)) but (type, 2) == (type, (2,)) I put in some tests of these edge cases.

http://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/descript...

I think the important thing to settle here is just how dtype comparisons are supposed to work. The current implementation might well be intended and if so we need to know why it is so. I'd like Travis to weigh in here. Chuck

Pauli Virtanen

6:18 p.m.

Thu, 21 Oct 2010 11:03:52 -0600, Charles R Harris wrote: [clip]

...

I think the important thing to settle here is just how dtype comparisons are supposed to work. The current implementation might well be intended and if so we need to know why it is so. I'd like Travis to weigh in here.

To me it seems that sub-arrays were not considered in the initial implementation for `==`. I do not see reasons why one would want to have a = dtype([('a', 'i1', (3, 3))]) b = dtype([('a', 'i1', 9)]) a == b In any case the current implementation is inconsistent, since currently hash(a) != hash(b) However, I can see why one would want to have CanCastSafely to be true between these types. *** The other issue of allowing broadcasting in sub-arrays --- it does not seem very useful to me. Unlike arrays, the dimensions of sub-arrays cannot be manipulated easily, and so many use-cases of broadcasting just disappear. Pauli

Charles R Harris

7:22 p.m.

On Thu, Oct 21, 2010 at 12:18 PM, Pauli Virtanen <pav@iki.fi> wrote:

...

Thu, 21 Oct 2010 11:03:52 -0600, Charles R Harris wrote: [clip]

...
I think the important thing to settle here is just how dtype comparisons are supposed to work. The current implementation might well be intended and if so we need to know why it is so. I'd like Travis to weigh in here.

To me it seems that sub-arrays were not considered in the initial implementation for `==`.

I do not see reasons why one would want to have

a = dtype([('a', 'i1', (3, 3))]) b = dtype([('a', 'i1', 9)]) a == b

In any case the current implementation is inconsistent, since currently

hash(a) != hash(b)

However, I can see why one would want to have CanCastSafely to be true between these types.

Checking the shape seems like the right thing to do, I just want to be sure that we aren't overlooking something. If we make the change will we need to fix CanCastSafely?

...

***

The other issue of allowing broadcasting in sub-arrays --- it does not seem very useful to me. Unlike arrays, the dimensions of sub-arrays cannot be manipulated easily, and so many use-cases of broadcasting just disappear.

Chuck

Mark Wiebe

7:50 p.m.

...

<snip>

The other issue of allowing broadcasting in sub-arrays --- it does not seem very useful to me. Unlike arrays, the dimensions of sub-arrays cannot be manipulated easily, and so many use-cases of broadcasting just disappear. I implemented the a/b shape checking in the structured array comparison code, and it breaks 6 unit tests. The errors it produces are like the following: ====================================================================== FAIL: test_multiarray.TestNewBufferProtocol.test_roundtrip ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/nose/case.py", line 186, in runTest self.test(*self.arg) File "/home/mwiebe/installtest/lib64/python2.6/site-packages/numpy/core/tests/test_multiarray.py", line 1635, in test_roundtrip self._check_roundtrip(x) File "/home/mwiebe/installtest/lib64/python2.6/site-packages/numpy/core/tests/test_multiarray.py", line 1595, in _check_roundtrip assert_array_equal(obj, y) File "/home/mwiebe/installtest/lib64/python2.6/site-packages/numpy/testing/utils.py", line 686, in assert_array_equal verbose=verbose, header='Arrays are not equal') File "/home/mwiebe/installtest/lib64/python2.6/site-packages/numpy/testing/utils.py", line 618, in assert_array_compare raise AssertionError(msg) AssertionError: Arrays are not equal (mismatch 100.0%) x: array(([[1, 2], [3, 4]],), dtype=[('a', '<i8', (2, 2))]) y: array([([[1, 2], [3, 4]],)], dtype=[('a', '<i8', (2, 2))]) Here's what this test does:

...

...
...
import numpy as np from numpy.core.multiarray import memorysimpleview as memoryview obj = np.array(([[1, 2], [3, 4]],), dtype=[('a', '<i8', (2, 2))]) x = memoryview(obj) y = np.asarray(x) obj == y False

...

...
...
y['a'].shape (1, 2, 2) obj['a'].shape (2, 2)

This happens because y.shape is (1,) and obj.shape is (), then the dtype's (2,2) shape is being appended to both when extracting the field 'a'. Cheers, Mark p.s.: Here's the small code addition, which I can add to the branch if desired: diff --git a/numpy/core/src/multiarray/arrayobject.c b/numpy/core/src/multiarray/a index 6e5bd9a..f12012c 100644 --- a/numpy/core/src/multiarray/arrayobject.c +++ b/numpy/core/src/multiarray/arrayobject.c @@ -844,6 +844,14 @@ _void_compare(PyArrayObject *self, PyArrayObject *other, int Py_DECREF(a); return NULL; } + if (PyArray_NDIM(a) != PyArray_NDIM(b) || + !PyArray_CompareLists(PyArray_DIMS(a), PyArray_DIMS(b), PyArr + Py_XDECREF(res); + Py_DECREF(a); + Py_DECREF(b); + Py_INCREF(Py_NotImplemented); + return Py_NotImplemented; + } temp = array_richcompare((PyArrayObject *)a,b,cmp_op); Py_DECREF(a); Py_DECREF(b);

Pauli Virtanen

8:06 p.m.

Thu, 21 Oct 2010 12:50:11 -0700, Mark Wiebe wrote: [clip]

...

...
...
...
import numpy as np from numpy.core.multiarray import memorysimpleview as memoryview obj = np.array(([[1, 2], [3, 4]],), dtype=[('a', '<i8', (2, 2))]) x = memoryview(obj) y = np.asarray(x) obj == y False

...
...
...
y['a'].shape (1, 2, 2) obj['a'].shape (2, 2)

This happens because y.shape is (1,) and obj.shape is (), then the dtype's (2,2) shape is being appended to both when extracting the field 'a'.

Nice catch! It seems that `_void_compare` *must* handle broadcasting itself -- I thought this was done by the caller, but apparently not. So you're right that the correct place to do the shape check is on the dtype level. -- Pauli Virtanen

Mark Wiebe

8:33 p.m.

On Thu, Oct 21, 2010 at 1:06 PM, Pauli Virtanen <pav@iki.fi> wrote:

...

<snip> It seems that `_void_compare` *must* handle broadcasting itself -- I thought this was done by the caller, but apparently not. So you're right that the correct place to do the shape check is on the dtype level.

There's a bug in my current patch with regard to this then, as follows:

...

...
...
x = np.array([(0,),(0,),(1,)],dtype=[('a','f8',(1,))]) y = np.array([[(0,)],[(1,)]],dtype=[('a','f8',(1,))]) x==y array([ True, False, False], dtype=bool) y==x array([[ True, True, False], [False, False, True]], dtype=bool)

-Mark

Pauli Virtanen

9:11 p.m.

Thu, 21 Oct 2010 13:33:21 -0700, Mark Wiebe wrote: [clip]

...

There's a bug in my current patch with regard to this then, as follows:

...
...
...
x = np.array([(0,),(0,),(1,)],dtype=[('a','f8',(1,))]) y = np.array([[(0,)],[(1,)]],dtype=[('a','f8',(1,))]) x==y array([ True, False, False], dtype=bool) y==x array([[ True, True, False], [False, False, True]], dtype=bool)

Also, it is not correct to assume the dimensions are added to the end:

...

...
...
x = np.zeros((2,3),dtype=[('a','f8',(4,))]).T x.shape (3, 2) x['a'].shape (4, 3, 2)

There's a special branch in the field access code that does this for Fortran-order arrays. Or, more precisely, arrays with the fortran-order flag set. (IIRC, there are some ways to create arrays in Fortran order so that the flag does not get set.) This seems to be somewhat of a mis-feature to me -- it makes predicting code behavior difficult to anticipate. -- Pauli Virtanen

Mark Wiebe

9:25 p.m.

On Thu, Oct 21, 2010 at 2:11 PM, Pauli Virtanen <pav@iki.fi> wrote:

...

<snip>

Also, it is not correct to assume the dimensions are added to the end:

...
...
...
x = np.zeros((2,3),dtype=[('a','f8',(4,))]).T x.shape (3, 2) x['a'].shape (4, 3, 2)

There's a special branch in the field access code that does this for Fortran-order arrays. Or, more precisely, arrays with the fortran-order flag set. (IIRC, there are some ways to create arrays in Fortran order so that the flag does not get set.)

This seems to be somewhat of a mis-feature to me -- it makes predicting code behavior difficult to anticipate.

That sounds like it could be tricky to handle correctly, especially since it should probably work to compare Fortran-order arrays with non-Fortran ones. Does anything actually depend on this behavior? Maybe it's something that could change? -Mark

Pauli Virtanen

9:47 p.m.

Thu, 21 Oct 2010 14:25:24 -0700, Mark Wiebe wrote: [clip]

...

That sounds like it could be tricky to handle correctly, especially since it should probably work to compare Fortran-order arrays with non-Fortran ones.

Does anything actually depend on this behavior? Maybe it's something that could change?

I believe nothing in Numpy or Scipy depends on that. The main question is how much of 3rd party code is likely to use this feature. I suppose this quirk is not widely relied on, and as far as I know, the behavior has not been documented. For instance, Travis's Guide to Numpy says (on subdtype): "If a field whose dtype object has this attribute is retrieved, then the extra dimensions implied by the shape are tacked on to the end of the retrieved array." IMO, it should be possible to change this with little prior notice. A can of worms :) -- Pauli Virtanen

Mark Wiebe

11:15 p.m.

On Thu, Oct 21, 2010 at 2:47 PM, Pauli Virtanen <pav@iki.fi> wrote:

...

Thu, 21 Oct 2010 14:25:24 -0700, Mark Wiebe wrote: [clip]

...
That sounds like it could be tricky to handle correctly, especially since it should probably work to compare Fortran-order arrays with non-Fortran ones.

Does anything actually depend on this behavior? Maybe it's something that could change?

I believe nothing in Numpy or Scipy depends on that. The main question is how much of 3rd party code is likely to use this feature.

I suppose this quirk is not widely relied on, and as far as I know, the behavior has not been documented. For instance, Travis's Guide to Numpy says (on subdtype): "If a field whose dtype object has this attribute is retrieved, then the extra dimensions implied by the shape are tacked on to the end of the retrieved array."

IMO, it should be possible to change this with little prior notice.

A can of worms :)

I found the commit where Travis introduced the behaviour: http://github.com/numpy/numpy/commit/2fd136eb7977dd5172eb92d173a7c39921b0cf9... http://projects.scipy.org/numpy/changeset/2116 I didn't find any discussion in the mailing list or a trac ticket, so hopefully Travis could elaborate on it? -Mark

5237

Age (days ago)

5238

Last active (days ago)

List overview

Download

10 comments

3 participants

participants (3)

Charles R Harris
Mark Wiebe
Pauli Virtanen

dtype subarray comparison

tags

participants (3)