dtype=object behavior change from 0.9.6 to beta 1
In version 0.9.6 one used to be able to do this:

In [4]: numpy.__version__
Out[4]: '0.9.6'

In [6]: numpy.array([numpy.array([4,5,6]), numpy.array([1,2,3])], dtype=object).shape
Out[6]: (2, 3)

In beta 1 numpy.PyObject no longer exists. I was trying to get the same behavior with dtype=object but it doesn't work:

In [7]: numpy.__version__
Out[7]: '1.0b1'

In [8]: numpy.array([numpy.array([4,5,6]), numpy.array([1,2,3])], dtype=object).shape
Out[8]: (2,)

Is this an intentional change?
On 8/31/06, Tom Denniston <tom.denniston@alum.dartmouth.org> wrote:
The latter looks more correct, in that it produces an array of objects. To get the previous behaviour there is the function vstack:

In [6]: a = array([1,2,3])

In [7]: b = array([4,5,6])

In [8]: vstack([a,b])
Out[8]:
array([[1, 2, 3],
       [4, 5, 6]])

Chuck
For this simple example, yes, but one of the nice things about the array constructors is that they know that lists, tuples and arrays are just sequences, and any combination of them is valid numpy input. So, for instance, a list of tuples yields a 2d array. A list of tuples of 1d arrays yields a 3d array. A list of 1d arrays yields a 2d array. This was the case consistently across all dtypes. Now it is the case across all of them except dtype=object, which has this unusual behavior.

I agree that vstack is a better choice when you know you have a list of arrays, but it is less useful when you don't know. And you can't force a type in the vstack function, so there is no longer an equivalent to the old dtype=object behavior:

In [7]: numpy.array([numpy.array([1,2,3]), numpy.array([4,5,6])], dtype=object)
Out[7]:
array([[1, 2, 3],
       [4, 5, 6]], dtype=object)

In [8]: numpy.vstack([numpy.array([1,2,3]), numpy.array([4,5,6])], dtype=object)
---------------------------------------------------------------------------
exceptions.TypeError                      Traceback (most recent call last)

TypeError: vstack() got an unexpected keyword argument 'dtype'
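For readers following along, here is a minimal sketch of one way to get a 2d object array out of vstack without passing it a dtype keyword: stack first, then convert with astype. The variable names are illustrative, and this only addresses the homogeneous case shown in the session above.

```python
import numpy as np

rows = [np.array([1, 2, 3]), np.array([4, 5, 6])]

# vstack joins the 1-D arrays into one 2-D integer array.
stacked = np.vstack(rows)

# astype(object) changes the element type and leaves the shape alone.
as_objects = stacked.astype(object)

print(stacked.shape, stacked.dtype)
print(as_objects.shape, as_objects.dtype)
```

Note that this converts after stacking, so if the rows are heterogeneous the stack has already settled on a common type before astype runs. It is not a full replacement for the old dtype=object behavior.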
On 8/31/06, Tom Denniston <tom.denniston@alum.dartmouth.org> wrote:
What are you trying to do? If you want integers:

In [32]: a = array([array([1,2,3]), array([4,5,6])], dtype=int)

In [33]: a.shape
Out[33]: (2, 3)

If you want objects, you have them:

In [30]: a = array([array([1,2,3]), array([4,5,6])], dtype=object)

In [31]: a.shape
Out[31]: (2,)

i.e., a is an array containing two array objects.

Chuck
But I have heterogeneous arrays that have numbers and strings and NoneType, etc. Take for instance:

In [11]: numpy.array([numpy.array([1,'A', None]), numpy.array([2,2,'Some string'])], dtype=object)
Out[11]:
array([[1, A, None],
       [2, 2, Some string]], dtype=object)

In [12]: numpy.array([numpy.array([1,'A', None]), numpy.array([2,2,'Some string'])], dtype=object).shape
Out[12]: (2, 3)

This works fine in Numeric and pre-beta numpy, but in beta numpy versions I get:

In [6]: numpy.array([numpy.array([1,'A', None]), numpy.array([2,2,'Some string'])], dtype=object)
Out[6]: array([[1 A None], [2 2 Some string]], dtype=object)

In [7]: numpy.array([numpy.array([1,'A', None]), numpy.array([2,2,'Some string'])], dtype=object).shape
Out[7]: (2,)

But a list of lists still gives:

In [9]: numpy.array([[1,'A', None], [2,2,'Some string']], dtype=object).shape
Out[9]: (2, 3)

And if you omit the dtype and use a list of arrays, then you get a string array of shape (2, 3):

In [11]: numpy.array([numpy.array([1,'A', None]), numpy.array([2,2,'Some string'])]).shape
Out[11]: (2, 3)

This new behavior strikes me as inconsistent. One of the things I love about numpy is that the ufuncs, array constructors, etc. don't care whether you pass a combination of lists, arrays, tuples, etc. They just know what you _mean_. And what you _mean_ by a list of lists, tuple of arrays, list of arrays, etc. is very consistent across constructors and functions. I think making an exception for dtype=object introduces a lot of inconsistencies, and it isn't clear to me what is gained. Does anyone commonly use the array interface in a manner where this new behavior is actually favorable? I may be overlooking a common use case or something like that.
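The consistency question raised here is easy to probe on whatever NumPy is installed. The sketch below builds the same heterogeneous data three ways and reports the shape the constructor picks for each with dtype=object. The dictionary and its labels are illustrative only. The thread reports (2,) for the list-of-arrays spelling under 1.0b1; on the later releases I am aware of, all three spellings agree, but the result is worth checking on your own installation.

```python
import numpy as np

row_a = [1, 'A', None]
row_b = [2, 2, 'Some string']

# The same 2x3 data, spelled with different container types.
spellings = {
    "list of lists": [row_a, row_b],
    "tuple of tuples": (tuple(row_a), tuple(row_b)),
    "list of object arrays": [
        np.array(row_a, dtype=object),
        np.array(row_b, dtype=object),
    ],
}

# Ask the constructor for an object array each time and record the shape.
shapes = {
    name: np.array(data, dtype=object).shape
    for name, data in spellings.items()
}

for name, shape in shapes.items():
    print(f"{name}: {shape}")
```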
On 8/31/06, Tom Denniston <tom.denniston@alum.dartmouth.org> wrote:
I think you want:

In [59]: a = array([array([1,'A', None],dtype=object),array([2,2,'Some string'],dtype=object)])

In [60]: a.shape
Out[60]: (2, 3)

Which makes good sense to me.

Chuck
On 8/31/06, Charles R Harris <charlesr.harris@gmail.com> wrote:
OK, I changed my mind. I think you are right and this is a bug. Using svn revision 3098 I get

In [2]: a = array([1,'A', None])
---------------------------------------------------------------------------
exceptions.TypeError                      Traceback (most recent call last)

/home/charris/<ipython console>

TypeError: expected a readable buffer object

Which is different than you get with beta 1 in any case. I think array should cast the objects in the list to the first common dtype, object in this case. So the previous should be shorthand for:

In [3]: a = array([1,'A', None], dtype=object)

In [4]: a.shape
Out[4]: (3,)

Chuck
Yes, one can take a toy example and hack it to work, but I don't necessarily have control over whether the input is a list of object arrays, a list of 1d heterogeneous arrays, etc. Before, I didn't need to worry about the input because numpy understood that a list of 1d arrays is a 2d piece of data. Now it understands this for all dtypes except object. My question was whether this new set of semantics is preferable to the old.

I think your example kind of proves my point. Does it really make any sense for the following two ways of specifying an array to give such different results? They strike me as _meaning_ the same thing. Doesn't it seem inconsistent to you?

In [13]: array([array([1,'A', None], dtype=object),array([2,2,'Some string'],dtype=object)], dtype=object).shape
Out[13]: (2,)

and

In [14]: array([array([1,'A', None], dtype=object),array([2,2,'Some string'],dtype=object)]).shape
Out[14]: (2, 3)

So my question is what is the _advantage_ of the new semantics? The two examples above used to give the same results. In what cases is it preferable for them to give different results? How does it make life simpler?
Tom Denniston wrote:
So my question is what is the _advantage_ of the new semantics?
What if the lists don't have the same length, and therefore cannot be made into an array? Then you get a weird result:

N.array([N.array([1,'A',None],dtype=object),N.array([2,2,'Somestring',5],dtype=object)]).shape
()

Now you get an object scalar. But:

N.array([N.array([1,'A',None],dtype=object),N.array([2,2,'Somestring',5],dtype=object)],dtype=object).shape
(2,)

Now you get a length-2 array, just like before: far more consistent. With the old semantics, if you test your code with arrays of different lengths, you'll get one thing, but if they then happen to be the same length in some production use, the whole thing breaks -- this is a Bad Idea.

Object arrays are just plain weird; there is nothing you can do that will satisfy every need. I think it's best for the array constructor not to try to guess what hierarchy of sequences you *meant* to use. You can (and probably should) always be explicit with:

dtype=object),N.array([2,2,'Somestring',5],dtype=object)]
A
array([[1 A None], [2 2 Somestring 5]], dtype=object)
-Chris -- Christopher Barker, Ph.D. Oceanographer NOAA/OR&R/HAZMAT (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
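The explicit session above did not survive archiving intact, so what follows is not a reconstruction of it. It is a minimal sketch of the general idea of being explicit: allocate a 1-D object array of the right length and fill it slot by slot, so that NumPy never has to guess how deep to unpack. The helper name as_1d_object_array is hypothetical, not a NumPy function.

```python
import numpy as np

def as_1d_object_array(items):
    """Hypothetical helper: always return a 1-D object array, one slot per item."""
    items = list(items)
    out = np.empty(len(items), dtype=object)
    # Assign one slot at a time so each item is stored as-is,
    # whether or not the items happen to share a length.
    for i, item in enumerate(items):
        out[i] = item
    return out

same_length = [
    np.array([1, 'A', None], dtype=object),
    np.array([2, 2, 'Some string'], dtype=object),
]
ragged = [
    np.array([1, 'A', None], dtype=object),
    np.array([2, 2, 'Somestring', 5], dtype=object),
]

a = as_1d_object_array(same_length)
b = as_1d_object_array(ragged)
print(a.shape, b.shape)
```

Built this way, the result has the same shape whether or not the inner arrays share a length, which is the property argued for above.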
I would think one would want to throw an error when the data has inconsistent dimensions. This is what numpy does for other dtypes:

In [10]: numpy.array(([1,2,3], [4,5,6]))
Out[10]:
array([[1, 2, 3],
       [4, 5, 6]])

In [11]: numpy.array(([1,3], [4,5,6]))
---------------------------------------------------------------------------
exceptions.TypeError                      Traceback (most recent call last)

TypeError: an integer is required
Tom Denniston wrote:
I would think one would want to throw an error when the data has inconsistent dimensions.
But it doesn't have inconsistent dimensions - they are perfectly consistent with a (2,) array of objects. How is the code to know what you intended? With numeric types, it is unambiguous to march down through the sequences until you get a number. As a sequence is an object, there is no way to do this automatically and unambiguously.

Perhaps the way to solve this is for the array constructor to take a "shape" or "rank" argument, so you could specify what you intend. But that's really just syntactic sugar to avoid calling numpy.empty() first. Perhaps a numpy.object_array() constructor would be useful, although as I think about it, even specifying a shape or rank would not be unambiguous!

This is a useful discussion. If we ever get an nd-array into the standard lib, I suspect that object arrays will get heavy use -- better to clean up the semantics now. Perhaps a Wiki page on building object arrays is called for.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
NOAA/OR&R/HAZMAT         (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker@noaa.gov
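To make the "rank argument" idea concrete, here is a rough sketch of what such a constructor might look like. No numpy.object_array() exists in NumPy to my knowledge; the function below, its name, and its ndim parameter are all hypothetical. It unpacks the input exactly ndim levels deep and stores whatever sits below that level as opaque objects.

```python
import numpy as np

def object_array(items, ndim=1):
    """Hypothetical constructor: unpack `items` exactly `ndim` levels deep."""
    if ndim < 1:
        raise ValueError("ndim must be at least 1")

    def shape_of(seq, depth):
        # Below the requested depth everything counts as a single object.
        if depth == 0:
            return ()
        seq = list(seq)
        sub_shapes = {shape_of(s, depth - 1) for s in seq}
        if len(sub_shapes) > 1:
            raise ValueError("ragged input at the requested depth")
        return (len(seq),) + (sub_shapes.pop() if sub_shapes else ())

    shape = shape_of(items, ndim)
    out = np.empty(shape, dtype=object)
    # Fill slot by slot so nothing below `ndim` levels is unpacked.
    for idx in np.ndindex(*shape):
        elem = items
        for i in idx:
            elem = elem[i]
        out[idx] = elem
    return out

rows = [
    np.array([1, 'A', None], dtype=object),
    np.array([2, 2, 'Some string'], dtype=object),
]
flat = object_array(rows, ndim=1)   # two slots, each holding an array
grid = object_array(rows, ndim=2)   # six slots, each holding a scalar
print(flat.shape, grid.shape)
```

The caller states the intent, so the same input can yield either layout. It does not settle the point made above that even a rank can be ambiguous, for example when the leaves are themselves sequences of matching length.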
On 8/31/06, Christopher Barker <Chris.Barker@noaa.gov> wrote:
Same as it produces a float array from array([1,2,3.0]). Array is a complicated function for precisely these sorts of reasons, but the convenience makes it worthwhile. So, if a list contains something that can only be interpreted as an object, dtype should be set to object.

With numeric types, it is unambiguous to march down through the
Chuck
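The analogy with array([1,2,3.0]) can be checked directly. The sketch below shows the constructor choosing a float dtype for mixed int/float input, next to the explicit dtype=object spelling from earlier in the thread. The object case passes dtype explicitly rather than relying on inference, since the thread itself shows inference for such input changing between revisions.

```python
import numpy as np

# All integers: the constructor keeps an integer dtype.
ints = np.array([1, 2, 3])

# One float among the integers: the whole array is promoted to float.
mixed_numeric = np.array([1, 2, 3.0])

# Heterogeneous Python objects, with the dtype stated explicitly.
explicit_objects = np.array([1, 'A', None], dtype=object)

print(ints.dtype, mixed_numeric.dtype)
print(explicit_objects.dtype, explicit_objects.shape)
```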
participants (3): Charles R Harris, Christopher Barker, Tom Denniston