creation of ndarray with dtype=np.object : bug?
Emanuele Olivetti
Hi,
I am using 2D arrays where only one dimension remains constant, e.g.:

---
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])             # 2 x 3
b = np.array([[9, 8, 7]])                        # 1 x 3
c = np.array([[1, 3, 5], [7, 9, 8], [6, 4, 2]])  # 3 x 3
d = np.array([[5, 5, 4], [4, 3, 3]])             # 2 x 3
---

I have a large number of them and need to extract subsets of them through fancy indexing and then stack them together. For this reason I put them into an array of dtype=np.object, given their non-constant nature. Indexing works well :) but stacking does not :(, as you can see in the following example:

---
# fancy indexing :)
data = np.array([a, b, c, d], dtype=np.object)
idx = [0, 1, 3]
print(data[idx])
---

In [1]:
[[[1 2 3]
 [4 5 6]] [[9 8 7]] [[5 5 4]
 [4 3 3]]]
---
# stacking :(
data2 = np.array([a, b, c], dtype=np.object)
data3 = np.array([a, d], dtype=np.object)
together = np.vstack([data2, data3])
---

In [2]:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-7ebee5709e29> in <module>()
----> 1 execfile(r'/tmp/python-3276515J.py') # PYTHON-MODE

/tmp/python-3276515J.py in <module>()
      1 data2 = np.array([a, b, c], dtype=np.object)
      2 data3 = np.array([a, d], dtype=np.object)
----> 3 together = np.vstack([data2, data3])

/usr/lib/python2.7/dist-packages/numpy/core/shape_base.pyc in vstack(tup)
    224
    225     """
--> 226     return _nx.concatenate(map(atleast_2d,tup),0)
    227
    228 def hstack(tup):
ValueError: arrays must have same number of dimensions
----

The reason for the error is that data2.shape is "(3,)", while data3.shape is "(2, 2, 3)". This happens because the creation of ndarrays with dtype=np.object tries to be "smart" and infer the common dimensions between the objects you put in the array, instead of just creating an array of the objects you give. This leads to unexpected results when you use it, like the one in the example, because you cannot control the resulting shape, which is data dependent. Or at least I cannot find a way to create data3 with shape (2,)...
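For concreteness, a quick check of the two shapes (a sketch added for illustration, using data2 and data3 as defined above):

---
print(data2.shape)   # (3,)       a, b and c have different shapes -> 1-D object array
print(data3.shape)   # (2, 2, 3)  a and d are both 2 x 3 -> regular 3-D array
---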
How should I address this issue? To me, it looks like a bug in the excellent NumPy.
Best,
Emanuele
Emanuele,

This doesn't address your question directly. However, I wonder if you could approach this problem in a different way to get what you want. First of all, create an "index" array and then just vstack all of your arrays at once.

-----
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])             # 2 x 3
b = np.array([[9, 8, 7]])                        # 1 x 3
c = np.array([[1, 3, 5], [7, 9, 8], [6, 4, 2]])  # 3 x 3
d = np.array([[5, 5, 4], [4, 3, 3]])             # 2 x 3

all_array = [a, b, c, d]

z = []
for n, i in enumerate(all_array):
    z.extend([n] * i.shape[0])
z = np.array(z)

varrays = np.vstack(all_array)
----

Now z looks like this: `array([0, 0, 1, 2, 2, 2, 3, 3])`, and varrays is a vstack of all your data. To select one of your arrays, you can do something like the following.

-----
[In]: varrays[z == 2]   # Array c
[Out]:
array([[1, 3, 5],
       [7, 9, 8],
       [6, 4, 2]])
-----

Now, if you want to select both arrays b and d, for example, you would need a boolean array that looks like this: array([False, False, True, False, False, False, True, True]). I think there is some NumPy black magic that lets you do this easily (e.g. `i_wish = z == [1,3]`), but right now, I can only think about how to do it with a loop:

----
idxs = np.zeros(z.shape, dtype=bool)
for i in [1, 3]:
    idxs = np.logical_or(idxs, z == i)
idxs
----

This lets you select from the large stacked array and get the vstacked rows automatically.

----
[In]: varrays[idxs]
[Out]:
array([[9, 8, 7],
       [5, 5, 4],
       [4, 3, 3]])
-----

Sorry if this does not help. Just spit-balling...

Ryan
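As an aside (a sketch, not part of the original reply): the "black magic" wished for above does exist as np.in1d, which tests, element by element, whether the values of z appear in a given list. Reusing z and varrays from the code above:

----
idxs = np.in1d(z, [1, 3])   # vectorized equivalent of the logical_or loop
varrays[idxs]               # rows belonging to arrays b and d
----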
On 12/03/2014 04:32 AM, Ryan Nelson wrote:
Emanuele,
This doesn't address your question directly. However, I wonder if you could approach this problem in a different way to get what you want.
First of all, create an "index" array and then just vstack all of your arrays at once.
Ryan,

Thank you for your solution. Indeed it works. But it seems to me that manually creating an index and re-implementing slicing should be a last resort. NumPy is *great* and provides excellent slicing and assembling tools. For some reason that I don't fully understand, when dtype=np.object the ndarray constructor tries to be "smart" and creates unexpected results that cannot be controlled.

Another simple example:

---
import numpy as np
from numpy.random import rand, randint

n_arrays = 4
shape0_min = 2
shape0_max = 4
for a in range(30):
    list_of_arrays = [rand(randint(shape0_min, shape0_max), 3)
                      for i in range(n_arrays)]
    array_of_arrays = np.array(list_of_arrays, dtype=np.object)
    print("shape: %s" % (array_of_arrays.shape,))
---

The usual output is:

shape: (4,)

but from time to time, when the randomly generated arrays have - by chance - the same shape, you get:

shape: (4, 2, 3)

which may crash your code at runtime.

To NumPy developers: is there a specific reason for np.array(..., dtype=np.object) to be "smart" instead of just assembling an array with the provided objects?

Best,
Emanuele
On Wed, Dec 3, 2014 at 2:21 AM, Emanuele Olivetti wrote:
To NumPy developers: is there a specific reason for np.array(..., dtype=np.object) to be "smart" instead of just assembling an array with the provided objects?
The safe way to create 1D object arrays from a list is by preallocating them, something like this:
a = [np.random.rand(2, 3), np.random.rand(2, 3)]
b = np.empty(len(a), dtype=object)
b[:] = a
b
array([array([[ 0.124382  ,  0.04489531,  0.93864908],
       [ 0.77204758,  0.63094413,  0.55823578]]),
       array([[ 0.80151723,  0.33147467,  0.40491018],
       [ 0.09905844,  0.90254708,  0.69911945]])], dtype=object)
It's only a tad more verbose than your current code, and you can always wrap it in a helper function if you find 2 lines of code to be too many.

As to why np.array tries to be smart, keep in mind that there are other applications of object arrays besides holding stacked sequences. The following code computes the 100-th Fibonacci number using the matrix form of the recursion (http://en.wikipedia.org/wiki/Fibonacci_number#Matrix_form), numpy's linear algebra capabilities, and Python's arbitrary precision ints:
a = np.array([[0, 1], [1, 1]], dtype=object)
np.linalg.matrix_power(a, 99)[0, 0]
135301852344706746049L
Trying to do this with any other type would result in either wrong results due to overflow:
a = np.array([[0, 1], [1, 1]])
np.linalg.matrix_power(a, 99)[0, 0]
-90618175
or lost precision:
a = np.array([[0, 1], [1, 1]], dtype=np.double)
np.linalg.matrix_power(a, 99)[0, 0]
1.3530185234470674e+20
Jaime

--
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him in his plans for world domination.
On 12/03/2014 12:17 PM, Jaime Fernández del Río wrote:
The safe way to create 1D object arrays from a list is by preallocating them, something like this:
a = [np.random.rand(2, 3), np.random.rand(2, 3)]
b = np.empty(len(a), dtype=object)
b[:] = a
b
array([array([[ 0.124382  ,  0.04489531,  0.93864908],
       [ 0.77204758,  0.63094413,  0.55823578]]),
       array([[ 0.80151723,  0.33147467,  0.40491018],
       [ 0.09905844,  0.90254708,  0.69911945]])], dtype=object)
Thank you for the compact way to create 1D object arrays. Definitely useful!
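For reference, a small helper that wraps this preallocation idiom (a sketch for illustration only; the name as_object_array is invented, not part of NumPy):

---
import numpy as np

def as_object_array(arrays):
    """Return a 1-D object array holding the given arrays, whatever their shapes."""
    out = np.empty(len(arrays), dtype=object)
    for i, arr in enumerate(arrays):
        out[i] = arr          # element-wise assignment, so no shape inference
    return out

data3 = as_object_array([np.random.rand(2, 3), np.random.rand(2, 3)])
data3.shape   # always (2,), even though both sub-arrays happen to be 2 x 3
---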
As to why np.array tries to be smart, keep in mind that there are other applications of object arrays than having stacked sequences. The following code computes the 100-th Fibonacci number using the matrix form of the recursion (http://en.wikipedia.org/wiki/Fibonacci_number#Matrix_form), numpy's linear algebra capabilities, and Python's arbitrary precision ints:
a = np.array([[0, 1], [1, 1]], dtype=object)
np.linalg.matrix_power(a, 99)[0, 0]
135301852344706746049L
Trying to do this with any other type would result in either wrong results due to overflow:
[...]
I guess that the problem I am referring to is not limited to stacked sequences; it is more general. Moreover, I do agree on the example you present: the array creation explores the list of lists and creates a 2D array of Python int instead of np.int64. Exploring iterable containers is certainly correct in general. I am wondering whether it should be prevented in some cases, where the semantics are clear from the syntax, e.g. when the nature of the container changes (see below). To me this is intuitive and correct:
a = np.array([[0, 1], [1, 1]], dtype=object)
a.shape
(2, 2)

while this is counterintuitive and potentially error-prone:

b = np.array([np.array([0, 1]), np.array([0, 1])], dtype=object)
b.shape
(2, 2)

because it is clear that I meant a list of two vectors, i.e. an array of shape (2,), and not a 2D array of shape (2, 2).
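To underline the data dependence, an added illustration (a sketch, not part of the original message): the very same construction yields shape (2,) as soon as the two vectors have different lengths.

---
b2 = np.array([np.array([0, 1]), np.array([0, 1, 2])], dtype=object)
b2.shape
(2,)
---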
Best, Emanuele
participants (3)
- Emanuele Olivetti
- Jaime Fernández del Río
- Ryan Nelson