Fast way to convert (nested) list to numpy object array?
Hello,

In my application I use nested, sometimes variable-length lists, e.g. [[1,2], [1,2,3], ...]. These can also become doubly nested, etc., up to arbitrary complexity. I like to use numpy indexing on the outer list, i.e. I want to create:

    array([[1, 2], [1, 2, 3]], dtype=object)

However, because numpy likes to 'walk' through the nested lists, this becomes rather slow when the nested lists are large, e.g.:

    k = [range(i) for i in range(10000)]
    %timeit numpy.array(k)
    1 loops, best of 3: 2.11 s per loop

Compared to shorter lists, e.g.:

    k2 = [range(numpy.random.randint(0,10)) for i in range(10000)]
    %timeit numpy.array(k2)
    100 loops, best of 3: 2.7 ms per loop

As I know beforehand that numpy does not have to descend into these objects, I would just like to create a 1-dimensional array. I thought about using fromiter, but this fails with:

    ValueError: cannot create object arrays from iterator

A second approach I tried is to create an empty array and then fill it:

    x = numpy.empty(len(k), dtype=object)
    %timeit x[:] = k
    1000 loops, best of 3: 220 µs per loop

This already works much, much better, but the assignment still takes time to 'descend' into the objects if they have a fixed size, e.g.:

    k3 = [[range(10) for i in range(100)] for i in range(10000)]
    %timeit x[:] = k3
    10 loops, best of 3: 45.6 ms per loop

In these cases a Python loop is even faster:

    %timeit for pos, e in enumerate(k3): x[pos] = e
    1000 loops, best of 3: 1.02 ms per loop

This piece of code is quite time-critical in my application, and I observe slowdowns due to this behaviour. My question therefore is whether there is a fast way to convert a list into a 1-dimensional object array without each object being descended into. More generally, if I create an array with numpy.array(k), would it be possible to indicate that it should search only 1, 2, ... nested levels deep into k?

Thanks for any advice,
Marc
numpy descends into the lists even if you request an object dtype, as it treats object arrays containing nested lists of equal size as n-dimensional:

    np.array([[1,2], [3,4]], dtype=object).ndim
    2

I don't think we have a constructor that limits the maximum dimension, only one that sets the minimum dimension. I guess we could add one, e.g.

    np.array(nested_list, dtype=object, ndmax=1)

But I'm not sure if it's really worth it. Can't you somehow move the array construction out of your tight loops?
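A minimal sketch of the behaviour described above, assuming a reasonably recent NumPy: inner lists of equal length become a second dimension even with dtype=object, while inner lists of differing length stay as list objects in a 1-dimensional array. The variable names are illustrative only.

```python
import numpy as np

# Inner lists of equal length: numpy discovers a second dimension.
regular = np.array([[1, 2], [3, 4]], dtype=object)

# Inner lists of different length: numpy stops at the outer list.
ragged = np.array([[1, 2], [1, 2, 3]], dtype=object)

print(regular.shape)  # (2, 2)
print(ragged.shape)   # (2,)
```

This is why the same construction code can return arrays of different dimensionality depending only on the data.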
On second thought, I guess adding a short circuit to the dimension discovery on mismatching list length with object type should solve the issue too. A bit more information on the use case would still be useful: why do you need to use numpy arrays for this in the first place?
I use numpy as the base for a prototype data handling language (which matches dimensions not by position, as in numpy, but by identity). This allows SQL-like operations on complex data structures. The code has to be generic, to handle the corner cases. Numpy is used as it provides fast indexing and ufuncs.

Input is often formatted using regular Python constructs. This input data is 'unpacked' to a certain depth, which means that it is converted to numpy arrays, to allow for generic query operations.

This can however go wrong. Say that we have nested variable-length lists. What sometimes happens is that part of the data has (by chance) only fixed-length nested lists, while another part has variable-length nested lists. If we then unpack, numpy will construct a multi-dimensional array in the first case and a single-dimensional array of nested lists in the second. If we then want to, e.g., concatenate this data using a generic operation, it will have trouble handling the mix of multi-dimensional and 1-dimensional arrays. The code becomes quite a bit simpler if I know beforehand that I can expect just, e.g., a 1-dimensional array.

This is maybe somewhat of a corner case :) However, I was still wondering why, when assigning x[:] = k, k is still 'descended into' further than needed given the limited dimension of x. This seems unnecessary? It is also not really clear to me why fromiter does not work with object dtypes. A solution for these two more general problems would already help me a lot.

The generic solution of adding an nmaxdim parameter to numpy.array would of course be even more ideal :)
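The mismatch described above can be sketched as follows. This is only an illustration: `to_object_1d` is a hypothetical helper name, and it uses the plain Python loop from the first message rather than any dedicated numpy constructor.

```python
import numpy as np


def to_object_1d(items):
    """Hypothetical helper: store each element of `items` as-is in a 1-D object array."""
    out = np.empty(len(items), dtype=object)
    for pos, item in enumerate(items):
        # Integer indexing stores the object itself without descending into it.
        out[pos] = item
    return out


fixed = [[1, 2], [3, 4]]     # inner lists happen to have equal length
variable = [[1], [2, 3, 4]]  # inner lists have different lengths

# np.array gives a 2-D array for one part and a 1-D array for the other.
shapes = (np.array(fixed, dtype=object).shape,
          np.array(variable, dtype=object).shape)

# With the helper both parts are 1-D, so a generic concatenate works.
combined = np.concatenate([to_object_1d(fixed), to_object_1d(variable)])
```

Here `shapes` holds one 2-dimensional and one 1-dimensional shape, while `combined` is a single 1-dimensional object array of four lists.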
On Thu, 2014-07-03 at 14:36 +0200, Marc Hulsman wrote:
This is maybe somewhat of a corner case :) However, I was still wondering why, when assigning x[:] = k, k is still 'descended into' further than needed given the limited dimension of x. This seems unnecessary? It is also not really clear to me why fromiter does not work with object dtypes. A solution for these two more general problems would already help me a lot.
True and true. I don't see a problem with fromiter being more general; someone just has to sit down and add new error checks/cleanup code for the object case. The assignment could probably also be optimized. I'm not sure how hard that is, but I would expect it isn't that hard. As usual, someone just needs to find the time and the interest to actually do it ;).

- Sebastian
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
I looked at the code of FromIter below.

    /*
     * We would need to alter the memory RENEW code to decrement any
     * reference counts before throwing away any memory.
     */
    if (PyDataType_REFCHK(dtype)) {
        PyErr_SetString(PyExc_ValueError,
                "cannot create object arrays from iterator");
        goto done;
    }

However, the memory renew code (which just reallocs the memory to increase the array size) uses a simple realloc. It seems to me that it is not necessary to adjust reference counts in this case (as the incref from the new memory compensates for the decref from the memory that is removed)? For the addition of elements to the array, everything seems to be OK anyway, as setitem is used, which already does the incref. So I think it should be possible to just remove this check?

I have not yet looked at the assignment issue; I had some difficulty finding the correct place in the code. Does anyone have any pointers on where to look?
On Fri, 2014-07-04 at 17:32 +0200, Marc Hulsman wrote:
I looked at the code of FromIter below.
    /*
     * We would need to alter the memory RENEW code to decrement any
     * reference counts before throwing away any memory.
     */
    if (PyDataType_REFCHK(dtype)) {
        PyErr_SetString(PyExc_ValueError,
                "cannot create object arrays from iterator");
        goto done;
    }
However, the memory renew code (which just reallocs the memory to increase the array size) uses a simple realloc. It seems to me that it is not necessary to adapt reference counts in this case (as the incref from the new memory compensates the decref from the memory that is removed)? For the addition of elements to the array, everything seems to be ok anyway, as setitem is used, which does the incref already. So I think it should be possible to just remove this check?
Yes and no. I agree that the comment was just being overly careful, since the renew will copy the pointers without calling Py_INCREF. However, you *will* need to add new error cleanup logic in case the iterator throws an error or you run out of memory, since in that case you need to decref everything again.
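For the NumPy version discussed here, where fromiter rejects object dtypes, the same result can be approximated in pure Python. A minimal sketch, assuming it is acceptable to materialise the iterator into a list first; `object_array_from_iter` is a hypothetical name, not a numpy function. Because the work happens in Python, the interpreter's own reference counting handles the cleanup if the iterator raises part-way.

```python
import numpy as np


def object_array_from_iter(iterable):
    """Hypothetical stand-in for fromiter with dtype=object."""
    # If the iterator raises here, no numpy array has been allocated yet.
    items = list(iterable)
    out = np.empty(len(items), dtype=object)
    for pos, item in enumerate(items):
        out[pos] = item  # store the object itself, no descent into it
    return out


arr = object_array_from_iter(range(i) for i in range(4))
```

The trade-off is an intermediate list, which a true fromiter implementation would avoid by growing the array as it goes.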
I have not yet looked at the assignment issue; I had some difficulty finding the correct place in the code. Does anyone have any pointers on where to look?
This is handled by PyArray_CopyObject in arrayobject.c. The actual logic is probably done by PyArray_GetArrayParamsFromObject in ctors.c. That is a public function, so my guess is you would have to create a new one which allows passing in a maximum ndim, and then make the old one call that one with NPY_MAXDIMS (or whatever it was).

- Sebastian
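The idea of a discovery routine that takes a maximum ndim, with the existing entry point calling it using the global limit, can be illustrated in Python. This is a simplified sketch and does not reproduce numpy's actual rules or C code: `discover_shape`, `discover_shape_default` and `MAX_DIMS` are hypothetical names, and the value of `MAX_DIMS` is only a placeholder for NPY_MAXDIMS.

```python
MAX_DIMS = 32  # placeholder standing in for NPY_MAXDIMS


def discover_shape(obj, max_ndim):
    """Illustrative only: shape of nested lists, descending at most max_ndim levels."""
    if max_ndim == 0 or not isinstance(obj, list):
        return ()
    inner_shapes = [discover_shape(item, max_ndim - 1) for item in obj]
    first = inner_shapes[0] if inner_shapes else ()
    # Short circuit: if the items disagree on shape, stop at this level.
    if any(shape != first for shape in inner_shapes):
        return (len(obj),)
    return (len(obj),) + first


def discover_shape_default(obj):
    """The 'old' entry point: same logic with the global dimension limit."""
    return discover_shape(obj, MAX_DIMS)
```

With `max_ndim=1`, the routine never looks inside the inner lists, which is the behaviour wanted for both construction and assignment into a 1-dimensional object array.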
On Thu, Jul 3, 2014 at 5:36 AM, Marc Hulsman <m.hulsman@tudelft.nl> wrote:
This can however go wrong. Say that we have nested variable-length lists. What sometimes happens is that part of the data has (by chance) only fixed-length nested lists, while another part has variable-length nested lists. If we then unpack, numpy will construct a multi-dimensional array in the first case and a single-dimensional array of nested lists in the second. If we then want to, e.g., concatenate this data using a generic operation, it will have trouble handling the mix of multi-dimensional and 1-dimensional arrays. The code becomes quite a bit simpler if I know beforehand that I can expect just, e.g., a 1-dimensional array.
Pandas has a couple of awkward work-arounds to do just that (creating object arrays). Might be worth taking a look:

https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L315
https://github.com/pydata/pandas/blob/master/pandas/core/common.py#L2124

Cheers,
Stephan
On Thu, Jul 3, 2014 at 3:30 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
numpy descends into the lists even if you request an object dtype, as it treats object arrays containing nested lists of equal size as n-dimensional:

    np.array([[1,2], [3,4]], dtype=object).ndim
    2

I don't think we have a constructor that limits the maximum dimension, only one that sets the minimum dimension.
There was discussion of such a constructor some years ago, specifically for the object case. I think it would be useful.

Chuck
participants (5)
- Charles R Harris
- Julian Taylor
- Marc Hulsman
- Sebastian Berg
- Stephan Hoyer