MaskedArray __setitem__ Performance
I'm rewriting some code that used to carry around separate data and mask arrays so that it uses MaskedArray instead. As part of it, I read data into an array from an input stream. By its nature this is a "one at a time" process, so it is basically a loop that assigns single elements (in no predetermined order) of already allocated arrays. Unfortunately, using MaskedArray in this way is significantly slower. The sample code below demonstrates that for this particular procedure, filling the MaskedArray is 32x slower than working with the two arrays I had been carrying around.

It appears that I can regain the fill performance by working on _data and _mask directly. I can guarantee that the MaskedArrays I'm working with have been created with a dense mask as I've done below (there are always some masked elements, so there is no gain in shrinking to nomask). Is this safe? If not, can I make it safe for this particular performance-critical section?

I'm assuming that doing array operations won't incur this sort of penalty when I get further into my translation. Some overhead is acceptable for the convenience of not dragging around the mask and thinking about it all of the time, but hopefully less than 2x slower.

Thanks!
Alex

    import numpy
    import numpy.ma
    import timeit

    def get_ndarrays():
        return (numpy.zeros((5000,500), dtype=float),
                numpy.ones((5000,500), dtype=bool))

    # Baseline: assign one element of the plain data and mask arrays.
    t_base = timeit.Timer(
        'a[0,0] = 1.0; m[0,0] = False',
        'from __main__ import get_ndarrays; a,m = get_ndarrays()'
        ).timeit(1000)/1000
    print(t_base)
    # 6.97574691756e-007

    def get_maskedarray():
        return numpy.ma.MaskedArray(
            numpy.zeros((5000,500), dtype=float),
            numpy.ones((5000,500), dtype=bool)
            )

    # Assign one element through MaskedArray.__setitem__.
    t_ma = timeit.Timer(
        'a[0,0] = 1.0',
        'from __main__ import get_maskedarray; a = get_maskedarray()'
        ).timeit(1000)/1000
    print(t_ma, t_ma/t_base)
    # 2.26880790715e-005 32.5242290749

    # Assign one element through the underlying _data and _mask arrays.
    t_ma_com = timeit.Timer(
        'd[0,0] = 1.0; m[0,0] = False',
        'from __main__ import get_maskedarray; a = get_maskedarray(); d,m = a._data,a._mask'
        ).timeit(1000)/1000
    print(t_ma_com, t_ma_com/t_base)
    # 7.34450886914e-007 1.05286343612
Alexander, You get the gist here: process your _data and _mask separately and recombine them into a MaskedArray at the end. That way, you'll skip most of the overhead costs brought by some tests in the package (in __getitem__, __setitem__...).
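A minimal sketch of the pattern suggested above, assuming the input stream can be modelled as an iterable of (row, col, value) tuples. The function name fill_from_stream and the record layout are hypothetical and only serve the illustration; the point is that the hot loop touches plain ndarrays and the MaskedArray is built once at the end.

```python
import numpy as np

def fill_from_stream(shape, records):
    """Fill plain data/mask arrays one element at a time, then wrap them once.

    `records` is a hypothetical iterable of (row, col, value) tuples standing
    in for the input stream described in the thread.
    """
    data = np.zeros(shape, dtype=float)
    mask = np.ones(shape, dtype=bool)   # start with every element masked
    for row, col, value in records:
        data[row, col] = value          # plain ndarray assignment
        mask[row, col] = False          # unmask the element just read
    # Wrap only once, after the loop; copy=False is the default.
    return np.ma.MaskedArray(data, mask=mask)

filled = fill_from_stream((3, 4), [(0, 0, 1.0), (2, 3, 7.5)])
print(filled.count())  # number of unmasked elements
```

Elements that never appear in the stream stay masked, which matches the dense-mask setup used in the timing code.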
On Feb 16, 2008 12:25 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
Alexander, You get the gist here: process your _data and _mask separately and recombine them into a MaskedArray at the end. That way, you'll skip most of the overhead costs brought by some tests in the package (in __getitem__, __setitem__...).
Can I safely carry around the data, mask and MaskedArray? I'm considering working along the lines of the following conceptual outline:

    d = numpy.array(shape, dtype)
    m = numpy.array(shape, bool)
    a = numpy.ma.MaskedArray(d, m)
    load_initial_data(d, m)
    for update in updates:
        apply_update(update, d, m)
    result = calculate_result(a)

I guess the alternative would be like:

    d = numpy.array(shape, dtype)
    m = numpy.array(shape, bool)
    load_initial_data(d, m)
    for update in updates:
        apply_update(update, d, m)
    a = numpy.ma.MaskedArray(d, m)
    result = calculate_result(a)

Perhaps this is cleaner in some ways, but I'm trying to squeeze the most performance out of the basic update loop I've sketched, so that the calculate_result function can afford to exchange some performance for clarity and simplicity (if desired). I haven't yet measured the overhead in creating a MaskedArray, but there probably isn't much since by default no copies are made.

Thanks for your advice,
Alex
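The construction overhead mentioned above can be measured directly. The following is only a sketch, assuming the same array shape as the earlier timing code and a boolean mask; it times the wrapping step alone and then uses numpy.shares_memory to check that no copy was made. The timing will vary by machine and NumPy version, so no figure is quoted here.

```python
import timeit
import numpy as np

# Same shapes as in the timing code earlier in the thread.
d = np.zeros((5000, 500), dtype=float)
m = np.ones((5000, 500), dtype=bool)

# Time only the wrapping step; d and m already exist.
n = 1000
t_wrap = timeit.timeit(lambda: np.ma.MaskedArray(d, mask=m), number=n) / n
print("seconds per MaskedArray construction:", t_wrap)

# Check that the wrapper looks at the same memory as d and m.
a = np.ma.MaskedArray(d, mask=m)
shares_data = np.shares_memory(a.data, d)
shares_mask = np.shares_memory(a.mask, m)
print(shares_data, shares_mask)
```

If both checks report shared memory, the cost of wrapping does not depend on copying the arrays, so either ordering in the outlines above pays it only once.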
Can I safely carry around the data, mask and MaskedArray? I'm considering working along the lines of the following conceptual outline:
That depends a lot on what calculate_result does, and on whether you update the arrays in place or not.
    d = numpy.array(shape, dtype)
    m = numpy.array(shape, bool)
    a = numpy.ma.MaskedArray(d, m)
You should be able to update d and m and have the changes passed to a (as long as you're not using copy=True). You have to make sure that m really has a dtype of MaskType (bool), or else you'll break the connection.

Explanation: in MaskedArray.__new__, the mask argument is converted to a dtype of MaskType (bool). If the mask is originally an integer array, for example, a copy is made, and the _mask of your masked array does not point to `mask`. For example:
    d = numpy.array([1,2,3])
    m = numpy.array([0,0,1])        # integer mask
    x = numpy.ma.array(d, mask=m)
    print(x)
    # [1 2 --]
    d[0] = 17
    print(x)
    # [17 2 --]
OK, x is properly updated. If we now try to change the mask:
    m[0] = 1
    print(x)
    # [17 2 --]
x is not updated, as x._mask doesn't point to m, but to a copy of m as the dtype changed from int to bool. Now, if we ensure that m is an array of booleans:
    d = numpy.array([1,2,3])
    m = numpy.array([0,0,1], dtype=bool)
    x = numpy.ma.array(d, mask=m)
    print(x)
    # [1 2 --]
    d[0] = 17
    print(x)
    # [17 2 --]
    m[0] = 1
    print(x)
    # [-- 2 --]

m was of the correct dtype in the first place, so no copy is made, and x._mask does point to m.
In short: in your example, updating d and m should work and be more efficient than updating a directly.
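One way to guard against silently breaking the link is to check the mask dtype before wrapping. The helper below is a hypothetical sketch (wrap_linked is not part of numpy.ma); it rejects non-boolean masks and then uses numpy.shares_memory to confirm that the MaskedArray really looks at d and m.

```python
import numpy as np

def wrap_linked(d, m):
    """Wrap d and m in a MaskedArray that stays linked to both arrays.

    Hypothetical helper, not part of numpy.ma: it refuses masks that
    MaskedArray would have to convert (and therefore copy).
    """
    if m.dtype != np.bool_:
        raise TypeError("mask must have dtype bool to stay linked, got %s" % m.dtype)
    if m.shape != d.shape:
        raise ValueError("data and mask must have the same shape")
    a = np.ma.MaskedArray(d, mask=m)
    if not (np.shares_memory(a.data, d) and np.shares_memory(a.mask, m)):
        raise RuntimeError("MaskedArray does not share memory with its inputs")
    return a

d = np.array([1, 2, 3])
m = np.array([False, False, True])
a = wrap_linked(d, m)

d[0] = 17      # change the data through the plain array
m[1] = True    # change the mask through the plain array
print(a)       # both changes should be visible through a
```

Passing an integer mask such as numpy.array([0, 0, 1]) to this helper raises TypeError instead of producing a masked array whose mask is a detached copy.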
On Feb 16, 2008 3:21 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
In short: in your example, updating d and m should work and be more efficient than updating a directly.
Cool. Thanks!
participants (2)

Alexander Michael

Pierre GM