ANN: MaskedArray as a subclass of ndarray - followup
All, I've updated this famous reimplementation of maskedarray I keep ranting about. A new feature has been introduced: hard_mask. When a masked array is created with the flag hard_mask=True, the mask can only grow, not shrink. In other words, masked values cannot be unmasked. The flag hard_mask is set to False by default. You can toggle the behavior with the `harden_mask` and `soften_mask` methods.
>>> import maskedarray as MA
>>> x = MA.array([1,2,3], mask=[1,0,0], hard_mask=True)
>>> x[0] = 999
>>> print x
[-- 2 3]
>>> x.soften_mask()
>>> x[0] = 999
>>> print x
[999 2 3]
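The same behavior can be sketched in Python 3 against numpy.ma, on the assumption that the reader has a current NumPy rather than the standalone maskedarray package from this announcement. The variable names are illustrative only.

```python
import numpy.ma as ma

# Hard mask: assigning to a masked entry leaves it masked.
x = ma.array([1, 2, 3], mask=[1, 0, 0], hard_mask=True)
x[0] = 999
masked_after_hard_write = bool(ma.getmaskarray(x)[0])
print(x)

# Soft mask: the same assignment now unmasks the entry and stores the value.
x.soften_mask()
x[0] = 999
value_after_soft_write = int(x[0])
print(x)

# harden_mask() switches the protection back on; the mask may still grow.
x.harden_mask()
x[1] = ma.masked
print(x)
```

Note that a hard mask only blocks unmasking through assignment; masking further entries is still allowed, which is what "the mask can only grow" means.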
I also put the file `timer_comparison.py`, that runs some unittests with each implementation (numpy.core.ma and maskedarray), and outputs the minimum times. On my machine, there doesn't seem to be a lot of differences, maskedarray being slightly faster. However, I'm not sure whether I can really trust the results. What would be the best way to compare the relative performances of numpy.core.ma and maskedarray? Should I run tests on a function basis? Thanks in advance for any idea/suggestion.
P.
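One way to approach the "function basis" question is to time each operation separately and keep the minimum over several repeats, since the minimum is the figure least affected by other activity on the machine. The sketch below is an assumption-laden stand-in: `best_time` is a hypothetical helper, and because numpy.core.ma and the standalone maskedarray package are not both available in a current install, it compares a plain ndarray with numpy.ma instead.

```python
import timeit

import numpy as np
import numpy.ma as ma


def best_time(func, repeat=5, number=200):
    """Return the minimum per-call time in seconds over several repeats."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


data = np.random.default_rng(0).random(10_000)
plain = data
masked = ma.array(data, mask=data > 0.9)

# One entry per operation, so a slow function shows up on its own line.
cases = {
    "ndarray.sum": lambda: plain.sum(),
    "masked.sum": lambda: masked.sum(),
    "masked + masked": lambda: masked + masked,
    "masked item lookup": lambda: masked[5000],
}

results = {name: best_time(func) for name, func in cases.items()}
for name, seconds in results.items():
    print(f"{name:20s} {seconds * 1e6:10.2f} us per call")
```

Timing whole unit tests mixes many operations together, so a per-function table like this makes it easier to see where two implementations actually differ.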
Pierre GM wrote:
All, I've updated this famous reimplementation of maskedarray I keep ranting about. [...] I also put the file `timer_comparison.py`, that runs some unittests with each implementation (numpy.core.ma and maskedarray), and outputs the minimum times. On my machine, there doesn't seem to be a lot of differences, maskedarray being slightly faster.
Same for mine: Thinkpad T41, Pentium M, ubuntu Edgy:
efiring@manini:~/programs/py/tests$ python timer_comparison.py
#1..................................................
numpy.core.ma: 0.492 - 0.493
maskedarray : 0.481 - 0.482
#2..................................................
numpy.core.ma: 1.440 - 1.440
maskedarray : 1.215 - 1.215
#3..................................................
numpy.core.ma: 2.272 - 2.274
maskedarray : 2.156 - 2.156
I admit that I have not studied the question, but my impression is that you have made some nice improvements. Numpy unified the Numeric/numarray split, but now we have a MaskedArray split. Any prospect for unification, say in numpy 1.1? Might it make sense for maskedarray to replace numpy.core.ma in 1.1?
Eric
Eric Firing wrote:
Pierre GM wrote:
All, I've updated this famous reimplementation of maskedarray I keep ranting about.
[...]
I also put the file `timer_comparison.py`, that runs some unittests with each implementation (numpy.core.ma and maskedarray), and outputs the minimum times. On my machine, there doesn't seem to be a lot of differences, maskedarray being slightly faster.
Same for mine: Thinkpad T41, Pentium M, ubuntu Edgy:
efiring@manini:~/programs/py/tests$ python timer_comparison.py
#1..................................................
numpy.core.ma: 0.492 - 0.493
maskedarray : 0.481 - 0.482
#2..................................................
numpy.core.ma: 1.440 - 1.440
maskedarray : 1.215 - 1.215
#3..................................................
numpy.core.ma: 2.272 - 2.274
maskedarray : 2.156 - 2.156
I admit that I have not studied the question, but my impression is that you have made some nice improvements. Numpy unified the Numeric/numarray split, but now we have a MaskedArray split. Any prospect for unification, say in numpy 1.1? Might it make sense for maskedarray to replace numpy.core.ma in 1.1?
This makes sense to me. I'm generally favorable to the new maskedarray (I actually like the idea of it being a sub-class). I'm just waiting for people that actually use the MaskedArray to comment. For 1.1 I would really like to move most of the often-used sub-classes of the ndarray to the C-level and merge in functionality from CVXOPT. -Travis
This makes sense to me. I'm generally favorable to the new maskedarray (I actually like the idea of it being a sub-class). I'm just waiting for people that actually use the MaskedArray to comment.
For 1.1 I would really like to move most of the often-used sub-classes of the ndarray to the C-level and merge in functionality from CVXOPT.
-Travis
I am definitely in favor of the new maskedarray implementation. I've been working with Pierre on a time series module which is a subclass of the new masked array implementation, and having it as a subclass of ndarray definitely has advantages (and no real disadvantages that I am aware of).
Moving the implementation to the C-level would be awesome. In particular, __getitem__ and __setitem__ are incredibly slow with masked arrays compared to ndarrays, so using those inside python loops is basically a really bad idea currently. You always have to work with the _data and _mask attributes directly if you are concerned about performance.
Also, there is a "bug" in Pierre's current implementation I spoke with him about, but currently have no solution for. numpy.add.accumulate doesn't work on arrays from the new maskedarray implementation, but does with the old one.
The problem seems to arise when you over-ride __getitem__ in an ndarray sub-class. See the code below for a demonstration:

import numpy
import numpy.core.umath as umath
from numpy.core.numeric import ndarray
import numpy.core.numeric as numeric

class Foo1(numeric.ndarray):
    def __new__(self, data=None):
        _data = numeric.array(data)
        return numeric.asanyarray(_data).view(self)

    def __array_finalize__(self, obj):
        if not hasattr(self, "_data"):
            if hasattr(obj,'_data'):
                self._data = obj._data
            else:
                self._data = obj

    def __array__ (self, t=None, context=None):
        return self._data

    def __array_wrap__(self, array, context=None):
        return Foo1(array)

    """ if you define this to return something other than what
    standard ndarray returns, accumulate doesn't work"""
    def __getitem__(self, index):
        return self._data[index]
        #return super(Foo1, self).__getitem__(index)

class Foo2(object):
    def __init__(self, data=None):
        self._data = numeric.array(data)

    def __array__ (self, t=None, context=None):
        return self._data

    def __array_wrap__(self, array, context=None):
        return Foo2(array)

    def __getitem__(self, index):
        return self._data[index]

    def __str__(self):
        return str(self._data)

    def __add__(self, other):
        return umath.add(self._data, other._data)

if __name__ == "__main__":
    from numpy import add
    ac = add.accumulate
    foo1 = Foo1([1,2,3,4])
    foo2 = Foo2([1,2,3,4])
    print ac(foo1), ac(foo2)
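For readers on Python 3 and a current NumPy, the variant that is commented out in Foo1 (delegating __getitem__ to ndarray) can be tried with a much smaller class. This is only a sketch: `Tagged` is a hypothetical name, and the behavior of present-day NumPy says nothing certain about the 2007 code discussed here.

```python
import numpy as np


class Tagged(np.ndarray):
    """Hypothetical ndarray subclass used only for this illustration."""

    def __new__(cls, data):
        # View the input as the subclass without copying the buffer twice.
        return np.asarray(data).view(cls)

    def __getitem__(self, index):
        # Delegate to ndarray so indexing keeps its standard behavior.
        return super().__getitem__(index)


foo = Tagged([1, 2, 3, 4])
result = np.add.accumulate(foo)
print(type(result).__name__, np.asarray(result))
```

Replacing the body of `__getitem__` with something that returns a different kind of object is the experiment the message describes.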
Matt Knox wrote:
I am definitely in favor of the new maskedarray implementation. I've been working with Pierre on a time series module which is a subclass of the new masked array implementation, and having it as a subclass of ndarray definitely has advantages (and no real disadvantages that I am aware of).
That time series module sounds very interesting! Is it available somewhere, or some documentation? Thank you, Sven
On Fri, Jan 19, 2007 at 10:56:16AM +0100, Sven Schreiber wrote:
Matt Knox wrote:
I am definitely in favor of the new maskedarray implementation. I've been working with Pierre on a time series module which is a subclass of the new masked array implementation, and having it as a subclass of ndarray definitely has advantages (and no real disadvantages that I am aware of).
That time series module sounds very interesting! Is it available somewhere, or some documentation?
It's in the scipy sandbox. Edit scipy/Lib/sandbox/enabled_packages.txt and add 'timeseries' on a line of its own, then recompile. Cheers Stéfan
That time series module sounds very interesting! Is it available somewhere, or some documentation?
Thank you, Sven
Not really any documentation yet, and the code is still in a state of flux, so expect frequent changes/additions at this point still. No concern is being given to "backwards compatibility" right now either, so if you download the code now, expect stuff to break when you update to a newer version. Documentation will be available eventually, it's just a matter of time. You can look at the example script for some ideas, but the example script isn't fully updated to reflect the latest code either. If you do play around with it though, I would love to hear your thoughts on it and any criticisms/suggestions you may have. - Matt
On Thu, Jan 18, 2007 at 06:18:22PM +0000, Matt Knox wrote:
For 1.1 I would really like to move most of the often-used sub-classes of the ndarray to the C-level and merge in functionality from CVXOPT.
Moving the implementation to the C-level would be awesome. In particular, __getitem__ and __setitem__ are incredibly slow with masked arrays compared to ndarrays, so using those inside python loops is basically a really bad idea currently. You always have to work with the _data and _mask attributes directly if you are concerned about performance.
Moving the implementation to the C-level also has its downside. To me, at least, Python code is much more readable and hence easier to maintain. Is there a way that we can implement only the speed-critical methods in C? Cheers Stéfan
Moving the implementation to the C-level also has its downside. To me, at least, Python code is much more readable and hence easier to maintain.
Is there a way that we can implement only the speed-critical methods in C?
Cheers Stéfan
Implementing the whole thing in C also has the side benefit of the possibility making a nice C level api available to these sub-classes. And I suspect the core numpy developers are comfortable enough with C that maintainability is *probably* not a huge concern here. But yeah, implementing even just the speed critical parts in C would still be a nice improvement. - Matt
On Fri, Jan 19, 2007 at 02:13:52PM +0000, Matt Knox wrote:
Moving the implementation to the C-level also has its downside. To me, at least, Python code is much more readable and hence easier to maintain.
Is there a way that we can implement only the speed-critical methods in C?
Cheers Stéfan
Implementing the whole thing in C also has the side benefit of the possibility making a nice C level api available to these sub-classes. And I suspect the core numpy developers are comfortable enough with C that maintainability is *probably* not a huge concern here.
A "nice C level api" sounds like the definition of oxymoron :) Why would we argue for more C than absolutely necessary in a Python-based library? It's not trivial to code or maintain C API based code properly. Isn't that, to some extent at least, why projects like PyPy receive so much attention? I wouldn't be surprised if such a translation of MA has many more bugs than the Python version, unless consistent and rigid unit testing is being done. Have a nice weekend! Cheers Stéfan
On Fri, 19 Jan 2007 at 18:08 +0200, Stefan van der Walt wrote:
On Fri, Jan 19, 2007 at 02:13:52PM +0000, Matt Knox wrote:
Moving the implementation to the C-level also has its downside. To me, at least, Python code is much more readable and hence easier to maintain.
Is there a way that we can implement only the speed-critical methods in C?
Cheers Stéfan
Implementing the whole thing in C also has the side benefit of the possibility making a nice C level api available to these sub-classes. And I suspect the core numpy developers are comfortable enough with C that maintainability is *probably* not a huge concern here.
A "nice C level api" sounds like the definition of oxymoron :) Why would we argue for more C than absolutely necessary in a Python-based library?
I agree. In that sense, it would be nice to have a look at Pyrex and implement the main classes with it. Its syntax is very similar to Python, so doing this kind of translation should be pretty easy (at least, much easier than doing it in pure C). Pyrex has parts of its syntax that are more C-flavored, meant to reach high performance (similar to C code) in the parts of the code that really need it, and it also lets you link with pure C code easily. Its only major disadvantage from my point of view is that it doesn't support easy indexing of multi-dimensional arrays (you have to simulate the indexes through unidimensional indexes, but this has never been a problem for me). If the amount of multidimensional work that you have to do in your code is small (if any), then it becomes a very powerful tool for writing extensions IMO (you may already know that the random sub-package in NumPy has been built using Pyrex).
My 2 cents,
--
Francesc Altet, Carabos Coop. V., www.carabos.com
"Be careful about using the following code -- I've only proven that it works, I haven't tested it." -- Donald Knuth
Stefan van der Walt wrote:
A "nice C level api" sounds like the definition of oxymoron :) Why would we argue for more C than absolutely necessary in a Python-based library?
Well, it's more often absolutely necessary than you might think. Any common operation on the array should be written in C to avoid the Python function call overhead. We learned that with recarrays. The a.column_name notation (equivalent to a['column_name']) is implemented in a Python method, __getattribute__. If you happen to use this in a tight loop, you'll find that __getattribute__ is a bottleneck. Unfortunately, the methods you might want to add or override on an ndarray subclass are going to be the ones that you would want to be efficient.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
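The recarray point can be illustrated with a small timing sketch in Python 3. The field names `idx` and `value` and both helper functions are hypothetical; the only claim is that attribute access and key access return the same column, while the first goes through the Python-level __getattribute__ on every use.

```python
import timeit

import numpy as np

plain = np.zeros(1000, dtype=[("idx", "i8"), ("value", "f8")])
plain["value"] = np.arange(1000)
rec = plain.view(np.recarray)

# Both spellings give the same column.
same = np.array_equal(rec.value, rec["value"])


def attribute_in_loop():
    total = 0.0
    for i in range(len(rec)):
        total += rec.value[i]      # attribute lookup on every iteration
    return total


def hoisted():
    column = rec["value"]          # look the column up once, outside the loop
    total = 0.0
    for i in range(len(column)):
        total += column[i]
    return total


print("attribute in loop:", timeit.timeit(attribute_in_loop, number=20))
print("hoisted lookup:   ", timeit.timeit(hoisted, number=20))
```

Hoisting the lookup is the pure-Python workaround; the argument in the message is that a C-level implementation would make the workaround unnecessary.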
Matt Knox wrote:
Moving the implementation to the C-level also has its downside. To me, at least, Python code is much more readable and hence easier to maintain.
Is there a way that we can implement only the speed-critical methods in C?
Cheers Stéfan
Implementing the whole thing in C also has the side benefit of the possibility making a nice C level api available to these sub-classes. And I suspect the core numpy developers are comfortable enough with C that maintainability is *probably* not a huge concern here.
But yeah, implementing even just the speed critical parts in C would still be a nice improvement.
Part of the trouble is that sometimes the speed critical parts are how slow Python functions (e.g. __getitem__) are. You want to avoid that function call and the only way to do that (that I know of) is to create the sub-class in C. -Travis
In article <45B13883.7080103@ee.byu.edu>, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Matt Knox wrote:
Moving the implementation to the C-level also has its downside. To me, at least, Python code is much more readable and hence easier to maintain.
Is there a way that we can implement only the speed-critical methods in C?
Cheers Stéfan
Implementing the whole thing in C also has the side benefit of the possibility making a nice C level api available to these sub-classes. And I suspect the core numpy developers are comfortable enough with C that maintainability is *probably* not a huge concern here.
But yeah, implementing even just the speed critical parts in C would still be a nice improvement.
Part of the trouble is that sometimes the speed critical parts are how slow Python functions (e.g. __getitem__) are. You want to avoid that function call and the only way to do that (that I know of) is to create the sub-class in C.
I'm curious why the low level stuff is in C instead of C++? I would have thought that C++ templates and possibly even the standard template library would be a huge win for coding array-type classes. -- Russell
Russell E. Owen wrote:
I'm curious why the low level stuff is in C instead of C++? I would have thought that C++ templates and possibly even the standard template library would be a huge win for coding array-type classes.
I don't know the specifics for numpy, but C++ has huge problems compared to C for code which is meant to be used by other languages: C++ has no ABI standard, and loading C++ classes dynamically through dlopen mechanisms is difficult (it basically means rewriting a C API over C++, AFAIK). Also, templates are extremely difficult to use in a sensible way, and many advanced tricks using templates are not well supported by often-used compilers (looking at the Boost sources for compiler-specific workarounds would give you an idea). In numeric code, templates are useful to have one implementation for all the C types available (float, double, int, etc.), but those are cases which can easily be handled by code generator tools (like autogen; numpy is using its own). There are cases where C++ is useful compared to C; I don't think numpy is one of them.
David
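The code-generator alternative to templates can be sketched in a few lines of Python. This is not numpy's own generator, whose format is not described in this thread; it is a hypothetical illustration using the standard library's string.Template to stamp out one C loop per type.

```python
from string import Template

# Hypothetical C template: one element-wise add loop per C type.
LOOP = Template("""\
static void ${name}_add(const ${ctype} *a, const ${ctype} *b,
                        ${ctype} *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
""")

# Mapping from function-name prefix to C type (illustrative selection).
C_TYPES = {"float": "float", "double": "double", "int": "int"}


def generate_loops(types=C_TYPES):
    """Return C source containing one add loop for every listed type."""
    return "\n".join(
        LOOP.substitute(name=name, ctype=ctype) for name, ctype in types.items()
    )


source = generate_loops()
print(source)
```

The generated text is plain C, so the result keeps a C ABI and needs no template support from the compiler, which is the trade-off the message argues for.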
Eric, Travis, Thanks for the words of encouragement :) I'm all in favor of having maskedarray ported to C, but I won't be able to do it myself anytime soon. And I would have to learn C beforehand. Francesc's suggestion of using Pyrex sounds nice, I'll try and see what I can do with that.
Moving the implementation to the C-level would be awesome. In particular, __getitem__ and __setitem__ are incredibly slow with masked arrays compared to ndarrays, so using those inside python loops is basically a really bad idea currently. You always have to work with the _data and _mask attributes directly if you are concerned about performance.
Well, yeah, that's expected: __getitem__ tests whether the mask is defined (not nomask) before trying to access the item. If you're using it in a loop, you call the test each time, which is a bad idea. It's indeed far better to call the test beforehand, and process _data and _mask separately. A fix would be to force the mask to an array of booleans all the time, but that would slow things down elsewhere, as a lot of functions are artificially accelerated with the nomask trick. A C implementation may render that trick obsolete... Another possibility would be to force the mask to be a bool array, and keep an extra flag on top, like hasmask. hasmask would be False by default, and set to True only if the mask contains at least one True. That'd require a mask.any() in __array_finalize__, which might still slow things down.
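The "test once, then work on the data and mask separately" advice can be sketched against numpy.ma in Python 3, assuming its public helpers getdata, getmask, getmaskarray and the nomask sentinel. `masked_total` is a hypothetical function written for this illustration.

```python
import numpy as np
import numpy.ma as ma


def masked_total(x):
    """Sum the unmasked entries, testing for a mask once rather than per item."""
    data = ma.getdata(x)
    mask = ma.getmask(x)
    if mask is ma.nomask:
        # No mask was ever defined: plain ndarray arithmetic is enough.
        return data.sum()
    return data[~mask].sum()


no_mask = ma.array([1.0, 2.0, 3.0])
with_mask = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

print(ma.getmask(no_mask) is ma.nomask)   # the nomask shortcut
print(ma.getmaskarray(no_mask))           # the same mask forced to a full bool array
print(masked_total(no_mask), masked_total(with_mask))
```

getmaskarray corresponds to the "always a boolean array" option mentioned above, while the `is ma.nomask` test is the shortcut that avoids allocating one.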
Also, there is a "bug" in Pierre's current implementation I spoke with him about, but currently have no solution for. numpy.add.accumulate doesn't work on arrays from the new maskedarray implementation, but does with the old one.
The fact that it works with 'old' masked arrays doesn't count: they're not real ndarrays. They use the __array__ method to communicate with the rest of numpy, that we shouldn't need.
The problem seems to arise when you over-ride __getitem__ in an ndarray sub-class. See the code below for a demonstration:
I'm not sure that's actually the source of the problem. ufuncs use the __array_wrap__ method to communicate with subclasses. ufunc methods seem to bypass that. In the meantime, the methods of the ufuncs in MA work as expected. Could somebody give me a simple explanation of the behaviour of ufunc methods, on the Python side? I'm obviously missing something here...
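One way to investigate the question is to record when __array_wrap__ is invoked. The Python 3 sketch below uses a hypothetical `Logged` subclass and a module-level `calls` list; whether a ufunc method such as accumulate goes through __array_wrap__ depends on the NumPy version, so the script prints what it observes instead of assuming an answer.

```python
import numpy as np

calls = []


class Logged(np.ndarray):
    """Hypothetical subclass that records each __array_wrap__ call."""

    def __new__(cls, data):
        return np.asarray(data).view(cls)

    def __array_wrap__(self, arr, context=None, return_scalar=False):
        # context is (ufunc, args, output_index) when the caller supplies it.
        calls.append(None if context is None else context[0].__name__)
        return np.asarray(arr).view(type(self))


a = Logged([1, 2, 3, 4])

total = np.add(a, a)                # plain ufunc call
calls_after_add = list(calls)

running = np.add.accumulate(a)      # ufunc method
calls_after_accumulate = list(calls)

print("after np.add:           ", calls_after_add)
print("after np.add.accumulate:", calls_after_accumulate)
```

Comparing the two printed lists shows whether the ufunc method added a wrap call of its own on the installed version.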
participants (10)
- David Cournapeau
- Eric Firing
- Francesc Altet
- Matt Knox
- Pierre GM
- Robert Kern
- Russell E. Owen
- Stefan van der Walt
- Sven Schreiber
- Travis Oliphant