[Numpy-discussion] consensus (was: NA masks in the next numpy release?)
Eric Firing
efiring at hawaii.edu
Sat Oct 29 20:24:06 EDT 2011
On 10/29/2011 12:57 PM, Charles R Harris wrote:
>
>
> On Sat, Oct 29, 2011 at 4:47 PM, Eric Firing <efiring at hawaii.edu> wrote:
>
> On 10/29/2011 12:02 PM, Olivier Delalleau wrote:
>
> >
> > I haven't been following the discussion closely, but wouldn't it be instead:
> > a.mask[0:2] = True?
>
> That would be consistent with numpy.ma and the opposite of Mark's
> implementation.
>
> I can live with either, but I much prefer the numpy.ma version because
> it fits with the use of bit-flags for editing data; set bit 1 if it
> fails check A, set bit 2 if it fails check B, etc. So, if it evaluates
> as True, there is a problem, and the value is masked *out*.
>
> Similarly, in Mark's implementation, 7 bits are available for a payload
> to describe what kind of masking is meant. This seems more consistent
> with True as masked (or NA) than with False as masked.
>
>
> I wouldn't rely on the 7 bits yet. Mark left them available to keep open
> possible future use, but didn't implement anything using them yet. If
> memory use turns out to exclude whole sectors of application we will
> have to go to bit masks.
Right; I was only commenting on a subjective sense of internal
consistency. A minor point.
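
To make the bit-flag argument concrete, here is a minimal sketch of the "True means masked out" convention using numpy.ma. The flag names, thresholds, and data are hypothetical and chosen only for illustration; nothing here reflects the in-core NA API.

```python
import numpy as np

# Hypothetical quality-control flags (names and thresholds are assumptions).
FAIL_RANGE = 1 << 0  # "bit 1": value is outside a plausible range
FAIL_SPIKE = 1 << 1  # "bit 2": value jumps too far from its predecessor

data = np.array([10.0, 11.0, 99.0, 12.0, -5.0])

# Start with no flags set, then OR in one bit per failed check.
flags = np.zeros(data.shape, dtype=np.uint8)
flags[(data < 0.0) | (data > 50.0)] |= FAIL_RANGE
jump = np.abs(np.diff(data, prepend=data[0]))
flags[jump > 20.0] |= FAIL_SPIKE

# numpy.ma convention: any set bit evaluates as True, so the value is masked out.
mask = flags != 0
masked = np.ma.array(data, mask=mask)

# The opposite convention (True means the value is exposed) is just the inverse.
valid = ~mask

print(flags)          # per-element flag bits
print(masked)         # flagged elements are shown as masked
print(masked.mean())  # mean over the unmasked values only
```

With the numpy.ma convention the flag array converts to a mask with a single comparison, and the flag bits remain available to say why a value was rejected.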
The larger context of all this is how users end up being able to work
with all the different types and specifications of "NA" (in the most
general sense) data:
1) nans
2) numpy.ma
3) masks in the core (Mark's new code)
4) bit patterns
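
As a rough illustration of how these approaches differ in practice, the sketch below computes a mean with a NaN marker, with numpy.ma, and with a reserved sentinel value standing in for a bit pattern. The sentinel value -999 and all variable names are assumptions made for the example. Item 3 is omitted because the in-core mask API may not be available in the NumPy you have installed.

```python
import numpy as np

raw = np.array([1.0, np.nan, 3.0, 4.0])

# 1) NaN as the missing-value marker: a plain reduction propagates NaN,
#    so a NaN-aware function is needed to skip the missing element.
plain_mean = raw.mean()
nan_mean = np.nanmean(raw)

# 2) numpy.ma: a separate boolean mask, where True means "masked".
ma_values = np.ma.masked_invalid(raw)
ma_mean = ma_values.mean()

# Integer dtypes have no NaN, so a mask still works where NaN cannot.
int_masked = np.ma.array([5, -999, 7], mask=[False, True, False])
int_masked_mean = int_masked.mean()

# 4) Bit patterns: a reserved value inside the dtype marks missing data.
#    Emulated here with a hypothetical sentinel; the caller must filter it.
SENTINEL = -999
ints = np.array([5, SENTINEL, 7])
sentinel_mean = ints[ints != SENTINEL].mean()

print(plain_mean, nan_mean, ma_mean, int_masked_mean, sentinel_mean)
```

The NaN and sentinel approaches use no extra memory but give up one value of the dtype, while the mask costs extra storage and leaves the underlying data intact.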
Substantial code now in place--including matplotlib--relies on numpy.ma.
It has some rough edges, it can be slow, it is a pain having it as a
bolted-on module, it may be more complicated than it needs to be, but it
fits a lot of use cases pretty well. There are many users. Everyone
using matplotlib is using it, whether they know it or not.
The ideal from my numpy.ma-user's standpoint would be an NA-handling
implementation in the core that would do two things:
(1) allow a gradual transition away from numpy.ma, so that the latter
would become redundant.
(2) allow numpy.ma to be reasonably easily modified to use the in-core
facilities for greater efficiency during the long transition. Implicit
is the hope that someone (most likely not me, although I might be able
to help a bit) would actually perform this modification.
Mark's mission, paid for by Enthought, was not to please numpy.ma users,
but to add NA-handling that would be comfortable for R-users. He chose
to do so with the idea that two possible implementations (masks and
bit patterns) were desirable, each with strengths and weaknesses, and
that, to get *something* done in the very short time he had left,
he would start with the mask implementation. We now have the result,
incomplete, but not breaking anything. Additional development (coding
as well as designing) will be needed.
The main question raised by Matthew and Nathaniel is, I think, whether
Mark's code should develop in a direction away from the R-compatibility
model, with the idea that the latter would be handled via a bit-pattern
implementation, some day, when someone codes it; or whether it should
remain as the prototype and first implementation of an API to handle the
R-compatible use case, minimizing any divergence from any eventual
bit-pattern implementation.
The answer to this depends on several questions, including:
1) Who is available to do how much implementation of any of the
possibilities? My reading of Travis's blog and rare posts to this list
suggests that he hopes and expects to be able to free up coding time.
Perhaps he will clarify that soon.
2) What sorts of changes would actually be needed to make the present
implementation good enough for the R use case? Evolutionary, or
revolutionary?
3) What sorts of changes would help with the numpy.ma use case?
Evolutionary, or revolutionary?
4) Given available resources, how can we maximize progress: making numpy
more capable, easier to use, etc.?
Unless the answers to questions 2 *and* 3 are "revolutionary", I don't
see the point in pulling Mark's changes out of master. At most, the
documentation might be changed to mark the NA API as "experimental" for
a release or two.
Overall, I think that the differences between the R use case and the ma
use case have been overstated and over-emphasized.
Eric
>
> Chuck