[Numpy-discussion] consensus (was: NA masks in the next numpy release?)

Sat Oct 29 20:24:06 EDT 2011

On 10/29/2011 12:57 PM, Charles R Harris wrote:
>
>
> On Sat, Oct 29, 2011 at 4:47 PM, Eric Firing <efiring at hawaii.edu> wrote:
>
>     On 10/29/2011 12:02 PM, Olivier Delalleau wrote:
>
>      >
>      > I haven't been following the discussion closely, but wouldn't it
>      > be instead:
>      > a.mask[0:2] = True?
>
>     That would be consistent with numpy.ma and the opposite of Mark's
>     implementation.
>
>     I can live with either, but I much prefer the numpy.ma version because
>     it fits with the use of bit-flags for editing data; set bit 1 if it
>     fails check A, set bit 2 if it fails check B, etc.  So, if it evaluates
>     as True, there is a problem, and the value is masked *out*.
>
>     Similarly, in Mark's implementation, 7 bits are available for a payload
>     to describe what kind of masking is meant.  This seems more consistent
>     with True as masked (or NA) than with False as masked.
>
>
> I wouldn't rely on the 7 bits yet. Mark left them available to keep open
> possible future use, but didn't implement anything using them yet. If
> memory use turns out to exclude whole sectors of application we will
> have to go to bit masks.

Right; I was only commenting on a subjective sense of internal 
consistency.  A minor point.
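
To make the bit-flag argument concrete, here is a minimal sketch of the
editing workflow described in the quoted text, using only numpy.ma.  The
flag names, thresholds, and sample data are hypothetical and chosen purely
for illustration; they do not come from any real dataset or from Mark's
code.

```python
import numpy as np

# Hypothetical quality-control flags: one bit per failed check.
FAILED_CHECK_A = 1 << 0  # bit 1
FAILED_CHECK_B = 1 << 1  # bit 2

# Made-up sample data for illustration only.
data = np.array([1.0, 250.0, 3.0, -999.0, 5.0])
flags = np.zeros(data.shape, dtype=np.uint8)

# Set a bit wherever a check fails (thresholds are arbitrary assumptions).
flags[data > 100.0] |= FAILED_CHECK_A  # check A: implausibly large
flags[data < 0.0] |= FAILED_CHECK_B    # check B: negative value

# A nonzero flag means "there is a problem", which maps directly onto the
# numpy.ma convention: True in the mask means the value is masked *out*.
masked = np.ma.masked_array(data, mask=(flags != 0))

print(masked)
print(masked.mean())  # mean over the unmasked values only
```

With the opposite convention (True meaning "valid"), the last step would
need an inversion of the flag test, which is the small inconsistency being
pointed at.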

The larger context of all this is how users end up being able to work 
with all the different types and specifications of "NA" (in the most 
general sense) data:

1) nans
2) numpy.ma
3) masks in the core (Mark's new code)
4) bit patterns
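
As a rough illustration of how the first two options differ for a user,
the sketch below computes the same statistic with NaNs and with numpy.ma.
The sample values are made up.  Options 3 and 4 are not sketched, since no
particular API for them is assumed here.

```python
import numpy as np

# Made-up sample data; the second element is treated as missing.
values = np.array([1.0, 2.0, 3.0, 4.0])

# Option 1: NaN as the missing-value marker (floating point only).
with_nan = values.copy()
with_nan[1] = np.nan
valid = ~np.isnan(with_nan)
nan_mean = with_nan[valid].mean()  # the user must skip NaNs explicitly

# Option 2: numpy.ma, where True in the mask means "masked out".
with_mask = np.ma.masked_array(values, mask=[False, True, False, False])
ma_mean = with_mask.mean()  # masked entries are skipped automatically

# Integer arrays cannot hold NaN, but they can carry a mask.
int_masked = np.ma.masked_array([1, 2, 3], mask=[False, True, False])

print(nan_mean, ma_mean, int_masked.sum())
```

Both approaches give the same mean here; the difference is in who keeps
track of the missing entries and in which dtypes are supported.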

Substantial code now in place--including matplotlib--relies on numpy.ma.
It has some rough edges, it can be slow, it is a pain having it as a
bolted-on module, and it may be more complicated than it needs to be, but
it fits a lot of use cases pretty well.  There are many users.  Everyone
using matplotlib is using it, whether they know it or not.

The ideal from my numpy.ma-user's standpoint would be an NA-handling
implementation in the core that would do two things:
(1) allow a gradual transition away from numpy.ma, so that the latter 
would become redundant.
(2) allow numpy.ma to be reasonably easily modified to use the in-core 
facilities for greater efficiency during the long transition.  Implicit 
is the hope that someone (most likely not me, although I might be able 
to help a bit) would actually perform this modification.

Mark's mission, paid for by Enthought, was not to please numpy.ma users, 
but to add NA-handling that would be comfortable for R-users.  He chose 
to do so with the idea that two possible implementations (masks and 
bit patterns) were desirable, each with strengths and weaknesses, and
that so as to get *something* done in the very short time he had left, 
he would start with the mask implementation.  We now have the result, 
incomplete, but not breaking anything.  Additional development (coding 
as well as designing) will be needed.

The main question raised by Matthew and Nathaniel is, I think, whether 
Mark's code should develop in a direction away from the R-compatibility 
model, with the idea that the latter would be handled via a bit-pattern 
implementation, some day, when someone codes it; or whether it should 
remain as the prototype and first implementation of an API to handle the 
R-compatible use case, minimizing any divergence from any eventual 
bit-pattern implementation.

The answer to this depends on several questions, including:

1) Who is available to do how much implementation of any of the
possibilities?  My reading of Travis's blog and rare posts to this list
suggests that he hopes and expects to be able to free up coding time.
Perhaps he will clarify that soon.

2) What sorts of changes would actually be needed to make the present 
implementation good enough for the R use case?  Evolutionary, or 
revolutionary?

3) What sorts of changes would help with the numpy.ma use case?
Evolutionary, or revolutionary?

4) Given available resources, how can we maximize progress: making numpy
more capable, easier to use, etc.?

Unless the answers to questions 2 *and* 3 are "revolutionary", I don't 
see the point in pulling Mark's changes out of master.  At most, the 
documentation might be changed to mark the NA API as "experimental" for 
a release or two.

Overall, I think that the differences between the R use case and the ma 
use case have been overstated and over-emphasized.

Eric

>
> Chuck