[Numpy-discussion] Concepts for masked/missing data

Sat Jun 25 14:32:54 EDT 2011

On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith <njs at pobox.com> wrote:

> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett <matthew.brett at gmail.com>
> wrote:
> > So far I see the difference between 1) and 2) being that you cannot
> > unmask.  So, if you didn't even know you could unmask data, then it
> > would not matter that 1) was being implemented by masks?
>
> I guess that is a difference, but I'm trying to get at something more
> fundamental -- not just what operations are allowed, but what
> operations people *expect* to be allowed.

That is a much trickier problem.

>
> Here's another possible difference -- in (1), intuitively, missingness
> is a property of the data, so the logical place to put information
> about whether you can expect missing values is in the dtype, and to
> enable missing values you need to make a new array with a new dtype.
> (If we use a mask-based implementation, then
> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
> to skip making a copy of the data -- I'm talking ONLY about the
> interface here, not whether missing data has a different storage
> format from non-missing data.)
>
> In (2), the whole point is to use different masks with the same data,
> so I'd argue masking should be a property of the array object rather
> than the dtype, and the interface should logically allow masks to be
> created, modified, and destroyed in place.
>
>
I can agree with this distinction.  However, if "missingness" is an
intrinsic property of the data, then shouldn't users be implementing their
own dtype tailored to the data they are using?  In other words, how far does
the core of NumPy need to go to address this issue?  And how far would be
"too much"?
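
To make the distinction concrete, here is a rough sketch that uses only what
NumPy already ships.  It is not the interface being proposed in this thread.
The numpy.ma module stands in for concept (2), and NaN in a float array is
only an approximation of concept (1), since it is a special value rather than
a dtype-level NA.  The variable names are made up for illustration.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Concept (2): a mask layered on top of data that itself never changes.
masked = np.ma.masked_array(data, mask=[False, True, False, False])
mean_with_mask = masked.mean()             # skips the masked 2.0

masked.mask = [False, False, False, True]  # a different mask, same data
mean_other_mask = masked.mean()            # skips the masked 4.0

masked.mask = np.ma.nomask                 # clear the mask; every value is visible again
mean_unmasked = masked.mean()

# Concept (1), approximated with NaN: missingness is stored in the values,
# so the original value is gone once the element is marked missing.
missing = data.copy()
missing[1] = np.nan
mean_skipping_nan = np.nanmean(missing)    # skips the NaN element
```

In the masked version the underlying data array is untouched throughout, so
any mask can be applied or removed later.  In the NaN version there is
nothing to "unmask", which is the expectation Nathaniel describes for (1).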

> They're both internally consistent, but I think we might have to make
> a decision and stick to it.
>
>
Of course.  I think that Mark has a very inspired idea: giving the R
audience what they want (np.NA), while simultaneously making masked arrays
even easier to use (which I can certainly appreciate).

> > I agree it's good to separate the API from the implementation.   I
> > think the implementation is also important because I care about memory
> > and possibly speed.  But, that is a separate problem from the API...
>
> Yes, absolutely memory and speed are important. But a really fast
> solution to the wrong problem isn't so useful either :-).
>
>
The one thing I have always loved about Python (and NumPy) is that "it
respects the developer's time".  I come from a C++ background where I found
C++ to be powerful, but tedious.  I went to Matlab because it was just
straight-up easier to code math and display graphs.  (If anybody here ever
used GrADS, then you know how badly I wanted a language that respected
my time.)  However, even Matlab couldn't fully respect my time, as I usually
ended up wasting it trying to get various pieces working.  Python came along,
and while it didn't always match the speed of some of my Matlab programs, it
was "fast enough".

I will add a small disclaimer.  I once had to use S+ for a class.  To
be honest, it was the worst programming experience of my life.  This
experience may be coloring my perception of R's approach to handling missing
data.

Ben Root