[Numpy-discussion] missing data discussion round 2

Mark Wiebe mwwiebe at gmail.com
Thu Jun 30 10:52:52 EDT 2011


On Wed, Jun 29, 2011 at 1:07 PM, Dag Sverre Seljebotn <
d.s.seljebotn at astro.uio.no> wrote:

> On 06/29/2011 07:38 PM, Mark Wiebe wrote:
> > On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn
> > <d.s.seljebotn at astro.uio.no> wrote:
> >
> >     On 06/29/2011 03:45 PM, Matthew Brett wrote:
> >      > Hi,
> >      >
> >      > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe
> >      > <mwwiebe at gmail.com> wrote:
> >      >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett
> >      >> <matthew.brett at gmail.com> wrote:
> >      >>>
> >      >>> Hi,
> >      >>>
> >      >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith
> >      >>> <njs at pobox.com> wrote:
> >      >>> ...
> >      >>>> (You might think, what difference does it make if you *can*
> >      >>>> unmask an item? Us missing data folks could just ignore this
> >      >>>> feature. But: whatever we end up implementing is something that
> >      >>>> I will have to explain over and over to different people, most of
> >      >>>> them not particularly sophisticated programmers. And there's just
> >      >>>> no sensible way to explain this idea that if you store some
> >      >>>> particular value, then it replaces the old value, but if you
> >      >>>> store NA, then the old value is still there.
> >      >>>
> >      >>> Ouch - yes.  No question, that is difficult to explain.   Well, I
> >      >>> think the explanation might go like this:
> >      >>>
> >      >>> "Ah, yes, well, that's because in fact numpy records missing
> >      >>> values by using a 'mask'.   So when you say `a[3] = np.NA`, what
> >      >>> you mean is,
> >      >>> `a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
> >      >>>
> >      >>> Is that fair?
> >      >>
> >      >> My favorite way of explaining it would be to have a grid of
> >      >> numbers written on paper, then have several cardboards with holes
> >      >> poked in them in different configurations. Placing these cardboard
> >      >> masks in front of the grid would show different sets of non-missing
> >      >> data, without affecting the values stored on the paper behind them.
> >      >
> >      > Right - but here of course you are trying to explain the mask, and
> >      > this is Nathaniel's point, that in order to explain NAs, you have to
> >      > explain masks, and so, even at a basic level, the fusion of the two
> >      > ideas is obvious, and already confusing.  I mean this:
> >      >
> >      > a[3] = np.NA
> >      >
> >      > "Oh, so you just set the a[3] value to have some missing value
> >      > code?"
> >      >
> >      > "Ah - no - in fact what I did was set an associated mask in position
> >      > a[3] so that you can no longer see the previous value of a[3]"
> >      >
> >      > "Huh.  You mean I have a mask for every single value in order to be
> >      > able to blank out a[3]?  It looks like an assignment.  I mean, it
> >      > looks just like a[3] = 4.  But I guess it isn't?"
> >      >
> >      > "Er..."
> >      >
> >      > I think Nathaniel's point is a very good one - these are separate
> >      > ideas, np.NA and np.IGNORE, and a joint implementation is bound to
> >      > draw them together in the mind of the user.    Apart from anything
> >      > else, the user has to know that, if they want a single NA value in
> >      > an array, they have to add a mask of size array.shape in bytes.
> >      > They have to know then, that NA is implemented by masking, and then
> >      > the 'NA for free by adding masking' idea breaks down and starts to
> >      > feel like a kludge.
> >      >
> >      > The counter argument is of course that, in time, the implementation
> >      > of NA with masking will seem as obvious and intuitive, as, say,
> >      > broadcasting, and that we are just reacting from lack of experience
> >      > with the new API.
> >
> >     However, no matter how used we get to this, people coming from almost
> >     any other tool (in particular R) will keep thinking it is
> >     counter-intuitive. Why set up a major semantic incompatibility that
> >     people then have to overcome in order to start using NumPy?
> >
> >
> > I'm not aware of a semantic incompatibility. I believe R doesn't support
> > views like NumPy does, so the things you have to do to see masking
> > semantics aren't even possible in R.
>
> Well, whether the same feature is possible or not in R is irrelevant to
> whether a semantic incompatibility would exist.
>
> Views themselves are a *major* semantic incompatibility, and are highly
> confusing at first to MATLAB/Fortran/R people. However they have major
> advantages outweighing the disadvantage of having to caution new users.
>
> But there's simply no precedent anywhere for an assignment that doesn't
> erase the old value for a particular input value, and the advantages
> seem pretty minor (well, I think it is ugly in its own right, but that
> is beside the point...)
>

I disagree that there's no precedent, but maybe there isn't something which
is exactly the same as my design. The whole "actual real literal assignment"
thought process leads to considerations of little gnomes writing numbers on
pieces of paper inside your computer...
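
For concreteness, here is a minimal sketch of the masking semantics being
debated. It uses the existing numpy.ma module as a stand-in, which is an
assumption made purely for illustration: it is not the proposed np.NA design,
and the proposal may end up behaving differently. Note also that numpy.ma uses
True to mean "hidden", the opposite of the mask convention in the quoted
explanation above.

```python
import numpy as np
import numpy.ma as ma

# The "paper": plain values that the masks never modify.
data = np.array([10.0, 11.0, 12.0, 13.0, 14.0])

# Two "cardboards" over the same paper. copy=False is the default, so both
# masked arrays reference `data` rather than copying it.
a = ma.masked_array(data, mask=np.zeros(data.shape, dtype=bool))
b = ma.masked_array(data, mask=[True, False, False, False, True])

# Masking an element hides it but leaves the stored value alone.
a[3] = ma.masked
hidden_value = a.data[3]   # the old value is still underneath

# Each mask exposes a different subset of the same values.
sum_a = a.sum()            # 10 + 11 + 12 + 14
sum_b = b.sum()            # 11 + 12 + 13

# Unmasking brings the old value back into view.
a.mask[3] = False
restored = a[3]

# An ordinary assignment, by contrast, really does overwrite the paper.
a[0] = 4.0
```

The asymmetry under discussion is visible in the last two steps: hiding and
unhiding a[3] never touches `data`, whereas `a[0] = 4.0` replaces the stored
value for every view of it.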

-Mark


>
> Dag Sverre