[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Mark Wiebe mwwiebe at gmail.com
Fri Jun 24 12:45:04 EDT 2011


On Fri, Jun 24, 2011 at 6:59 AM, Matthew Brett <matthew.brett at gmail.com> wrote:

> Hi,
>
> On Fri, Jun 24, 2011 at 2:32 AM, Nathaniel Smith <njs at pobox.com> wrote:
> ...
> > If we think that the memory overhead for floating point types is too
> > high, it would be easy to add a special case where maybe(float) used a
> > distinguished NaN instead of a separate boolean. The extra complexity
> > would be isolated to the 'maybe' dtype's inner loop functions, and
> > transparent to the Python level. (Implementing a similar optimization
> > for the masking approach would be really nasty.) This would change the
> > overhead comparison to 0% versus 12.5% in favor of the dtype approach.
>
> Can I take this chance to ask Mark a bit more about the problems he
> sees for dtypes with missing values?  That is, having dtypes like
>
> np.float64_with_missing
> np.int32_with_missing
>
> I see in your NEP you say 'The trouble with this
> approach is that it requires a large amount of special case code in
> each data type, and writing a new data type supporting missing data
> requires defining a mechanism for a special signal value which may not
> be possible in general.'
>
> Just to be clear, you are saying that, for each dtype, there
> needs to be some code doing:
>
> missing_value = dtype.missing_value
>
> then, in loops:
>
> if val[here] == missing_value:
>    do_something()
>
> and the fact that 'missing_value' could be any type would make the
> code more complicated than the current case where the mask is always
> bools or something?
>

I'm referring to the underlying C implementations of the dtypes and any
additional custom dtypes that people create. With the masked approach, you
implement a new custom data type in C, and it automatically works with
missing data. With the custom dtype approach, you have to do a lot more
error-prone work to handle the special values in all the ufuncs.
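
To make this concrete, here is a rough sketch of an addition inner loop
under each approach. This is not actual NumPy source, and real ufunc inner
loops have a different signature; the function names and the mask
convention (true meaning "valid") are just for illustration:

    #include <cmath>
    #include <cstdint>

    // Sentinel approach: every dtype needs its own special value and
    // its own test. For float64 a NaN can serve as the sentinel
    // (Nathaniel's maybe(float) optimization), but a naive ==
    // comparison against NaN would silently fail, so the check has to
    // be written per-type.
    void sentinel_add_loop(const double* a, const double* b,
                           double* out, std::int64_t n) {
        for (std::int64_t i = 0; i < n; ++i) {
            if (std::isnan(a[i]) || std::isnan(b[i]))
                out[i] = std::nan("");      // propagate the NA
            else
                out[i] = a[i] + b[i];
        }
    }

    // Mask approach: the missing-value logic is the same boolean dance
    // for every dtype, so a new custom dtype gets it for free.
    void masked_add_loop(const double* a, const bool* amask,
                         const double* b, const bool* bmask,
                         double* out, bool* outmask, std::int64_t n) {
        for (std::int64_t i = 0; i < n; ++i) {
            outmask[i] = amask[i] && bmask[i];
            if (outmask[i])
                out[i] = a[i] + b[i];
        }
    }

The sentinel test has to be reimplemented for every dtype, while the mask
logic never looks at the element type at all.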


>
> Nathaniel's point about reduction in storage needed for the mask to 0
> is surely significant if we want numpy to be the best choice for big
> data.
>

The mask will only be there if it's explicitly requested, so it's not taking
anything away from NumPy. And someone dealing with data that large likely
wouldn't always be happy with the particular NA conventions NumPy chooses
for the various primitive data types, so that approach isn't a clear win
either.
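
To illustrate what I mean by NA conventions: for each primitive type, some
bit pattern has to be sacrificed. A purely hypothetical sketch, not anything
NumPy has committed to:

    #include <cmath>
    #include <cstdint>
    #include <limits>

    // float64: reserve NaN as the NA, costing no storage but colliding
    // with code that already uses NaN to mean "invalid result".
    // int32: sacrifice INT32_MIN as the sentinel (R's convention for
    // its integer NA), losing one representable value.
    constexpr std::int32_t NA_INT32 =
        std::numeric_limits<std::int32_t>::min();

    inline bool is_na(std::int32_t x) { return x == NA_INT32; }
    inline bool is_na(double x)       { return std::isnan(x); }

Anyone whose int32 data legitimately contains INT32_MIN is out of luck,
which is why a fixed set of conventions isn't a clear win for everyone.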

> You mention that it would be good to allow masking for any new dtype -
> is that a practical problem?  I mean, how many people will in fact
> have the combination of a) need of masking b) need of custom dtype,
> and c) lack of time or expertise to implement masking for that type?
>

Well, the people who need that right now will probably look at the NumPy C
source code and give up immediately. I'd rather push the system in a
direction that makes things easier for those people, not harder. It should
be possible to define a C++ data type class with overloaded operators, then
say NPY_EXPOSE_DTYPE(MyCustomClass) to wrap those overloaded operators with
NumPy's conventions. If this were done, I suspect many people would create
custom data types.
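
As a rough sketch of the usage I have in mind (NPY_EXPOSE_DTYPE and the
Celsius class are hypothetical, nothing that exists today):

    // A toy C++ data type with overloaded arithmetic operators. The
    // imagined macro would generate the NumPy dtype glue (inner loops,
    // casting, printing) from these operators.
    class Celsius {
    public:
        explicit Celsius(double deg = 0.0) : deg_(deg) {}
        Celsius operator+(Celsius other) const {
            return Celsius(deg_ + other.deg_);
        }
        Celsius operator-(Celsius other) const {
            return Celsius(deg_ - other.deg_);
        }
        bool operator<(Celsius other) const { return deg_ < other.deg_; }
        double degrees() const { return deg_; }
    private:
        double deg_;
    };

    // NPY_EXPOSE_DTYPE(Celsius);   // hypothetical macro, shown only
                                    // to illustrate the intended usage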

-Mark


>
> Thanks a lot for the proposal and the discussion,
>
> Matthew