[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Matthew Brett matthew.brett at gmail.com
Fri Jun 24 07:59:49 EDT 2011


Hi,

On Fri, Jun 24, 2011 at 2:32 AM, Nathaniel Smith <njs at pobox.com> wrote:
...
> If we think that the memory overhead for floating point types is too
> high, it would be easy to add a special case where maybe(float) used a
> distinguished NaN instead of a separate boolean. The extra complexity
> would be isolated to the 'maybe' dtype's inner loop functions, and
> transparent to the Python level. (Implementing a similar optimization
> for the masking approach would be really nasty.) This would change the
> overhead comparison to 0% versus 12.5% in favor of the dtype approach.
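
A minimal sketch of the distinguished-NaN idea and of the overhead figures quoted above, in plain Python rather than NumPy internals. The bit pattern `MISSING_BITS` and the helpers `bits_of`, `is_missing`, and `mask_overhead` are assumptions made up for illustration. Nothing in the thread fixes a particular NaN payload, and the sketch assumes the platform keeps NaN payloads intact when a float is stored and loaded.

```python
import struct
from array import array

# Hypothetical bit pattern for the "missing" marker: a quiet NaN with an
# arbitrary payload (0x7A2). This value is an assumption, not a NumPy constant.
MISSING_BITS = 0x7FF80000000007A2


def bits_of(x: float) -> int:
    """Return the raw 64-bit IEEE 754 pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


# The float value that carries the marker pattern.
MISSING = struct.unpack("<d", struct.pack("<Q", MISSING_BITS))[0]


def is_missing(x: float) -> bool:
    """Compare bit patterns, because NaN == NaN is always False."""
    return bits_of(x) == MISSING_BITS


def mask_overhead(value_typecode: str, mask_typecode: str = "b") -> float:
    """Extra memory, as a fraction, for a one-byte-per-element mask."""
    return array(mask_typecode).itemsize / array(value_typecode).itemsize


# An ordinary NaN stays a valid value; only the marker pattern is "missing".
data = array("d", [1.0, MISSING, 3.0, float("nan")])
flags = [is_missing(v) for v in data]
```

With an 8-byte float and a 1-byte boolean per element, `mask_overhead("d")` gives 0.125, which is the 12.5% in the quoted comparison, while the marker-NaN layout needs no extra storage.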

Can I take this chance to ask Mark a bit more about the problems he
sees for dtypes with missing values?  That is, to have dtypes such as

np.float64_with_missing
np.int32_with_missing

I see in your NEP you say 'The trouble with this
approach is that it requires a large amount of special case code in
each data type, and writing a new data type supporting missing data
requires defining a mechanism for a special signal value which may not
be possible in general.'

Just to be clear, you are saying that, for each dtype, there
needs to be some code doing:

missing_value = dtype.missing_value

then, in loops:

if val[here] == missing_value:
    do_something()

and the fact that 'missing_value' could be any type would make the
code more complicated than the current case where the mask is always
bools or something?
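
To make the contrast concrete, here is a rough pure-Python sketch of the two loop styles. The sentinel values and function names are hypothetical and do not come from NumPy or the NEP. It only illustrates how a per-dtype missing value can need its own test, whereas a boolean mask allows one loop for every element type.

```python
import math

# Hypothetical per-dtype sentinels; neither the names nor the values are NumPy's.
INT32_MISSING = -2**31        # gives up the smallest int32 value
FLOAT64_MISSING = math.nan    # in this sketch any NaN counts as missing


def sum_int32_sentinel(values):
    """Sentinel approach for int32: plain equality is enough."""
    return sum(v for v in values if v != INT32_MISSING)


def sum_float64_sentinel(values):
    """Sentinel approach for float64: needs isnan, because NaN != NaN."""
    return sum(v for v in values if not math.isnan(v))


def sum_masked(values, mask):
    """Mask approach: the same loop for any element type; True marks missing."""
    return sum(v for v, m in zip(values, mask) if not m)
```

For example, `sum_int32_sentinel([1, INT32_MISSING, 3])` and `sum_masked([1, 99, 3], [False, True, False])` both give 4, but only the sentinel versions need type-specific code. The float version here also cannot tell a genuine NaN from a missing value, which is one reason a real implementation might prefer a distinguished NaN bit pattern.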

Nathaniel's point about reduction in storage needed for the mask to 0
is surely significant if we want numpy to be the best choice for big
data.

You mention that it would be good to allow masking for any new dtype -
is that a practical problem?  I mean, how many people will in fact
have the combination of a) a need for masking, b) a need for a custom dtype,
and c) a lack of time or expertise to implement masking for that type?

Thanks a lot for the proposal and the discussion,

Matthew


