[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 16:38:26 EDT 2011

Mark Wiebe writes:

>             It should also be possible to accomplish a general
>             solution at the dtype level. We could have a 'dtype
>             factory' used like:  np.zeros(10, dtype=np.maybe(float))
>             where np.maybe(x) returns a new dtype whose storage size
>             is x.itemsize + 1, where the extra byte is used to store
>             missingness information.  (There might be some annoying
>             alignment issues to deal with.) Then for each ufunc we
>             define a handler for the maybe dtype (or add a
>             special-case to the ufunc dispatch machinery) that checks
>             the missingness value and then dispatches to the ordinary
>             ufunc handler for the wrapped dtype.

>         The 'dtype factory' idea builds on the way I've structured
>         datetime as a parameterized type, but the thing that kills it
>         for me is the alignment problems of 'x.itemsize + 1'. Having
>         the mask in a separate memory block is a lot better than
>         having to store 16 bytes for an 8-byte int to preserve the
>         alignment.

>     Yes, but that assumes it is appended to the existing types in the
>     dtype individually instead of the dtype as a whole. The dtype with
>     mask could just indicate a shadow array, an alpha channel if you
>     will, that is essentially what you are already doing but just
>     provide a different place to track it.

> This would seem to change the definition of a dtype - currently it
> represents a contiguous block of memory. It doesn't need to use all of
> that memory, but the dtype conceptually owns it. I kind of like it
> that way, where the whole strides idea, with data being all over
> memory space, belongs to ndarray, not dtype.

I don't have any knowledge of the numpy or ma internals, so this might
well be nonsense.

Increasing the dtype item size would most likely decrease performance
when using big structures, as every element access would then require
more memory bandwidth.
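
A rough way to see the size difference, as a sketch only: the field
names "value" and "missing" are made up for illustration, and np.maybe
itself is not an existing NumPy function. The snippet compares a packed
value-plus-flag record, the same record built with align=True, and a
separate one-byte-per-element mask array. The aligned item size depends
on the platform, so no particular number is claimed for it.

```python
import numpy as np

# Hypothetical layouts; "value" and "missing" are illustrative names only.
fields = [("value", np.int64), ("missing", np.uint8)]
packed = np.dtype(fields)               # no padding: 8 + 1 bytes
aligned = np.dtype(fields, align=True)  # padded to the platform's alignment

# Alternative: keep the mask in its own memory block next to the data.
n = 1000
data = np.zeros(n, dtype=np.int64)
mask = np.zeros(n, dtype=np.uint8)
separate_bytes_per_item = (data.nbytes + mask.nbytes) // n

print("packed itemsize: ", packed.itemsize)
print("aligned itemsize:", aligned.itemsize)
print("separate blocks: ", separate_bytes_per_item, "bytes per element")
```

With the separate block, the data buffer keeps its natural alignment and
the mask adds one byte per element, whereas the aligned record pays for
the padding on every element.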

Why not use structured arrays? (Assuming each struct element indeed has
its own buffer; otherwise it's the same as having a "bigger" dtype.) Then
you can have some "blessed" struct elements, like the mask, which
influence how the array is printed or how other struct elements are
operated on.
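
For what it's worth, here is a minimal sketch of the "blessed" element
idea using a NumPy structured array. The field names are hypothetical,
and the masked sum is only an illustration, not how numpy.ma or the
proposal works. Note that NumPy stores the fields of a structured array
interleaved in one buffer rather than one buffer per field, which the
stride check below makes visible.

```python
import numpy as np

# Hypothetical record layout: "value" holds the data, "mask" is the
# "blessed" field (True means the element is missing).
rec = np.dtype([("value", np.float64), ("mask", np.bool_)])
arr = np.zeros(4, dtype=rec)
arr["value"] = [1.0, 2.0, 3.0, 4.0]
arr["mask"] = [False, True, False, False]

# Both fields are views into the same interleaved buffer: stepping from
# one value to the next skips over the mask byte stored beside it.
value_stride = arr["value"].strides[0]
print(value_stride == rec.itemsize)

# A consumer that honours the blessed field, here a sum that skips the
# masked elements.
valid_sum = arr["value"][~arr["mask"]].sum()
print(valid_sum)
```

So with today's structured arrays this layout behaves like the "bigger"
dtype case; a per-field buffer would need something beyond them.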

Besides, using "blessed" struct elements falls in line with the recent
"_ufunc_wrapper_" proposal.

Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth