[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Wed Jun 29 14:04:18 EDT 2011

On Wed, Jun 29, 2011 at 11:53 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:

> On Tue, Jun 28, 2011 at 7:34 AM, Lluís <xscript at gmx.net> wrote:
>
>> Mark Wiebe writes:
>> > The design that's forming is a combination of:
>>
>> > * Solve the missing data problem
>> > * My ideas of what a good solution looks like:
>> >    * applies to all NumPy dtypes in a fully general way
>> >    * high-performance, low overhead where possible
>> >    * makes the C-level implementation of NumPy nicer to work with, not harder
>> >    * easy to use from Python for unskilled programmers
>> >    * easy to use more powerful functionality from Python for skilled programmers
>> >    * satisfies all or most of the needs of the many users of arrays with a "missing data" aspect to them
>>
>> I would add here an efficient mechanism to reinterpret existing data with
>> different missing information (no copies of the backing array).
>>
>> Although I'm not sure whether this requires first-class citizenship or
>> not.
>>
>
> I'm calling this idea "masking semantics" generally.
>
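As a rough illustration of the "reinterpret existing data with different missing information" point, here is a minimal sketch using the existing numpy.ma module rather than the mechanism proposed in the NEP. The array contents and the names view_a and view_b are made up for the example. Two masked arrays wrap the same buffer with different masks, and no copy of the data is made.

```python
import numpy as np
import numpy.ma as ma

data = np.array([1.0, 2.0, 3.0, 4.0])

# Two masked views over the same buffer, each with its own mask.
# numpy.ma does not copy the data by default (copy=False).
view_a = ma.masked_array(data, mask=[False, True, False, False])
view_b = ma.masked_array(data, mask=[False, False, False, True])

# Both views share memory with the original array.
shares_a = np.shares_memory(view_a.data, data)
shares_b = np.shares_memory(view_b.data, data)

# Each view skips a different element when reducing.
sum_a = view_a.sum()  # ignores data[1]
sum_b = view_b.sum()  # ignores data[3]
```

Only the mask arrays differ here, so the cost of a second interpretation is one boolean per element rather than a second copy of the data.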
>> > * All the feedback I'm getting from discussions on the list
>> [...]
>> > I've updated a section "Parameterized Data Type With NA Signal Values"
>> > in the NEP with an idea for how an NA bit pattern approach could
>> > coexist and work together with the mask-based approach. I think I've
>> > solved some of the generality and implementation obstacles; it would
>> > be great to get some feedback on that.
>>
>> Some (obvious) thoughts about it:
>>
>> * Trivial to store, as the missing property is encoded in the value
>>  itself.
>> * Third-party (non-Python) code needs some interface to interpret these
>>  without having to know the implementation details (although the
>>  interface is rather trivial).
>> * Data marked as missing loses its original value.
>> * Reinterpreting the same data (memory buffer) with different missing
>>  information requires either memory copies or separate mask arrays (see
>>  above)
>>
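The trade-off in the list above can be sketched with tools that already exist. This is only an analogy: NaN stands in for an NA signal value (the NEP's parameterized NA dtypes are not what is shown), and numpy.ma stands in for the mask-based approach. The values are invented for the example.

```python
import numpy as np
import numpy.ma as ma

original = np.array([1.0, 20.0, 3.0])

# Signal-value approach: the NA marker overwrites the element in place,
# so the value that used to be there is gone.
signalled = original.copy()
signalled[1] = np.nan
signal_mean = np.nanmean(signalled)

# Mask approach: the data buffer is untouched and the mask lives beside it,
# so the hidden value can still be read back.
masked = ma.masked_array(original, mask=[False, True, False])
mask_mean = masked.mean()
recovered = masked.data[1]
```

Both representations give the same mean over the present elements, but only the masked one can return the original 20.0 afterwards.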
>> So, while it (data types with NA signal values) has its advantages in
>> simpler interaction with 3rd party code and in long-term storage,
>> masks will still be needed.
>>
>> I think that deciding on the value of NA signal values boils down to
>> this question: should 3rd party code be able to interpret missing data
>> information stored in the separate mask array?
>>
>
> I'm tossing around some variations of ideas using the iterator to provide a
> buffered mask-based interface that works uniformly with both masked arrays
> and NA dtypes. This way 3rd party C code only needs to implement one missing
> data mechanism to fully support both of NumPy's missing data mechanisms.
>
>
;) Also, it avoids a horrible mass of code.
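To make the "implement one mechanism, support both" idea concrete, here is a small Python-level sketch. It is not the buffered iterator interface Mark describes, which would live at the C level. The helper names as_data_and_mask and sum_present are hypothetical, and NaN is assumed as the stand-in for an NA bit pattern.

```python
import numpy as np
import numpy.ma as ma

def as_data_and_mask(arr):
    """Hypothetical adapter: return (data, missing) for either representation.

    `missing` is a boolean array that is True where an element is absent,
    following the numpy.ma convention.
    """
    if isinstance(arr, ma.MaskedArray):
        # Masked array: the mask is stored separately from the data.
        return arr.data, ma.getmaskarray(arr)
    # Assumed signal-value representation: NaN marks a missing element.
    arr = np.asarray(arr, dtype=float)
    return arr, np.isnan(arr)

def sum_present(arr):
    """Consumer written once against the (data, missing) interface."""
    data, missing = as_data_and_mask(arr)
    return data[~missing].sum()

from_mask = sum_present(ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False]))
from_signal = sum_present(np.array([1.0, np.nan, 3.0]))
```

The consumer never needs to know which representation it was handed, which is the property that would spare third-party code from supporting two mechanisms.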

Chuck