[Numpy-discussion] Concepts for masked/missing data
Travis Oliphant
oliphant at enthought.com
Sun Jun 26 01:25:40 EDT 2011
I haven't commented on the mailing list yet because of time pressures, although I have spoken with Mark as often as I can and have encouraged him to pursue his ideas and discuss them with the community. The Numeric Python discussion list has a long history of great dialogue that tries to bring out as many perspectives as possible as we wrestle with improving the code base. It is very encouraging to see that tradition continuing.
Because Enthought was mentioned earlier in this thread, I would like to clarify a few things about my employer and the company's interest. Enthought has been very interested in the development of the NumPy and SciPy stack (and the broader SciPy community) for some time. With its limited resources, Enthought helped significantly to form the SciPy community and continues to sponsor it as much as it can. Many developers who work at Enthought (including me) also have personal interests in the NumPy / SciPy community and codebase that go beyond Enthought's ability to invest directly.
While Enthought has limited resources to invest directly in pursuing it, the company is very interested in improving Python's use as a data analysis environment. Because of that interest, Enthought sponsored a "data-array" summit in May. There is an inScight podcast that summarizes some of the event, which you can listen to at http://inscight.org/2011/05/18/episode_13/. The purpose of the event was to bring together a few people who have been working on different aspects of the problem (particularly the labeled array, or data array, problem). We also wanted to jump-start the activity of our interns and make sure that some of the use cases we have seen during the past several years of client projects were brought to light.
The event was successful in that it generated *a lot* of ideas. Some of these ideas were summarized in notes that are linked to from this Convore thread: https://convore.com/python-scientific-computing/data-array-in-numpy/ One of the major ideas that emerged during the discussion is that NumPy needs to be able to handle missing data in a more integrated way (i.e. there need to be functions that do the "right" thing in the face of missing data). One approach suggested during the discussion was to introduce special NA dtypes.
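As a rough illustration (and not something decided at the summit), here is a minimal sketch of what "doing the right thing" can look like with tools that already exist, using numpy.ma: its reductions skip missing entries and adjust their counts, and the proposed NA dtypes would presumably move similar behavior into the dtype machinery itself:

import numpy as np

data = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[False, True, False, False])

data.sum()    # -> 8.0       (the masked 2.0 is skipped)
data.mean()   # -> 2.666...  (divided by the 3 unmasked values, not 4)
data.count()  # -> 3         (number of non-missing values)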
Mark is one of two interns we have this summer who are tasked, at a high level, with taking what was learned at the summit and implementing critical pieces as their skills and interests allow. I have been talking with them individually to map out specific work targets for the summer. Christopher Jordan-Squires is one of our interns; he is pursuing a PhD in Mathematics at the University of Washington, has a strong interest in statistics, and wants to make Python as easy to use as R for certain statistical workflows. Mark Wiebe is known on this list because of his recent success working on the NumPy code base. As a result of that success, Mark is working on making improvements to NumPy that are seen as most critical to solving some of the problems we keep seeing in our projects (labeled arrays being one of them). We are also very interested in the Pandas project, as it brings a data structure like R's successful DataFrame to Python (and it helps solve some of the problems our clients are seeing). It would be good to make sure that core functionality that Pandas needs is available in NumPy where appropriate.
The date-time work that Mark did was the first piece of "low-hanging" fruit that needed to be finished. The second project Mark is involved with is creating an approach for missing data in NumPy. I suggested the missing-data dtypes (in part because Mark had expressed some concerns about the way dtypes are handled in NumPy, and I would love for the user-defined data-type mechanism and the whole data-type infrastructure to be improved as needed). Mark spent some time thinking about it, felt more comfortable with the masked-array solution, and that is where we are now.
Enthought's main interest remains in seeing how much of the data array can and should be moved into low-level NumPy, as well as in implementing functionality (wherever it may live) that makes data analysis easier and more productive in Python. Again, though, this is something Enthought as a company can only invest limited resources in, and we want to make sure that Mark spends the time we are sponsoring on work that is seen as valuable by the community but, more importantly, matches our own internal needs.
I will post a follow-on message that provides my current views on the subject of missing data and masked arrays.
-Travis
On Jun 25, 2011, at 2:09 PM, Benjamin Root wrote:
>
>
> On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing <efiring at hawaii.edu> wrote:
> > On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
> >> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett<matthew.brett at gmail.com> wrote:
> >>> To clarify, you're proposing for:
> >>>
> >>> a = np.sum(np.array([np.NA, np.NA]))
> >>>
> >>> 1) -> np.NA
> >>> 2) -> 0.0
> >>
> >> Yes -- and in R you actually do get NA, while in numpy.ma you
> >> actually do get 0. I don't think this is a coincidence; I think it's
> >
> > No, you don't:
> >
> > In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
> > Out[2]: masked
> >
> > In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
> > Out[4]: masked
>
> Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but
> sum([NA]) and sum([]) are different? Sounds to me like you should file
> a bug on numpy.ma...
>
> Actually, no... I should have tested this before replying earlier:
>
> >>> a = np.ma.array([2, 4], mask=[True, True])
> >>> a
> masked_array(data = [-- --],
> mask = [ True True],
> fill_value = 999999)
>
> >>> a.sum()
> masked
> >>> a = np.ma.array([], mask=[])
> >>> a
> masked_array(data = [],
> mask = [],
> fill_value = 1e+20)
> >>> a.sum()
> masked
>
> They are the same.
>
>
> Anyway, the general point is that in R, NAs propagate, and in
> numpy.ma, masked values are ignored (except, apparently, if all values
> are masked). Here, I actually checked these:
>
> Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
> R: sum(c(NA, 4)) -> NA
>
>
> If you want NaN behavior, then use NaNs. If you want masked behavior, then use masks.
>
> Ben Root
>
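To make the distinction in the quoted thread concrete, here is a small sketch of the two behaviors side by side (NaN propagation versus mask skipping); the results noted in the comments are from a recent NumPy and the details may vary across versions:

import numpy as np

# NaNs propagate through ordinary reductions, much like R's NA:
np.sum(np.array([1.0, np.nan, 3.0]))       # -> nan
# ...unless skipping is requested explicitly:
np.nansum(np.array([1.0, np.nan, 3.0]))    # -> 4.0

# numpy.ma reductions silently ignore masked values:
np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False]).sum()   # -> 4.0
# Only when every value is masked does the reduction return the masked constant:
np.ma.array([1.0, 2.0], mask=[True, True]).sum()                # -> masked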
---
Travis Oliphant
Enthought, Inc.
oliphant at enthought.com
1-512-536-1057
http://www.enthought.com