<br><br><div class="gmail_quote">On Sat, Jun 25, 2011 at 8:44 AM, Wes McKinney <span dir="ltr"><<a href="mailto:wesmckinn@gmail.com">wesmckinn@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

On Sat, Jun 25, 2011 at 10:25 AM, Charles R Harris<br>

<div><div></div><div class="h5"><<a href="mailto:charlesr.harris@gmail.com">charlesr.harris@gmail.com</a>> wrote:<br>

><br>

><br>

> On Sat, Jun 25, 2011 at 8:14 AM, Wes McKinney <<a href="mailto:wesmckinn@gmail.com">wesmckinn@gmail.com</a>> wrote:<br>

>><br>

>> On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris<br>

>> <<a href="mailto:charlesr.harris@gmail.com">charlesr.harris@gmail.com</a>> wrote:<br>

>> ><br>

>> ><br>

>> > On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney <<a href="mailto:wesmckinn@gmail.com">wesmckinn@gmail.com</a>><br>

>> > wrote:<br>

>> >><br>

>> >> On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith <<a href="mailto:njs@pobox.com">njs@pobox.com</a>><br>

>> >> wrote:<br>

>> >> > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root <<a href="mailto:ben.root@ou.edu">ben.root@ou.edu</a>><br>

>> >> > wrote:<br>

>> >> >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith <<a href="mailto:njs@pobox.com">njs@pobox.com</a>><br>

>> >> >> wrote:<br>

>> >> >>> This is a situation where I would just... use an array and a mask,<br>

>> >> >>> rather than a masked array. Then lots of things -- changing fill<br>

>> >> >>> values, temporarily masking/unmasking things, etc. -- come for<br>

>> >> >>> free,<br>

>> >> >>> just from knowing how arrays and boolean indexing work?<br>

>> >> >><br>

>> >> >> With a masked array, it is "for free".  Why re-invent the wheel?  It<br>

>> >> >> has<br>

>> >> >> already been done for me.<br>

>> >> ><br>

>> >> > But it's not for free at all. It's an additional concept that has to<br>

>> >> > be maintained, documented, and learned (with the last cost, which is<br>

>> >> > multiplied by the number of users, being by far the greatest). It's<br>

>> >> > not reinventing the wheel, it's saying hey, I have wheels and axles,<br>

>> >> > but what I really need the library to provide is a wheel+axle<br>

>> >> > assembly!<br>

>> >><br>

>> >> You're communicating my argument better than I am.<br>

>> >><br>

>> >> >>> Do we really get much advantage by building all these complex<br>

>> >> >>> operations in? I worry that we're trying to anticipate and write<br>

>> >> >>> code<br>

>> >> >>> for every situation that users find themselves in, instead of just<br>

>> >> >>> giving them some simple, orthogonal tools.<br>

>> >> >>><br>

>> >> >><br>

>> >> >> This is the danger, and which is why I advocate retaining the<br>

>> >> >> MaskedArray<br>

>> >> >> type that would provide the high-level "intelligent" operations,<br>

>> >> >> meanwhile<br>

>> >> >> having in the core the basic data structures for  pairing a mask<br>

>> >> >> with<br>

>> >> >> an<br>

>> >> >> array, and to recognize a special np.NA value that would act upon<br>

>> >> >> the<br>

>> >> >> mask<br>

>> >> >> rather than the underlying data.  Users would get very basic<br>

>> >> >> functionality,<br>

>> >> >> while the MaskedArray would continue to provide the interface that<br>

>> >> >> we<br>

>> >> >> are<br>

>> >> >> used to.<br>

>> >> ><br>

>> >> > The interface as described is quite different... in particular, all<br>

>> >> > aggregate operations would change their behavior.<br>

>> >> ><br>

>> >> >>> As a corollary, I worry that learning and keeping track of how<br>

>> >> >>> masked<br>

>> >> >>> arrays work is more hassle than just ignoring them and writing the<br>

>> >> >>> necessary code by hand as needed. Certainly I can imagine that *if<br>

>> >> >>> the<br>

>> >> >>> mask is a property of the data* then it's useful to have tools to<br>

>> >> >>> keep<br>

>> >> >>> it aligned with the data through indexing and such. But some of<br>

>> >> >>> these<br>

>> >> >>> other things are quicker to reimplement than to look up the docs<br>

>> >> >>> for,<br>

>> >> >>> and the reimplementation is easier to read, at least for me...<br>

>> >> >><br>

>> >> >> What you are advocating is similar to the "tried-n-true" coding<br>

>> >> >> practice of<br>

>> >> >> Matlab users of using NaNs.  You will hear from Matlab programmers<br>

>> >> >> about how<br>

>> >> >> it is the greatest idea since sliced bread (and I was one of them).<br>

>> >> >> Then I<br>

>> >> >> was introduced to Numpy, and while I do sometimes still do the NaN<br>

>> >> >> approach, I realized that the masked array is a "better" way.<br>

>> >> ><br>

>> >> > Hey, no need to go around calling people Matlab programmers, you<br>

>> >> > might<br>

>> >> > hurt someone's feelings.<br>

>> >> ><br>

>> >> > But seriously, my argument is that every abstraction and new concept<br>

>> >> > has a cost, and I'm dubious that the full masked array abstraction<br>

>> >> > carries its weight and justifies this cost, because it's highly<br>

>> >> > redundant with existing abstractions. That has nothing to do with how<br>

>> >> > tried-and-true anything is.<br>

>> >><br>

>> >> +1. I think I will personally only be happy if "masked array" can be<br>

>> >> implemented while incurring near-zero cost from the end user<br>

>> >> perspective. If what we end up with is a faster implementation of<br>

>> >> <a href="http://numpy.ma" target="_blank">numpy.ma</a> in C I'm probably going to keep on using NaN... That's why<br>

>> >> I'm entirely insistent that whatever design be dogfooded on non-expert<br>

>> >> users. If it's very much harder / trickier / nuanced than R, you will<br>

>> >> have failed.<br>

>> >><br>

>> ><br>

>> > This sounds unduly pessimistic to me. It's one thing to suggest<br>

>> > different<br>

>> > approaches, another to cry doom and threaten to go eat worms. And all<br>

>> > before<br>

>> > the code is written, benchmarks run, or trial made of the usefulness of<br>

>> > the<br>

>> > approach. Let us see how things look as they get worked out. Mark has a<br>

>> > good<br>

>> > track record for innovative tools and I'm rather curious myself to see<br>

>> > what<br>

>> > the result is.<br>

>> ><br>

>> > Chuck<br>

>> ><br>

>> ><br>

>> > _______________________________________________<br>

>> > NumPy-Discussion mailing list<br>

>> > <a href="mailto:NumPy-Discussion@scipy.org">NumPy-Discussion@scipy.org</a><br>

>> > <a href="http://mail.scipy.org/mailman/listinfo/numpy-discussion" target="_blank">http://mail.scipy.org/mailman/listinfo/numpy-discussion</a><br>

>> ><br>

>> ><br>

>><br>

>> I hope you're right. So far it seems that anyone who has spent real<br>

>> time with R (e.g. myself, Nathaniel) has expressed serious concerns<br>

>> about the masked approach. And we got into this discussion at the Data<br>

>> Array summit in Austin last month because we're trying to make Python<br>

>> more competitive with R vis-a-vis statistical and financial applications.<br>

>> I'm just trying to be (R)ealistic =P Remember that I very earnestly am<br>

>> doing everything I can these days to make scientific Python more<br>

>> successful in finance and statistics. One big difference with R's<br>

>> approach is that we care more about performance than the R community<br>

>> does. So maybe having special NA values will be prohibitive for that<br>

>> reason.<br>

>><br>

>> Mark indeed has a fantastic track record and I've been extremely<br>

>> impressed with his NumPy work, so I've no doubt he'll do a good job. I<br>

>> just hope that you don't push aside my input-- my opinions are formed<br>

>> entirely based on my domain experience.<br>

>><br>

><br>

> I think what we really need to see are the use cases and work flow. The ones<br>

> that hadn't occurred to me before were memory mapped files and data stored<br>

> on disk in general. I think we may need some standard format for masked data<br>

> on disk if we don't go the NA value route.<br>

><br>

> Chuck<br>

><br>

><br>


><br>

><br>

<br>

</div></div>Here are some things I can think of that would be affected by any changes here:<br>

<br>

1) Right now users of pandas can type pandas.isnull(series[5]) and<br>

that will yield True if the value is NA for any dtype. This might be<br>

hard to support in the masked regime.<br>

2) Functions like {Series, DataFrame}.fillna would hopefully look just<br>

like this:<br>

<br>

# value is 0 or some other value to fill<br>

new_series = self.copy()<br>

new_series[isnull(new_series)] = value<br>

<br>

Keep in mind that people will write custom NA handling logic. So they might do:<br>

<br>

series[isnull(other_series) & isnull(other_series2)] = val<br>

<br>

3) Nulling / NA-ing out data is very common:<br>

<br>

# null out this data up to and including date1 in these three columns<br>

frame.ix[:date1, [col1, col2, col3]] = NaN<br>

<br>

# But this should work fine too<br>

frame.ix[:date1, [col1, col2, col3]] = 0<br>
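<br>

The patterns in 2) and 3) can be sketched in runnable form with plain NumPy float arrays. This is only an illustration, assuming NaN is the NA marker: the isnull and fillna functions below are hypothetical stand-ins rather than the pandas functions, and positional slicing stands in for the label-based frame.ix selection.<br>

```python
import numpy as np

def isnull(arr):
    # Hypothetical stand-in for pandas.isnull; valid only for float
    # arrays where NaN is the NA marker (an assumption of this sketch).
    return np.isnan(arr)

def fillna(arr, value):
    # Copy first so the caller's array is untouched, then overwrite NAs.
    filled = arr.copy()
    filled[isnull(filled)] = value
    return filled

# Made-up sample data for illustration.
series = np.array([1.0, np.nan, 3.0, np.nan])
other = np.array([np.nan, np.nan, 1.0, 2.0])
other2 = np.array([np.nan, 5.0, np.nan, np.nan])

filled = fillna(series, 0.0)

# Custom NA logic: overwrite positions where both other arrays are NA.
custom = series.copy()
custom[isnull(other) & isnull(other2)] = -1.0

# Nulling out a leading slice, analogous to frame.ix[:date1, ...] = NaN.
nulled = np.arange(5, dtype=float)
nulled[:3] = np.nan
```

Each operation here is ordinary boolean indexing plus assignment; nothing beyond the NaN convention has to be learned.<br>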

<br>

I'll try to think of some others. The main thing is that the NA value<br>

is very easy to think about and fits in naturally with how people (at<br>

least statistical / financial users) think about and work with data.<br>

If you have to say "I have to set these mask locations to True" it<br>

introduces additional mental effort compared with "I'll just set these<br>

values to NA".<br>
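<br>

For comparison, the two mental models can be sketched side by side using the existing numpy.ma module. This is a rough illustration under assumptions, not the new design being discussed in this thread: the sample data are made up, and numpy.ma stands in for whatever mask-based interface is eventually chosen.<br>

```python
import numpy as np

# Made-up sample data for illustration.
values = np.array([1.0, 2.0, 3.0, 4.0])

# NA-value style: the missing marker overwrites the data itself.
na_style = values.copy()
na_style[1:3] = np.nan

# Mask style: assigning np.ma.masked sets the mask and leaves the data alone.
masked = np.ma.masked_array(values.copy(), mask=False)
masked[1:3] = np.ma.masked

# Aggregates skip masked entries, whereas NaN propagates through a plain sum.
masked_total = masked.sum()
nan_total = na_style.sum()

# The underlying data survive masking, so clearing the mask restores them.
# The NaN-overwritten values are gone unless a copy was kept.
data_under_mask = masked.data.copy()
masked.mask = False
restored = masked.filled(0.0)
```

The trade-off in this sketch is that the mask preserves the hidden values and changes how aggregates behave, while the NaN version needs no extra object but destroys the original entries.<br>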

<div><div></div><div class="h5">_</div></div></blockquote><div><br>You should take a look at the current <a href="http://tinyurl.com/5ueamks">NEP</a>. You don't have to deal with the mask, you just need to assign np.NA to the array location, like in R (as I understand it). The masks are for the most part transparent to the user.<br>

<br>Chuck <br></div></div>