[Numpy-discussion] NA/Missing Data Conference Call Summary

josef.pktd at gmail.com
Wed Jul 6 16:47:36 EDT 2011


On Wed, Jul 6, 2011 at 4:38 PM,  <josef.pktd at gmail.com> wrote:
> On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire
> <cjordan1 at uw.edu> wrote:
>>
>>
>> On Wed, Jul 6, 2011 at 1:08 PM, <josef.pktd at gmail.com> wrote:
>>>
>>> On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire
>>> <cjordan1 at uw.edu> wrote:
>>> >
>>> >
>>> > On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker
>>> > <Chris.Barker at noaa.gov>
>>> > wrote:
>>> >>
>>> >> Christopher Jordan-Squire wrote:
>>> >> > If we follow those rules for IGNORE for all computations, we
>>> >> > sometimes
>>> >> > get some weird output. For example:
>>> >> > [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
>>> >> > multiply and not * with broadcasting.) Or should that sort of
>>> >> > operation
>>> >> > throw an error?
>>> >>
>>> >> That should throw an error -- matrix computation is heavily influenced
>>> >> by the shape and size of matrices, so I think IGNORES really don't make
>>> >> sense there.
>>> >>
>>> >>
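For concreteness, here is a rough sketch of where [ 15, 31 ] comes from under
one reading of the IGNORE rules (IGNORE acting as the identity element of each
operation). The IGNORE object below is just a stand-in for illustration, not
anything in released numpy:

import numpy as np

IGNORE = object()  # hypothetical stand-in; np.IGNORE does not exist in released numpy

def matmul_with_ignore(A, v):
    # Treat x * IGNORE as x (IGNORE behaves like the multiplicative
    # identity), then sum the row as usual.
    out = []
    for row in A:
        total = 0
        for a, b in zip(row, v):
            total += a if b is IGNORE else a * b
        out.append(total)
    return np.array(out)

print(matmul_with_ignore([[1, 2], [3, 4]], [IGNORE, 7]))  # [15 31]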
>>> >
>>> > If the IGNORES don't make sense in basic numpy computations then I'm
>>> > kinda
>>> > confused why they'd be included at the numpy core level.
>>> >
>>> >>
>>> >> Nathaniel Smith wrote:
>>> >> > It's exactly this transparency that worries Matthew and me -- we feel
>>> >> > that the alterNEP preserves it, and the NEP attempts to erase it. In
>>> >> > the NEP, there are two totally different underlying data structures,
>>> >> > but this difference is blurred at the Python level. The idea is that
>>> >> > you shouldn't have to think about which you have, but if you work
>>> >> > with
>>> >> > C/Fortran, then of course you do have to be constantly aware of the
>>> >> > underlying implementation anyway.
>>> >>
>>> >> I don't think this bothers me -- I think it's analogous to things in
>>> >> numpy like Fortran order and non-contiguous arrays -- you can ignore
>>> >> all that when working in pure Python if performance isn't critical, but
>>> >> you need a deeper understanding if you want to work with the data in C
>>> >> or Fortran, or to tune performance in Python.
>>> >>
>>> >> So as long as there is an API to query and control how things work, I
>>> >> like that it's hidden from simple python code.
>>> >>
>>> >> -Chris
>>> >>
>>> >>
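For reference, this is the existing analogue: memory layout is invisible to
plain Python-level code, but numpy already exposes an API to query and control
it when you hand data to C/Fortran or tune performance:

import numpy as np

a = np.arange(6).reshape(2, 3)   # C-contiguous (row-major) by default
f = np.asfortranarray(a)         # same values, Fortran (column-major) layout

assert (a == f).all()            # pure-Python code can't tell the difference

print(a.flags['C_CONTIGUOUS'], a.flags['F_CONTIGUOUS'])  # True False
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # False True

c = np.ascontiguousarray(f)      # force a C-ordered copy before handing to C code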
>>> >
>>> > I'm similarly not too concerned about it. Performance seems finicky when
>>> > you're dealing with missing data, since a lot of arrays will likely have
>>> > to
>>> > be copied over to other arrays containing only complete data before
>>> > being
>>> > handed over to BLAS.
>>>
>>> Unless you know the neutral value for the computation, or you just want
>>> to do a forward_fill in time series, and then you have to ask the user not
>>> to give you an immutable array with NAs if they don't want extra
>>> copies.
>>>
>>> Josef
>>>
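For example, with nan as a stand-in missing-value marker, filling with the
neutral element of the reduction keeps every observation without extra
bookkeeping (just a sketch, not the proposed NA API):

import numpy as np

x = np.array([1.0, np.nan, 3.0])
total = np.where(np.isnan(x), 0.0, x).sum()   # 0 is neutral for a sum     -> 4.0
prod = np.where(np.isnan(x), 1.0, x).prod()   # 1 is neutral for a product -> 3.0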
>>
>> Mean value replacement, or more generally single scalar value replacement,
>> is generally not a good idea. It biases your standard error estimates
>> downward if you use mean replacement, and it biases both the estimates and
>> the standard errors if you use anything other than mean replacement. The
>> bias gets worse with more missing data, so it's worst in precisely the
>> cases where you'd want to fill in the data the most. (Though I admit I'm
>> not too familiar with time series, so maybe this doesn't apply. But it's
>> true as a general principle in statistics.) I'm not sure why we'd want to
>> make this use case easier.
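A quick made-up simulation of that effect: the mean-filled version reports a
smaller standard error simply because n is inflated and the filled values
contribute no variance.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
missing = rng.random(200) < 0.4

observed = x[~missing]
mean_filled = x.copy()
mean_filled[missing] = observed.mean()

def sem(a):
    return a.std(ddof=1) / np.sqrt(a.size)

print(sem(observed))     # honest standard error from the observed cases only
print(sem(mean_filled))  # biased downward after mean imputation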

Another qualification on this (I cannot help it): I think this only
applies if you use a prefabricated no-missing-values algorithm. If I
write it myself, I can make the proper correction for the reduced
number of observations. (Similar to the case where we ignore correlated
information and use statistics based on the assumption of uncorrelated
observations, which also overestimates the amount of information we
have available.)
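i.e. something along these lines, with nan standing in for NA and the count of
actually observed values used for the correction:

import numpy as np

x = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])
nobs = np.count_nonzero(~np.isnan(x))          # 4, not 6
mean = np.nansum(x) / nobs
sem = np.nanstd(x, ddof=1) / np.sqrt(nobs)     # standard error based on nobs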

Josef

>
> We just discussed a use case for pandas on the statsmodels mailing
> list: minute data of stock quotes (prices), where an NA quote is
> filled with the last observed price. If it were necessary for memory
> usage and performance, this could be handled efficiently and with
> minimal copying.
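For example, with nan-marked data a forward fill looks like this in pandas
(made-up quotes, nan marking a minute without a trade):

import numpy as np
import pandas as pd

idx = pd.date_range("2011-07-06 09:30", periods=5, freq="min")
quotes = pd.Series([100.0, np.nan, np.nan, 100.7, np.nan], index=idx)

filled = quotes.ffill()   # carry the last observed price forward
print(filled)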
>
> If you want to fill in a missing value without messing up any resulting
> statistics, then there is a large literature in statistics on
> imputation: repeatedly assigning values to an NA drawn from an underlying
> distribution. scipy/statsmodels doesn't have anything like this (yet),
> but R and the other packages have it available, and it looks especially
> popular in biostatistics.
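A toy sketch of the idea: draw imputations from the empirical distribution of
the observed values and combine the per-imputation results with Rubin's rules
(a real imputation model would condition on covariates):

import numpy as np

def mi_mean(x, m=20, seed=0):
    # x: 1-D array with np.nan marking NA
    rng = np.random.default_rng(seed)
    na = np.isnan(x)
    obs = x[~na]
    means = np.empty(m)
    variances = np.empty(m)
    for i in range(m):
        filled = x.copy()
        filled[na] = rng.choice(obs, size=na.sum(), replace=True)
        means[i] = filled.mean()
        variances[i] = filled.var(ddof=1) / filled.size   # squared std. error
    within = variances.mean()
    between = means.var(ddof=1)
    total_var = within + (1 + 1 / m) * between            # Rubin's rules
    return means.mean(), np.sqrt(total_var)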
>
> (But similar to what Dag said, for statistical analysis it will be
> necessary to keep case-specific masks and data arrays around. I
> haven't actually written any missing-values algorithm yet, so I'm
> quiet again.)
>
> Josef
>
>> -Chris Jordan-Squire
>>
>>>
>>> > My primary concern is that the np.NA stuff 'just
>>> > works'. Especially since I've never run into use cases in statistics
>>> > where
>>> > the difference between IGNORE and NA mattered.
>>> >
>>> >
>>> >>
>>> >>
>>> >> --
>>> >> Christopher Barker, Ph.D.
>>> >> Oceanographer
>>> >>
>>> >> Emergency Response Division
>>> >> NOAA/NOS/OR&R            (206) 526-6959   voice
>>> >> 7600 Sand Point Way NE   (206) 526-6329   fax
>>> >> Seattle, WA  98115       (206) 526-6317   main reception
>>> >>
>>> >> Chris.Barker at noaa.gov
>


