Re: NumPy and None (null, NaN, missing)
"TimC" == gestaltsystemdiscussadmin gestaltsystemdiscussadmin@lists.sourceforge.net writes:
TimC> Date: Sun, 09 Apr 2000 01:07:13 +1000
TimC> From: Tim Churches tchur@bigpond.com
TimC> Organization: Gestalt Institute
TimC> To: strang@nmr.mgh.harvard.edu, strang@bucky.nmr.mgh.harvard.edu,
TimC>     gestaltsystemdiscuss@lists.sourceforge.net,
TimC>     numpydiscussion@lists.sourceforge.net
TimC> I'm a new user of NumPy, so forgive me if this is a FAQ. ......
TimC> I've been experimenting with using Gary Strangman's excellent stats.py
TimC> functions. The speed of these functions when operating on NumPy arrays
TimC> and the ability of NumPy to swallow very large arrays are remarkable.
TimC> However, one deficiency I have noticed is the lack of the ability
TimC> to represent nulls (i.e. missing values, None or NaN
TimC> [Not-a-Number]) in NumPy arrays. Missing values commonly occur in
TimC> real-life statistical data and although they are usually excluded
TimC> from most statistical calculations, it is important to be able to
TimC> keep track of the number of missing data elements and report
TimC> this.
I'm just a recent "listener" on gestaltsystemdiscuss, and don't even have any Python experience. I'm a member of the R core team (www.r-project.org).
In R (and even in S-plus, but almost invisibly there), we differentiate between "NA" (missing / not available) and "NaN" (IEEE result of 0/0, etc).
I'd very much like to see the two kept distinct, as they are in R. I think our implementation of these is quite efficient: NA is implemented as one particular bit pattern out of the whole set of possible NaN values.
We use code like the following (R source, src/main/arithmetic.c):
    static double R_ValueOfNA(void)
    {
        ieee_double x;
        x.word[hw] = 0x7ff00000;
        x.word[lw] = 1954;
        return x.value;
    }

    int R_IsNA(double x)
    {
        if (isnan(x)) {
            ieee_double y;
            y.value = x;
            return (y.word[lw] == 1954);
        }
        return 0;
    }
Martin Maechler maechler@stat.math.ethz.ch http://stat.ethz.ch/~maechler/
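For readers following along from the Python side, the NA-as-NaN-payload idea in the R snippet above can be sketched roughly as follows. This is only an illustration, not R's or NumPy's actual code: the names make_na, is_na and NA_LOW_WORD are made up here, and the sketch assumes IEEE 754 doubles, with the standard struct module standing in for R's ieee_double union.

```python
import math
import struct

# Payload stored in the low 32 bits, as in the R snippet (assumed name).
NA_LOW_WORD = 1954


def make_na():
    """Build a NaN with high word 0x7ff00000 and low word 1954."""
    bits = (0x7FF00000 << 32) | NA_LOW_WORD
    # Reinterpret the 64-bit integer pattern as a double.
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def is_na(x):
    """Return True only for a NaN whose low 32 bits equal 1954."""
    if not math.isnan(x):
        return False
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    # Only the low word is checked, mirroring R_IsNA.
    return (bits & 0xFFFFFFFF) == NA_LOW_WORD


NA = make_na()
print(math.isnan(NA), is_na(NA))   # NA is still a NaN as far as isnan() goes
print(is_na(float("nan")))         # an ordinary NaN does not carry the payload
```

Because NA is still a NaN, code that only knows about isnan() keeps working, while code that cares about the distinction can inspect the payload.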
TimC> Because NumPy arrays can't represent missing data via a
TimC> special value, it is necessary to exclude missing data elements
TimC> from NumPy arrays and keep track of them elsewhere (in standard
TimC> Python lists). This is messy. Also, it is quite common to use
TimC> various imputation techniques to estimate the values of missing
TimC> data elements - the ability to represent missing data in a NumPy
TimC> array and then change it to an imputed value would be a real
TimC> boon.
I have sent this out before, but here it is again: a beta of a missing-observation class. Please help me refine and complete it. I intend to add it to the numpy distribution, since this facility is much requested. MAtest.py shows how to use it. The intention is that you use it the same way you use a Numeric array, and that if there are no masked values there isn't much overhead.
The basic concept is that each MA holds an array and a mask that indicates which values of the array are valid. Note the change in semantics for indexing shown below.
Later I imagine creating a compiled extension class for bit masks to improve the space and time efficiency.
Paul
    # Note copy semantics here differ from Numeric
    def __getitem__(self, i):
        m = self.__mask
        if m is None:
            return Numeric.array(self.__data[i])
        else:
            return MA(Numeric.array(self.__data[i]), Numeric.array(m[i]))

    def __getslice__(self, i, j):
        m = self.__mask
        if m is None:
            return Numeric.array(self.__data[i:j])
        else:
            return MA(Numeric.array(self.__data[i:j]), Numeric.array(m[i:j]))
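To make the array-plus-mask idea concrete, here is a small pure-Python sketch of the concept. It is not the MA class itself: MaskedList and its methods are hypothetical names, plain lists stand in for Numeric arrays, and the mask is assumed to hold True where a value is valid, following the description above. It covers the things Tim asked for (counting missing values, excluding them from a statistic, imputing them) plus the copy-on-slice behaviour of the methods shown.

```python
class MaskedList:
    """Hypothetical stand-in for the MA idea: values plus a validity mask."""

    def __init__(self, data, mask=None):
        self.data = list(data)
        # mask[i] is True where data[i] is valid; None means nothing is masked.
        self.mask = None if mask is None else list(mask)
        if self.mask is not None and len(self.mask) != len(self.data):
            raise ValueError("data and mask must have the same length")

    def count_missing(self):
        """Number of masked (missing) elements."""
        if self.mask is None:
            return 0
        return sum(1 for ok in self.mask if not ok)

    def valid_values(self):
        """Plain list of the values that are not masked."""
        if self.mask is None:
            return list(self.data)
        return [x for x, ok in zip(self.data, self.mask) if ok]

    def mean(self):
        """Mean over valid values only (fails if every value is masked)."""
        vals = self.valid_values()
        return sum(vals) / len(vals)

    def filled(self, value):
        """Impute: plain list with each missing entry replaced by value."""
        if self.mask is None:
            return list(self.data)
        return [x if ok else value for x, ok in zip(self.data, self.mask)]

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slices are independent copies, as in the methods quoted above.
            m = None if self.mask is None else self.mask[key]
            return MaskedList(self.data[key], m)
        if self.mask is not None and not self.mask[key]:
            return None  # a masked scalar is reported as missing
        return self.data[key]


# The 0.0 entries are placeholders hidden by the mask.
ages = MaskedList([34.0, 0.0, 51.0, 0.0, 29.0],
                  [True, False, True, False, True])
print(ages.count_missing())        # how many values are missing
print(ages.mean())                 # mean of the valid values only
print(ages.filled(ages.mean()))    # mean imputation
```

Keeping the mask as None when nothing is masked is what keeps the no-missing-data case cheap, which matches the stated intention that unmasked use should carry little overhead.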
participants (2)

Martin Maechler

Paul F. Dubois