Re: NumPy and None (null, NaN, missing)

"TimC" == gestalt-system-discuss-admin <gestalt-system-discuss-admin@lists.sourceforge.net> writes:
TimC> Date: Sun, 09 Apr 2000 01:07:13 +1000
TimC> From: Tim Churches <tchur@bigpond.com>
TimC> Organization: Gestalt Institute
TimC> To: strang@nmr.mgh.harvard.edu, strang@bucky.nmr.mgh.harvard.edu,
TimC>     gestalt-system-discuss@lists.sourceforge.net,
TimC>     numpy-discussion@lists.sourceforge.net

TimC> I'm a new user of NumPy so forgive me if this is a FAQ.
......
TimC> I've been experimenting with using Gary Strangman's excellent stats.py
TimC> functions. The speed of these functions when operating on NumPy arrays
TimC> and the ability of NumPy to swallow very large arrays is remarkable.
TimC> However, one deficiency I have noticed is the lack of the ability
TimC> to represent nulls (i.e. missing values, None or NaN
TimC> [Not-a-Number]) in NumPy arrays. Missing values commonly occur in
TimC> real-life statistical data and although they are usually excluded
TimC> from most statistical calculations, it is important to be able to
TimC> keep track of the number of missing data elements and report
TimC> this.

I'm just a recent "listener" on gestalt-system-discuss, and don't even have any Python experience. I'm a member of the R core team (www.r-project.org).

In R (and even in S-plus, but almost invisibly there), we do differentiate between "NA" (missing / not available) and "NaN" (IEEE result of 0/0, etc). I'd very much like to have these kept distinct, as in R.

I think our implementation of these is quite efficient: NA is one particular bit pattern from the whole set of possible NaN values.
We use code like the following (R source, src/main/arithmetic.c):

    static double R_ValueOfNA(void)
    {
        ieee_double x;
        x.word[hw] = 0x7ff00000;
        x.word[lw] = 1954;
        return x.value;
    }

    int R_IsNA(double x)
    {
        if (isnan(x)) {
            ieee_double y;
            y.value = x;
            return (y.word[lw] == 1954);
        }
        return 0;
    }

Martin Maechler <maechler@stat.math.ethz.ch>
http://stat.ethz.ch/~maechler/

TimC> Because NumPy arrays can't represent missing data via a
TimC> special value, it is necessary to exclude missing data elements
TimC> from NumPy arrays and keep track of them elsewhere (in standard
TimC> Python lists). This is messy. Also, it is quite common to use
TimC> various imputation techniques to estimate the values of missing
TimC> data elements - the ability to represent missing data in a NumPy
TimC> array and then change it to an imputed value would be a real
TimC> boon.
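
For readers on the Python side, the NA encoding in the R snippet above can be sketched with the standard `struct` module. This is only an illustration of the bit pattern (high word 0x7ff00000, low word 1954), not part of R or NumPy; the names `make_na`, `low_word` and `is_na` are made up for this sketch.

```python
import math
import struct

NA_LOW_WORD = 1954  # payload stored in the low 32 bits, as in the R snippet


def make_na() -> float:
    """Build a double whose high word is 0x7ff00000 and low word is 1954."""
    bits = (0x7FF00000 << 32) | NA_LOW_WORD
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def low_word(x: float) -> int:
    """Return the low 32 bits of the IEEE 754 representation of x."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & 0xFFFFFFFF


def is_na(x: float) -> bool:
    """True only for NaNs carrying the NA payload; ordinary NaNs return False."""
    return math.isnan(x) and low_word(x) == NA_LOW_WORD


NA = make_na()
ORDINARY_NAN = float("nan")

print("NA is a NaN:", math.isnan(NA), "| is_na:", is_na(NA))
print("plain NaN is a NaN:", math.isnan(ORDINARY_NAN), "| is_na:", is_na(ORDINARY_NAN))
```

Like the C version, the test looks only at the low word, so it does not depend on whether the hardware sets the "quiet" bit in the high word. Whether the payload survives arithmetic is platform-dependent and is not shown here.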

I have sent this out before but here it is again. It is a beta of a missing-observation class. Please help me refine it and complete it. I intend to add it to the numpy distribution since this facility is much-requested.

MAtest.py shows how to use it.

The intention is that it is used the same way you use a Numeric array, and that if there are no masked values there isn't a lot of overhead.

The basic concept is that each MA holds an array and a mask that indicates which values of the array are valid. Note the change in semantics for indexing shown below.

Later I imagine creating a compiled extension class for bit masks to improve the space and time efficiency.

Paul

    # Note copy semantics here differ from Numeric
    def __getitem__(self, i):
        m = self.__mask
        if m is None:
            return Numeric.array(self.__data[i])
        else:
            return MA(Numeric.array(self.__data[i]), Numeric.array(m[i]))

    def __getslice__(self, i, j):
        m = self.__mask
        if m is None:
            return Numeric.array(self.__data[i:j])
        else:
            return MA(Numeric.array(self.__data[i:j]), Numeric.array(m[i:j]))
    # --------
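
The "array plus mask" concept described above can be sketched in plain Python as follows. This is not the MA class from the attachment: `MaskedSeq` and its methods are hypothetical names, lists stand in for Numeric arrays, and the mask polarity (True means missing) is an assumption made for this sketch. It also shows the two things asked for earlier in the thread: counting missing elements and replacing them with an imputed value.

```python
from typing import List, Optional, Sequence


class MaskedSeq:
    """Hypothetical stand-in for the MA idea: data plus an optional mask."""

    def __init__(self, data: Sequence[float],
                 mask: Optional[Sequence[bool]] = None) -> None:
        self.data = list(data)
        # Assumption: mask[i] is True when data[i] is missing.
        # mask=None means nothing is masked, so no per-element mask is stored.
        self.mask = None if mask is None else list(mask)
        if self.mask is not None and len(self.mask) != len(self.data):
            raise ValueError("data and mask must have the same length")

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slices are copies (list slicing copies), mirroring the copy
            # semantics noted in the original snippet.
            sub_mask = None if self.mask is None else self.mask[index]
            return MaskedSeq(self.data[index], sub_mask)
        if self.mask is not None and self.mask[index]:
            return None  # a single missing element is reported as None
        return self.data[index]

    def count_missing(self) -> int:
        return 0 if self.mask is None else sum(self.mask)

    def valid(self) -> List[float]:
        if self.mask is None:
            return list(self.data)
        return [x for x, missing in zip(self.data, self.mask) if not missing]

    def mean(self) -> float:
        values = self.valid()
        if not values:
            raise ValueError("no valid elements")
        return sum(values) / len(values)

    def impute(self, value: float) -> "MaskedSeq":
        """Return a copy with each missing element replaced by value."""
        if self.mask is None:
            return MaskedSeq(self.data)
        filled = [value if missing else x
                  for x, missing in zip(self.data, self.mask)]
        return MaskedSeq(filled)


# The third observation is missing; its stored value (0.0) is a placeholder.
obs = MaskedSeq([1.0, 2.0, 0.0, 4.0], [False, False, True, False])
print("missing:", obs.count_missing())
print("mean of valid values:", obs.mean())
filled = obs.impute(obs.mean())
print("after mean imputation:", filled.data)
```

Keeping the mask as `None` when nothing is missing is what keeps overhead low in the common case; a compiled bit-mask class, as suggested above, would reduce the cost further when a mask is present.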

participants (2)
- Martin Maechler
- Paul F. Dubois