Re: [Numpy-discussion] [Cdat-discussion] Arrays containing NaNs

Hi All,
I'm sending a copy of this reply here because I think we could get some good answers. Basically, it was suggested that NaN (and Inf?) be masked automatically when creating a masked array. I'm sure you have already thought about this on this list, and I was curious to know why you decided not to do it, so that I can relay the reasons to our list (sending to both lists came back flagged as spam...).
C.
Hi Stephane,
This is a good suggestion; I'm cc'ing the numpy list on this, because I'm wondering whether it would be a better fit to do it directly at the numpy.ma level. I'm sure they have already thought about this (and about 'inf' values as well), and if they don't do it, there's probably some good reason we haven't thought of yet. So before I go ahead and do it in MV2, I'd like to know the reasons why it's not in numpy.ma; they are probably valid for MVs too.
C.
Stephane Raynaud wrote:
Hi,
how about automatically (or at least optionally) masking all NaN values when creating a MV array?
On Thu, Jul 24, 2008 at 11:43 PM, Arthur M. Greene <amg@iri.columbia.edu> wrote:
Yup, this works. Thanks!
I guess it's time for me to dig deeper into numpy syntax and functions, now that CDAT is using the numpy core for array management...
Best,
Arthur
Charles Doutriaux wrote:
Seems right to me,
Except that the syntax might scare new users a bit :)
C.
Andrew.Dawson@uea.ac.uk wrote:
Hi,
I'm not sure whether what I'm about to suggest is a good idea; perhaps Charles will correct me if it is a bad one for any reason.
Let's say you have a cdms variable called U with NaNs as the missing value. First we can replace the NaNs with 1e20:
U.data[numpy.where(numpy.isnan(U.data))] = 1e20
And remember to set the missing value of the variable appropriately:
U.setMissing(1e20)
I hope that helps, Andrew
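The two steps above can be tried without CDAT installed. The following is only a sketch in plain NumPy: the array `data` is a made-up stand-in for U.data, and numpy.ma.masked_values is used here in place of U.setMissing, which is an assumption about what "setting the missing value" amounts to rather than a description of what cdms does internally.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical stand-in for U.data: floats with NaN marking missing points.
data = np.array([[np.nan, 0.377, 0.346],
                 [0.220, np.nan, 0.191]])

MISSING = 1e20  # the sentinel value suggested in the message above

# Step 1: overwrite the NaNs with the sentinel.
# A boolean index is enough; wrapping it in numpy.where is optional.
filled = data.copy()
filled[np.isnan(filled)] = MISSING

# Step 2: declare the sentinel as the missing value.
# masked_values masks every element close to MISSING and also
# records MISSING as the fill_value of the result.
masked = ma.masked_values(filled, MISSING)

print(masked.mask)    # True where the NaNs used to be
print(masked.mean())  # average over the valid points only
```

Working on a copy keeps the original NaN locations available in `data`, which may matter if they need to be inspected later.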
Hi Arthur,
If I remember correctly, the way I used to do it was:
a = MV2.greater(data, 1.)
b = MV2.less_equal(data, 1.)
c = MV2.logical_not(MV2.logical_or(a, b)) # NaNs fail both tests, so they are the only ones left
data = MV2.masked_where(c, data)
But I believe numpy now has a way to deal with NaN: I think it is numpy.nan_to_num. It replaces NaN with 0, though, so it may not be what you want.
C.
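A NumPy-only sketch of both ideas in this message, using a made-up array `data` (not the poster's variable). It relies on the fact that every ordered comparison against NaN is False, so a NaN is the one value that is neither greater than 1 nor less than or equal to 1; the two tests are therefore combined with logical_or and then negated. The numpy.ma functions are assumed here to behave like their MV2 counterparts.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical data with two NaNs.
data = np.array([0.5, np.nan, 2.0, np.nan, 1.0])

# Comparisons involving NaN are always False; errstate only silences
# the "invalid value" warning that some NumPy versions emit here.
with np.errstate(invalid="ignore"):
    gt = np.greater(data, 1.0)
    le = np.less_equal(data, 1.0)

# A finite value passes exactly one of the two tests; NaN passes neither.
is_nan = np.logical_not(np.logical_or(gt, le))
masked = ma.masked_where(is_nan, data)

# nan_to_num removes the NaNs too, but by turning them into 0.0,
# which then takes part in every later computation.
zeroed = np.nan_to_num(data)

print(masked.mean())  # mean of 0.5, 2.0 and 1.0 only
print(zeroed.mean())  # mean dragged down by the two substituted zeros
```

The difference between the two means is the reason nan_to_num "may not be what you want" for averages.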
Arthur M. Greene wrote:
A typical netcdf file is opened, and the single variable extracted:
fpr=cdms.open('prTS2p1_SEA_allmos.cdf')
pr0=fpr('prcp')
type(pr0)
<class 'cdms2.tvariable.TransientVariable'>
Masked values (indicating ocean in this case) show up here as NaNs.
pr0[0,-15:-5,0]
prcp array([NaN NaN NaN NaN NaN NaN 0.37745094 0.3460784 0.21960783 0.19117641])
So far this is all consistent. A map of the first time step shows the proper land-ocean boundaries, reasonable-looking values, and so on. But there doesn't seem to be any way to mask this array, so, e.g., an 'xy' average can be computed (it comes out all nans). NaN is not equal to anything -- even itself -- so there does not seem to be any condition, among the MV.masked_xxx options, that can be applied as a test. Also, it does not seem possible to compute seasonal averages, anomalies, etc. -- they also produce just NaNs.
The workaround I've come up with -- for now -- is to first generate a new array of identical shape, filled with 1.0E+20. One test I've found that can detect NaNs is numpy.isnan:
isnan(pr0[0,0,0])
True
So it is _possible_ to tediously loop through every value in the old array, testing with isnan, then copying to the new array if the test fails. Then the axes have to be reset...
isnan does not accept array arguments, so one cannot do, e.g.,
prmasked=MV.masked_where(isnan(pr0),pr0)
The element-by-element conversion is quite slow. (I'm still waiting for it to complete, in fact). Any suggestions for dealing with NaN-infested data objects?
Thanks!
AMG
P.S. This is 5.0.0.beta, RHEL4.
*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~* Arthur M. Greene, Ph.D. The International Research Institute for Climate and Society The Earth Institute, Columbia University, Lamont Campus Monell Building, 61 Route 9W, Palisades, NY 10964-8000 USA amg*at*iri-dot-columbia\dot\edu | http://iri.columbia.edu *^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*^*~*
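For what it's worth, numpy.isnan is a ufunc and operates elementwise on whole arrays, so the element-by-element loop should not be needed when the data can be handed to NumPy. A minimal sketch, assuming a small made-up array `pr` in place of pr0 (the real variable is a cdms2 TransientVariable, which is not used here, and which `isnan` the poster imported is not stated in the message):

```python
import numpy as np
import numpy.ma as ma

# Hypothetical (time, lat, lon) array with NaN over the "ocean" points.
pr = np.array([[[np.nan, 0.4], [0.2, np.nan]],
               [[np.nan, 0.6], [0.4, np.nan]]])

# NaN is not equal to anything, itself included, so an equality
# test cannot find it ...
print(np.nan == np.nan)

# ... but numpy.isnan tests a whole array at once.
nan_mask = np.isnan(pr)
pr_masked = ma.masked_where(nan_mask, pr)

# An unmasked average is contaminated by the NaNs.
print(pr.mean())

# The masked 'xy' average for each time step skips them.
# Flattening lat/lon first keeps this working on older NumPy versions.
xy_mean = pr_masked.reshape(pr.shape[0], -1).mean(axis=1)
print(xy_mean)
```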
_______________________________________________
Cdat-discussion mailing list
Cdat-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdat-discussion
-- Stephane Raynaud

Oh, I guess this one's for me...
On Thursday 01 January 1970 04:21:03 Charles Doutriaux wrote:
Basically it was suggested to automatically mask NaN (and Inf ?) when creating ma. I'm sure you already thought of this on this list and was curious to know why you decided not to do it.
Because it's always best to let the user decide what to do with his/her data and not impose anything?
Masking a point doesn't necessarily mean that the point is invalid (in the sense of NaNs/Infs), just that it doesn't satisfy some particular condition. In that sense, masks act as selection tools.
By forcing invalid data to be masked at the creation of an array, you run the risk of tampering with the (potential) physical meaning of the mask you have given as input, and/or of missing the fact that some data are actually invalid when you don't expect them to be.
Let's take an example: I want to analyze sea surface temperatures at the world scale. The data comes as a regular 2D ndarray, with NaNs for missing or invalid data. In a first step, I create a masked array of this data, filtering out the land masses with a predefined geographical mask. The remaining NaNs in the masked array indicate areas where the sensor failed... That's important information I would probably have missed by masking all the NaNs at first...
As Eric F. suggested, you can use numpy.ma.masked_invalid to create a masked array with NaNs/Infs filtered out:
import numpy as np, numpy.ma as ma
x = np.array([1,2,None,4], dtype=float)
x
array([ 1., 2., NaN, 4.])
mx = ma.masked_invalid(x)
mx
masked_array(data = [1.0 2.0 -- 4.0], mask = [False False True False], fill_value=1e+20)
Note that the underlying data still has NaNs/Infs:
mx._data
array([ 1., 2., NaN, 4.])
You can also use the ma.fix_invalid function: it creates a mask where the data is not finite (NaNs/Infs), and sets the corresponding points to fill_value.
mx = ma.fix_invalid(x, fill_value=999)
mx
masked_array(data = [1.0 2.0 -- 4.0], mask = [False False True False], fill_value=1e+20)
mx._data
array([ 1., 2., 999., 4.])
The advantage of the second approach is that you no longer have NaNs/Infs in the underlying data, which speeds things up during computation. The obvious disadvantage is that you no longer know where the data was invalid...
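The sea-surface-temperature example can be sketched as follows. Everything in it is invented for illustration: the 3x4 grid `sst`, the `land` mask and the values are hypothetical, and the point is only to show why masking every NaN up front hides the distinction between "land" and "sensor failed".

```python
import numpy as np
import numpy.ma as ma

# Hypothetical SST grid; NaN marks missing or invalid data.
sst = np.array([[np.nan, 18.2, 18.5, np.nan],
                [np.nan, np.nan, 19.1, 19.4],
                [20.0, 20.3, np.nan, np.nan]])

# Hypothetical predefined geographical mask: True over land.
land = np.array([[True, False, False, False],
                 [True, True, False, False],
                 [False, False, False, True]])

# Step 1: mask the land masses only.
sea = ma.masked_array(sst, mask=land)

# Step 2: NaNs that are NOT under the land mask are ocean points
# where the sensor failed. This is the information worth keeping.
sensor_failed = np.isnan(sst) & ~land
print(np.argwhere(sensor_failed))

# Step 3: once the failures are recorded, mask them as well.
clean = ma.masked_array(sst, mask=land | sensor_failed)
print(clean.mean())

# Masking all invalid values in one go gives the same final mask here,
# but no longer says which points were land and which were failures.
all_at_once = ma.masked_invalid(sst)
print(np.array_equal(all_at_once.mask, clean.mask))
```

Keeping `sensor_failed` as a separate boolean array is one possible way to retain the "where was the data invalid" information that fix_invalid discards.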

Hi Pierre,
Thanks for the answer, I'm cc'ing cdat's discussion list.
It makes sense; that's also the way we develop things here: NEVER assume what the user is going to do with the data, BUT give the user the necessary tools to do what you're assuming he/she wants to do (as simply as possible).
Thanks again for the answer.
C.
participants (2)
- Charles Doutriaux
- Pierre GM