[SciPy-User] removing multiple occurrences of a specific value (or range of values) from an array
Bruce Southey
bsouthey at gmail.com
Wed Jan 12 10:07:02 EST 2011
On 01/11/2011 04:46 PM, Brennan Williams wrote:
> On 12/01/2011 3:55 a.m., Bruce Southey wrote:
>> On 01/10/2011 08:33 PM, Brennan Williams wrote:
>>> On 11/01/2011 3:15 p.m., josef.pktd at gmail.com wrote:
>>>> On Mon, Jan 10, 2011 at 8:57 PM, Brennan Williams
>>>> <brennan.williams at visualreservoir.com> wrote:
>>>>> On 11/01/2011 2:48 p.m., josef.pktd at gmail.com wrote:
>>>>>> On Mon, Jan 10, 2011 at 8:42 PM, Brennan Williams
>>>>>> <brennan.williams at visualreservoir.com> wrote:
>>>>>>> I have a numpy array and I use .min(), .max(), .std(), average(..),
>>>>>>> median(...) etc to get various stats values.
>>>>>>>
>>>>>>> Depending on where the data originally came from, the array can contain
>>>>>>> a null value which could be 1.0e+20 or similar (can vary from dataset to
>>>>>>> dataset). Due to rounding errors this can sometimes appear as something
>>>>>>> like 1.0000002004e+20 etc etc.
>>>>>>>
>>>>>>> So I want to be able to correctly calculate the stats values by ignoring
>>>>>>> the null values.
>>>>>>>
>>>>>>> I also want to be able to replace the null values with another value
>>>>>>> (for plotting/exporting).
>>>>>>>
>>>>>>> What's the best way to do this without looping over the elements of the
>>>>>>> array?
>>>>>> If you don't have anything large, then you could just do
>>>>>>
>>>>>> x[x>1e19]=np.nan
>>>>>>
>>>>>> or filter them out, or convert to masked array.
>>>>>>
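Along those lines, a minimal sketch of the filter-or-flag approach (the 1e19
cutoff and the -999.0 fill value are just placeholders for whatever suits the
dataset):

import numpy as np

x = np.array([1.5, 2.0, 1.0000002004e+20, 3.5, 1.0e+20])

# keep only the values below the (dataset-dependent) null cutoff
good = x[x < 1e19]
print(good.min(), good.max(), good.mean(), good.std(), np.median(good))

# or flag the nulls as NaN in place, as suggested above
x[x > 1e19] = np.nan

# replace the nulls with another value for plotting/exporting
y = np.where(np.isnan(x), -999.0, x)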
>>>>> the array is usually <10,000 values, often <1,000
>>>>>
>>>>> On a separate note I found that .std() didn't return a valid value when
>>>>> I have a lot of 1.0e+20's in the array. I realise that it is probably a
>>>>> single precision issue and I probably won't need to worry about this in
>>>>> future but I presume I should use .std(dtype=float64) ?
>>>> I'm not sure I understand
>>>> Are you using np.std on the array with missing values still in there
>>>> encoded with 1e20?
>>>> That wouldn't give the right values.
>>> I was, but that was because I didn't realise the array had bad/missing
>>> values in it. In theory the data coming in should just have been an
>>> array of zeros, but it is inconsistent for some reason I haven't been
>>> able to work out yet.
>>>> my numpy seems to promote automatically (contrary to the docs)
>>>>
>>>>>>> np.arange(5, dtype='float32').std()
>>>> 1.4142135623730951
>>>>>>> np.arange(5, dtype='float32').std().dtype
>>>> dtype('float64')
>>>>>>> np.arange(5, dtype='float32').dtype
>>>> dtype('float32')
>>>>
>>> Hmmm
>>>
>>>
>>> a=np.arange(5,dtype='float32')
>>> b=a*1.0e+20
>>> b.std()
>>> returns inf
>>> whereas
>>> b.std(dtype='float64')
>>> returns 1.4142....e+20
>>>
>>> Brennan
>>>
>> That's because you are not being careful about your floating point
>> precision.
>>
>> Since standard deviation involves squaring, the required intermediate
>> values exceed 1.e+40, which is beyond the range of float32. (On my x86
>> 64-bit Linux system, about 1.e+308 is the limit for float64 and
>> 1.e+4932 is the limit for float128 - these limits do vary across
>> processors and operating systems.)
>>
>> >>> print np.finfo(np.float32)
>> Machine parameters for float32
>> ---------------------------------------------------------------------
>> precision= 6 resolution= 1.0000000e-06
>> machep= -23 eps= 1.1920929e-07
>> negep = -24 epsneg= 5.9604645e-08
>> minexp= -126 tiny= 1.1754944e-38
>> maxexp= 128 max= 3.4028235e+38
>> nexp = 8 min= -max
>> ---------------------------------------------------------------------
>>
>> You can also see this just by doing:
>> >>> b*b
>> Warning: overflow encountered in multiply
>> array([ 0., inf, inf, inf, inf], dtype=float32)
>>
>> Bruce
>>
>>
>>
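To reproduce that overflow, and two ways around it (the array values are
just illustrative):

import numpy as np

b = np.arange(5, dtype=np.float32) * 1.0e+20

print(b.std())                     # inf: the squared deviations overflow float32
print(b.std(dtype=np.float64))     # ~1.4142e+20: accumulate in double precision
print(b.astype(np.float64).std())  # same result: upcast the whole array first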
> My values shouldn't be up in the e+20 range anyway; it's effectively a
> null or indeterminate value. However, that isn't to say that I shouldn't
> code to handle it, which I will now do.
That is why I like masked arrays: they can handle arbitrary values in
that situation - obviously you do need some idea first that these
types of values exist. The other aspect is that numpy uses float64 by
default, so this type of issue is 'hidden' until more extreme cases
occur. Then you can potentially move to a higher precision if the
os/cpu supports it.
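For example, a minimal sketch of the masked-array approach (again, the 1e19
threshold and the -999.0 fill value are arbitrary placeholders):

import numpy as np

x = np.array([1.5, 2.0, 1.0000002004e+20, 3.5, 1.0e+20])

# mask everything above the null threshold for this dataset
xm = np.ma.masked_greater(x, 1e19)

# the statistics ignore the masked entries
print(xm.min(), xm.max(), xm.mean(), xm.std(), np.ma.median(xm))

# fill the masked entries with a replacement value for plotting/exporting
print(xm.filled(-999.0))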
> The slightly confusing thing to me is that .std() returns a float64 but
> if you don't specify .std(dtype='float64') then you run into the
> precision problem. So presumably, internally, std is using 32-bit and
> only converts to 64-bit on output.
I think many of us thought that this was the case, but Keith pointed out
that it is incorrect when the axis argument is not None:
http://mail.scipy.org/pipermail/numpy-discussion/2010-December/054253.html
Thus, currently the output dtype is determined by the input dtype,
the 'dtype' argument and the axis argument.
If the axis is None (the default), then the output dtype will be float64
for all inputs (integers and floats) unless the input has higher
precision than float64 (like float128). When the 'dtype' argument is
given, the output dtype is whichever of float64 and the 'dtype' argument
has the higher precision (so if dtype=np.float128 then the output will
be float128).
Late last year it was found that when the axis argument is specified as
something other than None, numpy does not necessarily upconvert the
input to float64. For example, the output dtype is the same as the input
dtype or the 'dtype' argument, even though in both cases the precision
is less than float64:
>>> np.array([[1, 2], [3, 4]], dtype=np.float32).mean(axis=0).dtype
dtype('float32')
>>> np.array([[1, 2], [3, 4]], dtype=np.int8).mean(axis=0, dtype=np.float32).dtype
dtype('float32')
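So, as a sketch of the workaround when reducing along an axis, either pass
the dtype explicitly or upcast the input first:

>>> a = np.array([[1, 2], [3, 4]], dtype=np.float32)
>>> a.mean(axis=0).dtype                  # stays float32 when axis is given
dtype('float32')
>>> a.mean(axis=0, dtype=np.float64).dtype
dtype('float64')
>>> a.astype(np.float64).mean(axis=0).dtype
dtype('float64')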
> Either way it's fairly obvious what
> is going on and easy to sort out once you (as in me) realise the mistake.
>
> Thanks for the tip about finfo
>
> Brennan
FYI: iinfo is the integer version.
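For instance:

>>> np.iinfo(np.int32).max
2147483647
>>> np.iinfo(np.int8).min, np.iinfo(np.int8).max
(-128, 127)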
Bruce