[SciPy-User] removing multiple occurrences of a specific value (or range of values) from an array

Brennan Williams brennan.williams at visualreservoir.com
Tue Jan 11 17:46:57 EST 2011


On 12/01/2011 3:55 a.m., Bruce Southey wrote:
> On 01/10/2011 08:33 PM, Brennan Williams wrote:
>> On 11/01/2011 3:15 p.m., josef.pktd at gmail.com wrote:
>>> On Mon, Jan 10, 2011 at 8:57 PM, Brennan Williams
>>> <brennan.williams at visualreservoir.com> wrote:
>>>> On 11/01/2011 2:48 p.m., josef.pktd at gmail.com wrote:
>>>>> On Mon, Jan 10, 2011 at 8:42 PM, Brennan Williams
>>>>> <brennan.williams at visualreservoir.com> wrote:
>>>>>> I have a numpy array and I use .min(), .max(), .std(), average(..),
>>>>>> median(...) etc to get various stats values.
>>>>>>
>>>>>> Depending on where the data originally came from, the array can contain
>>>>>> a null value which could be 1.0e+20 or similar (can vary from dataset to
>>>>>> dataset). Due to rounding errors this can sometimes appear as something
>>>>>> like 1.0000002004e+20 etc etc.
>>>>>>
>>>>>> So I want to be able to correctly calculate the stats values by ignoring
>>>>>> the null values.
>>>>>>
>>>>>> I also want to be able to replace the null values with another value
>>>>>> (for plotting/exporting).
>>>>>>
>>>>>> What's the best way to do this without looping over the elements of the
>>>>>> array?
>>>>> If you don't have anything large, then you could just do
>>>>>
>>>>> x[x>1e19]=np.nan
>>>>>
>>>>> or filter them out, or convert to masked array.
>>>>>
>>>> the array is usually <10,000 values, often <1000
>>>>
>>>> On a separate note I found that .std() didn't return a valid value when
>>>> I have a lot of 1.0e+20's in the array. I realise that it is probably a
>>>> single precision issue and I probably won't need to worry about this in
>>>> future, but I presume I should use .std(dtype=np.float64)?
>>> I'm not sure I understand
>>> Are you using np.std on the array with missing values still in there
>>> encoded with 1e20?
>>> That wouldn't give the right values.
>> I was, but that was because I didn't realise the array had bad/missing
>> values in it. In theory the data coming in should just have been an
>> array of zeros but it is inconsistent, for some reason I haven't been
>> able to work out yet.
>>> my numpy seems to promote automatically (contrary to the docs)
>>>
>>> >>> np.arange(5, dtype='float32').std()
>>> 1.4142135623730951
>>> >>> np.arange(5, dtype='float32').std().dtype
>>> dtype('float64')
>>> >>> np.arange(5, dtype='float32').dtype
>>> dtype('float32')
>>>
>> Hmmm
>>
>>
>> a = np.arange(5, dtype='float32')
>> b=a*1.0e+20
>> b.std()
>> returns inf
>> whereas
>> b.std(dtype='float64')
>> returns 1.4142....e+20
>>
>> Brennan
>>
> That's because you are not being careful about your floating point
> precision.
>
> Since standard deviation involves squaring, the required intermediate
> precision is over 1e+40, which exceeds 32-bit precision. (On my x86
> 64-bit Linux system, 1e+308 is about the limit for 64-bit floats and
> 1e+4932 is the limit for float128 - these limits do vary across
> processors and operating systems.)
>
>   >>>  print np.finfo(np.float32)
> Machine parameters for float32
> ---------------------------------------------------------------------
> precision=  6   resolution= 1.0000000e-06
> machep=   -23   eps=        1.1920929e-07
> negep =   -24   epsneg=     5.9604645e-08
> minexp=  -126   tiny=       1.1754944e-38
> maxexp=   128   max=        3.4028235e+38
> nexp  =     8   min=        -max
> ---------------------------------------------------------------------
>
> You can also see this just by doing:
>   >>>  b*b
> Warning: overflow encountered in multiply
> array([  0.,  inf,  inf,  inf,  inf], dtype=float32)
>
> Bruce
>
>
>
My values shouldn't be up in the e+20 range anyway; anything up there is 
effectively a null or indeterminate value. That isn't to say I shouldn't 
code defensively to handle it, which I will now do.
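
For the archives, here's a minimal sketch of what I'm planning, based on 
Josef's suggestion above (the 1e19 cutoff and the 0.0 fill value are just 
assumptions that suit my datasets):

import numpy as np

x = np.array([1.0, 2.0, 1.0000002004e+20, 3.0, 1.0e+20], dtype='float32')

# mask anything that looks like a null marker (threshold is dataset-specific)
xm = np.ma.masked_greater(x, 1.0e+19)

# stats now ignore the masked (null) entries
print xm.min(), xm.max(), xm.mean(), xm.std()
print np.ma.median(xm)

# for plotting/exporting, replace the nulls with a chosen fill value
x_out = xm.filled(0.0)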
The slightly confusing thing to me is that .std() returns a float64, but 
if you don't specify .std(dtype='float64') you still run into the 
precision problem. So presumably std accumulates internally in 32-bit and 
only converts to 64-bit on output. Either way, it's fairly obvious what is 
going on and easy to sort out once you (as in me) realise the mistake.
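
To summarise the fix for anyone finding this in the archives (my own 
recap; the astype() variant is an alternative I'll probably use when 
computing several stats on the same array):

import numpy as np

b = np.arange(5, dtype='float32') * 1.0e+20

print b.std()                     # inf - the float32 intermediates overflow
print b.std(dtype=np.float64)     # 1.414...e+20 - 64-bit accumulator
print b.astype(np.float64).std()  # same result; casts the data once up front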

Thanks for the tip about finfo
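
As an aside, finfo also makes it easy to see why the e+20 values blow up 
when squared in float32 (my own quick check, not from the thread above):

import numpy as np

info = np.finfo(np.float32)
print info.max           # 3.4028235e+38
print np.sqrt(info.max)  # ~1.84e+19, so squaring anything above this overflows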

Brennan
