Numpy outlier removal

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jan 7 02:46:15 CET 2013


On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:

> I have a dataset that consists of a dict with text descriptions and 
> values that are integers. If required, I collect the values into a list 
> and create a numpy array running it through a simple routine: 
> 
> data[abs(data - mean(data)) < m * std(data)] 
>
> where m is the number of std deviations to include.

I'm not sure that this approach is statistically robust. No, let me be 
even more assertive: I'm sure that this approach is NOT statistically 
robust, and may be scientifically dubious.

The above assumes your data is normally distributed. How sure are you 
that this is actually the case?

For normally distributed data:

Since both the mean and std calculations are affected by the presence of 
outliers, your test for what counts as an outlier will miss outliers even 
for data from a normal distribution. For small N (sample size), it may be 
mathematically impossible for any data point to be more than m*SD from 
the mean. For example, with N=5, no data point can be more than 1.789*SD 
from the mean (using the sample standard deviation). So for N=5, m=1 may 
throw away good data, and m=2 will fail to find any outliers no matter 
how outrageous they are.
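
To make the small-N limit concrete, here is a minimal sketch. It assumes 
the sample standard deviation (ddof=1), for which the largest possible 
distance from the mean is (N-1)/sqrt(N) standard deviations. Note that 
numpy's std() defaults to ddof=0, where the limit is sqrt(N-1) instead 
(2.0 for N=5). The data values and the helper name max_z are made up for 
illustration.

```python
import numpy as np

def max_z(data, ddof=1):
    """Largest distance from the mean, in units of the standard deviation."""
    data = np.asarray(data, dtype=float)
    return np.max(np.abs(data - data.mean())) / data.std(ddof=ddof)

# Four ordinary readings and one wildly wrong one (made-up values).
data = np.array([10.0, 11.0, 9.0, 10.0, 1000.0])

n = len(data)
bound = (n - 1) / np.sqrt(n)   # about 1.789 for n = 5

z = max_z(data)                # just under the bound, despite the huge value

# The filter from the original question, with m = 2 and the sample SD.
m = 2
kept = data[np.abs(data - data.mean()) < m * data.std(ddof=1)]

print(bound, z, kept)          # kept still contains 1000.0
```

Making the bad value larger does not help: it drags the mean and the SD 
along with it, so its distance in SD units stays below the bound.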

For large N, you should expect to find significant numbers of data points 
more than m*SD from the mean. With N=100000 and m=3, you should expect to 
throw away about 270 perfectly good data points simply because they are 
out on the tails of the distribution.
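
The figure of roughly 270 can be checked with a short calculation. This 
is only a sketch, assuming exactly normal data and the true (not 
estimated) mean and SD; the function name expected_beyond is made up for 
illustration.

```python
import math

def expected_beyond(n, m):
    """Expected count of points more than m SDs from the mean in n normal draws."""
    # For a normal distribution, P(|Z| > m) = erfc(m / sqrt(2)).
    return n * math.erfc(m / math.sqrt(2))

count = expected_beyond(100000, 3)
print(round(count))   # roughly 270
```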

Worse, if the data is not in fact from a normal distribution, all bets 
are off. You may be keeping obvious outliers; or more often, your test 
will be throwing away perfectly good data that it misidentifies as 
outliers.
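
As one illustration of how badly the rule can behave on non-normal data, 
the sketch below compares the expected number of discarded points for 
normal data with that for exponentially distributed data. The exponential 
distribution is my choice of example, not something from the original 
question, and the calculation assumes the true mean and SD are known 
rather than estimated from the sample.

```python
import math

n = 100000
m = 3

# Normal data: two-sided tail probability beyond m standard deviations.
p_normal = math.erfc(m / math.sqrt(2))

# Exponential data with rate 1: mean = 1 and SD = 1, so the points more
# than m SDs from the mean are exactly those above 1 + m (the lower side,
# 1 - m, is not reachable when m >= 1 because the data is never negative).
p_expon = math.exp(-(1 + m))

expected_normal = n * p_normal   # roughly 270
expected_expon = n * p_expon     # roughly 1830

print(round(expected_normal), round(expected_expon))
```

Every one of those points is a legitimate draw from the distribution; 
none of them is an outlier in any meaningful sense.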

In other words: this approach for detecting outliers is nothing more than 
a very rough, and very bad, heuristic, and should be avoided.

Identifying outliers is fraught with problems even for experts. For 
example, the ozone hole over the Antarctic was ignored for many years 
because the software being used to analyse it misidentified the data as 
outliers.

The best general advice I have seen is:

Never automatically remove outliers except for values that are physically 
impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"), 
unless you have good, solid, physical reasons for justifying removal of 
outliers. Other than that, manually remove outliers with care, or not at 
all, and if you do so, always report your results twice, once with all 
the data, and once with supposed outliers removed.
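
A minimal sketch of that advice, using made-up test scores out of 20: 
only the physically impossible value is removed automatically, and the 
summary is reported both ways. The data, the MAX_SCORE limit and the 
variable names are all hypothetical.

```python
import numpy as np

# Hypothetical test scores out of 20; a score of 31 cannot be real.
scores = np.array([12, 15, 9, 31, 18, 14, 20, 11], dtype=float)
MAX_SCORE = 20.0

# Remove only values that are impossible, not values that merely look odd.
possible = (scores >= 0) & (scores <= MAX_SCORE)
cleaned = scores[possible]

# Report the result twice: once with all the data, once without the
# impossible values.
mean_all = scores.mean()
mean_cleaned = cleaned.mean()

print("all data:          n=%d mean=%.2f" % (len(scores), mean_all))
print("impossible removed: n=%d mean=%.2f" % (len(cleaned), mean_cleaned))
```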

You can read up more about outlier detection, and the difficulties 
thereof, here:

http://www.medcalc.org/manual/outliers.php

https://secure.graphpad.com/guides/prism/6/statistics/index.htm

http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html

http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations



-- 
Steven


