Numpy outlier removal
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Jan 6 20:46:15 EST 2013
On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
> I have a dataset that consists of a dict with text descriptions and
> values that are integers. If required, I collect the values into a list,
> create a numpy array, and run it through a simple routine:
>
> data[abs(data - mean(data)) < m * std(data)]
>
> where m is the number of std deviations to include.
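For concreteness, a self-contained version of that routine might look
like this (the function name and the default for m are mine; I use the
sample standard deviation, ddof=1, here and in the examples below,
whereas numpy's std() defaults to the population SD, ddof=0):

    import numpy as np

    def reject_outliers(data, m=2.0):
        # Keep only the points within m sample standard deviations
        # of the mean; everything further out is discarded.
        data = np.asarray(data, dtype=float)
        return data[np.abs(data - data.mean()) < m * data.std(ddof=1)]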
I'm not sure that this approach is statistically robust. No, let me be
even more assertive: I'm sure that this approach is NOT statistically
robust, and may be scientifically dubious.
The above assumes your data is normally distributed. How sure are you
that this is actually the case?
For normally distributed data:
Since both the mean and std calculations are affected by the presence of
outliers, your test for what counts as an outlier is distorted by the
very outliers it is supposed to find, even for data from a normal
distribution. For small N (sample size), it may be mathematically
impossible for any data point to be more than m*SD from the mean: the
largest possible value of |x - mean|/SD in a sample of size N is
(N-1)/sqrt(N), taking SD as the sample standard deviation. For example,
with N=5, no data point can be more than 1.789*SD from the mean. So for
N=5, m=1 may throw away good data, and m=2 will fail to find any
outliers no matter how outrageous they are.
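You can check that bound directly (the helper name is my own):

    import numpy as np

    def max_possible_z(n):
        # Largest possible |x - mean| / SD over any sample of size n,
        # with SD the sample standard deviation (ddof=1).
        return (n - 1) / np.sqrt(n)

    print(max_possible_z(5))        # 1.7888..., as claimed above

    # The bound is attained by a sample with one extreme point:
    data = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
    z = np.abs(data - data.mean()) / data.std(ddof=1)
    print(z.max())                  # 1.7888... again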
For large N, you should expect to find a significant number of data
points more than m*SD from the mean purely by chance. With N=100000 and
m=3, you should expect to throw away about 270 perfectly good data
points simply because they lie out on the tails of the distribution.
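That 270 is just the normal tail probability at work, and needs nothing
beyond the standard library to verify:

    from math import erfc, sqrt

    def expected_beyond(n, m):
        # Expected number of points more than m SDs from the mean in a
        # sample of size n drawn from a genuinely normal distribution.
        return n * erfc(m / sqrt(2))

    print(expected_beyond(100000, 3))   # approximately 270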
Worse, if the data is not in fact from a normal distribution, all bets
are off. You may be keeping obvious outliers; or more often, your test
will be throwing away perfectly good data that it misidentifies as
outliers.
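To see how badly this can go, try the rule on heavy-tailed data. Here
is a sketch using Student's t with 3 degrees of freedom (the seed and
sample size are arbitrary):

    import numpy as np

    rng = np.random.default_rng(42)
    # Perfectly legitimate heavy-tailed data: nothing here is a
    # measurement error, yet the m*SD rule flags plenty of points.
    data = rng.standard_t(df=3, size=10000)
    rejected = np.abs(data - data.mean()) >= 3 * data.std(ddof=1)
    # Typically well over 100 points are rejected, against the ~27 you
    # would expect to flag if the data really were normal.
    print(rejected.sum())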
In other words: this approach for detecting outliers is nothing more than
a very rough, and very bad, heuristic, and should be avoided.
Identifying outliers is fraught with problems even for experts. For
example, the ozone hole over the Antarctic was ignored for many years
because the software being used to analyse it misidentified the data as
outliers.
The best general advice I have seen is:
Never automatically remove outliers, except for values that are
physically impossible (e.g. "baby's weight is 95kg", "test score of 31
out of 20") or where you have good, solid, physical reasons justifying
their removal. Beyond that, remove outliers manually with care, or not
at all, and if you do remove any, always report your results twice:
once with all the data, and once with the supposed outliers removed.
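A small helper for that last piece of advice (the name and output
format are my own invention):

    import numpy as np

    def report_both_ways(data, m=3.0):
        # Summarize the data twice, once complete and once with points
        # beyond m sample SDs removed, so readers can see how much the
        # supposed outliers actually change the result.
        data = np.asarray(data, dtype=float)
        kept = data[np.abs(data - data.mean()) < m * data.std(ddof=1)]
        for label, d in [("all data", data), ("outliers removed", kept)]:
            print("%-17s n=%d  mean=%.3f  std=%.3f"
                  % (label, d.size, d.mean(), d.std(ddof=1)))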
You can read up more about outlier detection, and the difficulties
thereof, here:
http://www.medcalc.org/manual/outliers.php
https://secure.graphpad.com/guides/prism/6/statistics/index.htm
http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html
http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations
--
Steven