Numpy outlier removal
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Jan 7 00:11:13 EST 2013
On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:
> On 7 January 2013 01:46, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>
>>> I have a dataset that consists of a dict with text descriptions and
>>> values that are integers. If required, I collect the values into a
>>> list and create a numpy array running it through a simple routine:
>>>
>>> data[abs(data - mean(data)) < m * std(data)]
>>>
>>> where m is the number of std deviations to include.
>>
>> I'm not sure that this approach is statistically robust. No, let me be
>> even more assertive: I'm sure that this approach is NOT statistically
>> robust, and may be scientifically dubious.
>
> Whether or not this is "statistically robust" requires more explanation
> about the OP's intention.

Not really. Statistical robustness is objectively defined, and the user's
intention doesn't come into it. The mean is not a robust measure of
central tendency; the median is, regardless of why you pick one or the
other.
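To make that concrete, here is a minimal sketch using made-up numbers (the sample values and the helper name `sigma_filter` are illustrative assumptions, not anything from the OP's data). It compares how far one wild value moves the mean and the median, then applies the mean/std filter quoted above to the same sample.

```python
import numpy as np

# Hypothetical sample: seven ordinary readings plus one wild value.
clean = np.array([10, 11, 9, 10, 12, 10, 11])
data = np.append(clean, 1000)

# How far does a single outlier move each measure of central tendency?
mean_shift = abs(np.mean(data) - np.mean(clean))
median_shift = abs(np.median(data) - np.median(clean))

def sigma_filter(values, m):
    """Keep points within m standard deviations of the mean (the quoted filter)."""
    values = np.asarray(values, dtype=float)
    return values[np.abs(values - np.mean(values)) < m * np.std(values)]

print("shift in mean:  ", mean_shift)
print("shift in median:", median_shift)
print("kept with m=3:  ", sigma_filter(data, 3))
print("kept with m=2:  ", sigma_filter(data, 2))
```

With this sample the outlier inflates both the mean and the standard deviation so much that the m=3 filter keeps it, while m=2 happens to drop it. In a sample of n points, no single point can sit more than sqrt(n - 1) population standard deviations from the mean, so for small samples a large m can never reject anything.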
There are sometimes good reasons for choosing non-robust statistics or
techniques over robust ones, but some techniques are so dodgy that there
is *never* a good reason for using them. E.g. finding the line of best fit
by eye, or taking more and more samples until you get a statistically
significant result. Such techniques are not just non-robust in the
statistical sense, but non-robust in the general sense, if not outright
deceitful.
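The "keep sampling until it's significant" problem can be sketched with a small simulation. Everything here is an assumption chosen for illustration: the data are standard normal with a true mean of zero (so any "significant" result is a false positive), the test is a two-sided z-test with known unit variance at the 5% level, and the sample sizes and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
trials, n_min, n_max = 2000, 10, 200   # arbitrary illustrative settings

# Each row is one experiment drawn under the null hypothesis (true mean 0).
x = rng.standard_normal((trials, n_max))
n = np.arange(1, n_max + 1)

# z statistic for the running mean after each additional sample.
z = np.cumsum(x, axis=1) / np.sqrt(n)

# Honest procedure: decide the sample size in advance, test once at n_max.
fixed_rate = np.mean(np.abs(z[:, -1]) > 1.96)

# Dodgy procedure: test after every sample from n_min on, stop at the first "hit".
peeking_rate = np.mean(np.any(np.abs(z[:, n_min - 1:]) > 1.96, axis=1))

print("false positives, fixed sample size:", fixed_rate)
print("false positives, stop when significant:", peeking_rate)
```

The fixed-size test should come out close to the nominal 5%, while the stop-when-significant rule reports a "result" far more often, even though there is no effect to find.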
--
Steven