[Offtopic] Line fitting [was Re: Numpy outlier removal]
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Jan 7 12:58:42 EST 2013
On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote:
> There are sometimes good reasons to get a line of best fit by eye. In
> particular if your data contains clusters that are hard to separate,
> sometimes it's useful to just pick out roughly where you think a line
> through a subset of the data is.
Cherry picking subsets of your data as well as line fitting by eye? Two
wrongs do not make a right.
If you're going to just invent a line based on where you think it should
be, what do you need the data for? Just declare "this is the line I wish
to believe in" and save yourself the time and energy of collecting the
data in the first place. Your conclusion will be no less valid.
How do you distinguish "data contains clusters that are hard to
separate" from "data doesn't fit a line at all"?
Even if the data actually is linear, on what basis could we distinguish
between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit
by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely
subjective judgement can be equally denied on the basis of subjective
judgement.
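As a rough sketch of what an objective basis could look like: ordinary
least squares scores any candidate line by its sum of squared residuals,
so two eyeballed lines can at least be compared against each other and
against the fitted line. The data points below are made up purely for
illustration, and the helper names fit_line and sum_sq_residuals are my
own, not taken from any library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sum_sq_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for this example only.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [4.1, 6.5, 9.8, 12.0, 15.3, 17.6, 21.2, 23.4]

candidates = {
    "your line by eye": (2.5, 3.7),
    "my line by eye": (3.1, 4.1),
    "least squares": fit_line(xs, ys),
}

for name, (slope, intercept) in candidates.items():
    ssr = sum_sq_residuals(xs, ys, slope, intercept)
    print(f"{name}: y = {slope:.2f}x + {intercept:.2f}, SSR = {ssr:.2f}")
```

By construction no straight line can have a smaller sum of squared
residuals than the least squares one, so neither eyeballed line can beat
it on this measure. Least squares is only one possible criterion, but it
is at least one that two people can compute and agree on.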
Anyone can fool themselves into placing a line through a subset of
non-linear data. Or, sadly more often, they *deliberately* cherry pick
fake clusters in order to fool others. Here is a real world example of what
happens when people pick out the data clusters that they like based on
visual inspection:
http://www.skepticalscience.com/images/TempEscalator.gif
And not linear by any means, but related to the cherry picking theme:
http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif
To put it another way, when we fit patterns to data by eye, we can easily
fool ourselves into seeing patterns that aren't there, or missing the
patterns which are there. At best line fitting by eye is prone to honest
errors; at worst, it is open to the most deliberate abuse. We have eyes
and brains that evolved to spot the ripe fruit in trees, not to spot
linear trends in noisy data, and fitting by eye is not safe or
appropriate.
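The escalator effect in the first graph linked above is easy to
reproduce with synthetic data. This is only a sketch under stated
assumptions: the series below is invented (a steady upward trend of 0.1
per step plus a regular 10-step oscillation) and is not the temperature
data behind the graph. Fitting a least squares line to each hand-picked
6-step window gives a falling line every time, even though the fit to
the whole series rises.

```python
import math


def ls_slope(ts, ys):
    """Slope of the ordinary least squares line through (ts, ys)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    stt = sum((t - mean_t) ** 2 for t in ts)
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    return sty / stt


# Synthetic series, invented for illustration: rising trend plus a wiggle.
ts = list(range(60))
ys = [0.1 * t + 1.5 * math.cos(2 * math.pi * t / 10) for t in ts]

# Fit to all of the data.
full_slope = ls_slope(ts, ys)

# Cherry-picked windows: only the falling half of each oscillation.
window_slopes = [
    ls_slope(ts[start:start + 6], ys[start:start + 6])
    for start in range(0, 60, 10)
]

print(f"slope over all data: {full_slope:+.3f}")
for start, slope in zip(range(0, 60, 10), window_slopes):
    print(f"slope over t = {start}..{start + 5}: {slope:+.3f}")
```

Nothing about the windows is dishonest arithmetic; each slope is a
correct fit to the points it was given. The problem is entirely in which
points were chosen.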
--
Steven