# [Offtopic] Line fitting [was Re: Numpy outlier removal]

Terry Reedy tjreedy at udel.edu
Tue Jan 8 10:07:08 CET 2013

```
On 1/7/2013 8:23 PM, Steven D'Aprano wrote:
> On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:
>
>> An example: Earlier today I was looking at some experimental data. A
>> simple model of the process underlying the experiment suggests that two
>> variables x and y will vary in direct proportion to one another and the
>> data broadly reflects this. However, at this stage there is some
>> non-normal variability in the data, caused by experimental difficulties.
>> A subset of the data appears to closely follow a well defined linear
>> pattern but there are outliers and the pattern breaks down in an
>> asymmetric way at larger x and y values. At some later time either the
>> sources of experimental variation will be reduced, or they will be
>> better understood but for now it is still useful to estimate the
>> constant of proportionality in order to check whether it seems
>> consistent with the observed values of z. With this particular dataset I
>> would have wasted a lot of time if I had tried to find a computational
>> method to match the line that to me was very visible so I chose the line
>> visually.
>
>
> If you mean:
>
> "I looked at the data, identified that the range a < x < b looks linear
> and the range x > b does not, then used least squares (or some other
> recognised, objective technique for fitting a line) to the data in that
> linear range"
>
> then I'm completely cool with that.

If both x and y are measured values, then regressing x on y and y on x
will give different answers, and both will be wrong in that *neither*
will be the best answer for the relationship between them. Oscar did not
specify whether either was an experimentally set input variable.

> But that is not fitting a line by eye, which is what I am talking about.

With the line constrained to go through 0,0, a line eyeballed with a
clear ruler could easily be better than either regression line, as a
human will tend to minimize the deviations *perpendicular to the line*,
which is the proper thing to do (assuming both variables are measured in
the same units).

--
Terry Jan Reedy

```
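To make the point about the two regressions concrete, here is a minimal sketch in plain Python. It is not from the original thread: the function names are hypothetical, the data points are invented for illustration, and it assumes what the post assumes, namely a line forced through (0, 0) and both variables measured in the same units. It compares the slope from regressing y on x, the slope implied by regressing x on y, and the slope that minimizes the perpendicular deviations.

```python
import math


def sums_of_products(xs, ys):
    """Return (Sxx, Sxy, Syy) taken about the origin (no mean subtraction)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    syy = sum(y * y for y in ys)
    return sxx, sxy, syy


def slope_y_on_x(xs, ys):
    """Slope b of y = b*x minimizing the vertical deviations."""
    sxx, sxy, _ = sums_of_products(xs, ys)
    return sxy / sxx


def slope_x_on_y(xs, ys):
    """Fit x = c*y minimizing horizontal deviations; return dy/dx = 1/c."""
    _, sxy, syy = sums_of_products(xs, ys)
    return syy / sxy


def slope_perpendicular(xs, ys):
    """Slope of the line through the origin minimizing perpendicular deviations.

    This is the direction of the principal eigenvector of
    [[Sxx, Sxy], [Sxy, Syy]]. It requires Sxy != 0.
    """
    sxx, sxy, syy = sums_of_products(xs, ys)
    d = syy - sxx
    return (d + math.hypot(d, 2 * sxy)) / (2 * sxy)


# Invented data, roughly proportional with some scatter in both variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.8, 3.3, 3.9, 5.1]

print("y on x:       ", slope_y_on_x(xs, ys))
print("perpendicular:", slope_perpendicular(xs, ys))
print("x on y:       ", slope_x_on_y(xs, ys))
```

For positively correlated data the perpendicular slope lands between the two regression slopes, and all three coincide only when the points lie exactly on a line through the origin. If x and y were in different units, the perpendicular fit would depend on the scaling chosen, which is why the same-units assumption matters.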