[scikit-learn] Analysis of sklearn and other python libraries on github by MS team

Andreas Mueller t3kcit at gmail.com
Mon Mar 30 10:30:09 EDT 2020



On 3/27/20 6:20 PM, Gael Varoquaux wrote:
> Thanks for the link Andy. This is indeed very interesting!
>
> On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
>>> Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
>>> MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this
>>> order).
>> Maybe the LinearRegression docstring should more strongly suggest using
>> Ridge with a small regularization in practice.
> Yes! I actually wonder if we should not remove LinearRegression. It
> frightens me a bit that so many people use it. The only time that I've
> seen it used in a scientific paper, it was a mistake and it shouldn't
> have been used.
>
> I seldom advocate for deprecating :).
>
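
(For concreteness, Roman's suggestion above amounts to something like
the sketch below; the toy data and the alpha value are arbitrary
placeholders, not a recommendation.)

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
X[:, 4] = X[:, 0] + 1e-8 * rng.randn(100)  # nearly collinear columns
y = X[:, 0] + rng.randn(100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1e-3).fit(X, y)  # small regularization
print(ols.coef_)    # the near-collinear pair is poorly determined; coefficients can blow up
print(ridge.coef_)  # the penalty keeps them bounded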

People use sklearn for inference. I'm not sure we should deprecate this
use case, even though it's not our primary motivation.

Also, there's an inconsistency here: LogisticRegression has an L2
penalty by default (to the annoyance of some), while LinearRegression
does not. We have discussed the meaning of the different classes for
linear models several times; they are certainly not consistent. For the
squared loss, ridge, lasso, and plain least squares are three separate
classes (Ridge, Lasso, LinearRegression), while for the log loss all
three penalties live in a single class, LogisticRegression.
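
To make the asymmetry concrete, just inspecting the current defaults
(nothing here is proposed API):

from sklearn.linear_model import LinearRegression, LogisticRegression

print(LogisticRegression().penalty)  # 'l2' -- regularized by default
print(LinearRegression())            # no penalty parameter at all
# squared loss: one class per penalty (LinearRegression, Ridge, Lasso);
# log loss: one class, penalty chosen via LogisticRegression(penalty=...)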

I think that, to many people, "use statsmodels" is not a satisfying
answer.

I have seen people argue that linear regression or logistic regression
should throw an error on collinear data, and I think that's not in the
spirit of sklearn (even though we had this as a warning in discriminant
analysis until recently). But we should probably signal this more
clearly.
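
For illustration, the current behavior on exactly collinear data is to
fit silently (a toy example, not a proposal):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])  # second column = 2 * first
y = np.array([1.0, 2.0, 3.0])

model = LinearRegression().fit(X, y)  # no error, no warning
print(model.coef_)  # one of infinitely many coefficient vectors that fit y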

Our documentation doesn't really emphasize the prediction vs. inference
distinction enough, I think.

Btw, we could also make our linear regression more stable by using the
minimum-norm solution via the SVD.
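
A minimal sketch of that idea in NumPy terms: np.linalg.lstsq is
SVD-based and returns the minimum-norm solution on rank-deficient
designs (the data below is an arbitrary toy example).

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])  # rank-deficient design
y = np.array([1.0, 2.0, 3.0])

coef, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # [0.2, 0.4]: the unique minimum-norm least-squares solution
print(rank)  # 1: the SVD exposes the rank deficiency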

