[scikit-learn] Analysis of sklearn and other python libraries on github by MS team

Andreas Mueller t3kcit at gmail.com
Mon Mar 30 10:35:43 EDT 2020


Also see https://github.com/scikit-learn/scikit-learn/issues/14268
which is discussing how to make things faster *and* more stable!


On 3/30/20 10:30 AM, Andreas Mueller wrote:
>
>
> On 3/27/20 6:20 PM, Gael Varoquaux wrote:
>> Thanks for the link Andy. This is indeed very interesting!
>>
>> On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
>>>> Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
>>>> MultinomialNB, SVC, LinearRegression, and RandomForestClassifier
>>>> (in this order).
>>> Maybe the LinearRegression docstring should more strongly suggest using
>>> Ridge with small regularization in practice.
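A minimal sketch of the point above, on hypothetical near-collinear data: plain least squares can produce wildly inflated coefficients, while Ridge with a tiny alpha keeps them bounded at essentially no cost in fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical near-collinear design: the second column is almost
# an exact copy of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
X = np.hstack([x, x + 1e-8 * rng.normal(size=(50, 1))])
y = x.ravel() + 0.1 * rng.normal(size=50)

# Plain OLS: the tiny singular value lets noise blow up the coefficients.
ols = LinearRegression().fit(X, y)

# Ridge with a very small penalty: shrinkage along the near-null
# direction keeps the coefficients at a sensible scale.
ridge = Ridge(alpha=1e-6).fit(X, y)

print(np.abs(ols.coef_).max(), np.abs(ridge.coef_).max())
```

Both models predict nearly identically on this data; only the coefficient scale differs, which is exactly what matters for anyone reading the coefficients off for inference.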
>> Yes! I actually wonder if we should not remove LinearRegression. It's a
>> bit frightening to me that so many people use it. The only time that I've
>> seen it used in a scientific paper, it was a mistake and it shouldn't
>> have been used.
>>
>> I seldom advocate for deprecating :).
>>
>
> People use sklearn for inference. I'm not sure we should deprecate
> this use case even though it's not our primary motivation.
>
> Also, there's an inconsistency here: LogisticRegression has an L2
> penalty by default (to the annoyance of some), while LinearRegression
> does not. We have discussed the meaning of the different classes for
> linear models several times; they are certainly not consistent (Ridge,
> Lasso, and LinearRegression are three separate classes for the squared
> loss, while all three penalties are options within LogisticRegression
> for the log loss).
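The asymmetry described above can be seen directly in the scikit-learn API: three classes for the squared loss, one parameterized class for the log loss.

```python
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, LogisticRegression,
)

# Squared loss: the penalty determines which class you instantiate,
# and the most commonly used class has no regularization at all.
ols = LinearRegression()        # no penalty
ridge = Ridge(alpha=1.0)        # L2 penalty
lasso = Lasso(alpha=1.0)        # L1 penalty

# Log loss: one class, penalty chosen by a parameter, L2 by default.
logreg_default = LogisticRegression()  # penalty='l2'
logreg_l1 = LogisticRegression(penalty='l1', solver='liblinear')
# Unpenalized log loss (penalty=None in sklearn >= 1.2; earlier
# versions used the string 'none'):
logreg_none = LogisticRegression(penalty=None)

print(logreg_default.penalty)
```

So a user who reaches for the "obvious" class gets no regularization for regression but L2 regularization for classification, which is the inconsistency being discussed.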
>
> I think to many, "use statsmodels" is not a satisfying answer.
>
> I have seen people argue that linear regression or logistic regression
> should throw an error on collinear data, and I think that's not in the
> spirit of sklearn (even though we had this as a warning in
> discriminant analysis until recently).
> But we should probably signal this more clearly.
>
> Our documentation doesn't really emphasize the prediction vs inference 
> point enough, I think.
>
> Btw, we could also make our linear regression more stable by using the
> minimum-norm solution via the SVD.
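A sketch of that last suggestion, on hypothetical data with an exactly duplicated column: the SVD-based pseudo-inverse picks, among all least-squares solutions, the one with the smallest coefficient norm, so collinearity no longer produces arbitrary coefficients.

```python
import numpy as np

# Hypothetical rank-deficient design: the third column duplicates the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
X = np.hstack([X, X[:, :1]])
y = X @ np.array([1.0, 2.0, 0.0]) + 0.01 * rng.normal(size=10)

# Minimum-norm least squares via the SVD: invert only the singular
# values above a numerical-rank tolerance, zero out the rest.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
tol = s.max() * max(X.shape) * np.finfo(s.dtype).eps
s_inv = np.where(s > tol, 1.0 / s, 0.0)
w = Vt.T @ (s_inv * (U.T @ y))

# np.linalg.lstsq computes the same minimum-norm solution internally.
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_ref))
```

Note how the weight on the duplicated feature is split evenly between the two identical columns: that is the minimum-norm choice, and it is stable, whereas an unregularized solver could return any split between them.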


