Analysis of sklearn and other python libraries on github by MS team
Hey all. There's a pretty cool paper by a team at MS that analyses public github repos for their use of the sklearn and related libraries: https://arxiv.org/abs/1912.09536 Thought it might be of interest. Cheers, Andy
Very interesting! A few comments:
From GH17, we managed to extract only 10.5k pipelines. The relatively low frequency (with respect to the number of notebooks using SCIKIT-LEARN [..]) indicates a non-wide adoption of this specification. However, the number of pipelines in the GH19 corpus is 132k pipelines (i.e., an increase of 13× [..] since 2017).
It's nice to see that pipelines are indeed widely used.
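For readers unfamiliar with the specification being counted here, a minimal sketch of the Pipeline API (the dataset and steps are illustrative, not from the paper):

```python
# Minimal illustration of the sklearn Pipeline API whose usage the
# paper counts in the GH17/GH19 corpora.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Chaining a transformer and an estimator in one object keeps the
# preprocessing inside cross-validation, avoiding data leakage.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
```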
Top-5 transformers [from imports] in GH19 are StandardScaler, CountVectorizer, TfidfTransformer, PolynomialFeatures, and TfidfVectorizer (in that order). The results for GH17 are the same, except that PCA appears in place of TfidfVectorizer.
Hmm, I would have expected OneHotEncoder somewhere near the top, and much less text processing. If CountVectorizer and TfidfTransformer really are used separately, then maybe TfidfVectorizer could be deprecated: https://github.com/scikit-learn/scikit-learn/issues/14951 Still, this ranking looks quite unexpected. I wonder if they have the full list and not just the top 5.
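The deprecation argument rests on the fact that TfidfVectorizer is, by construction, CountVectorizer followed by TfidfTransformer. A quick sketch verifying the equivalence with default parameters:

```python
# TfidfVectorizer is equivalent to CountVectorizer followed by
# TfidfTransformer (with matching defaults), which is the basis of
# the deprecation argument in issue #14951.
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the cat sat", "the dog sat", "cats and dogs"]

combined = TfidfVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(docs))

# Both routes produce the same tf-idf matrix.
assert np.allclose(combined.toarray(), two_step.toarray())
```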
Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression, MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this order).
Maybe the LinearRegression docstring should more strongly suggest using Ridge with small regularization in practice.

--
Roman

On 27/03/2020 17:32, Andreas Mueller wrote:
Thanks for the link Andy. This is indeed very interesting! On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
Yes! I actually wonder if we should not remove LinearRegression. It frightens me a bit that so many people use it. The only time that I've seen it used in a scientific paper, it was a mistake and it shouldn't have been used. I seldom advocate for deprecating :).

G
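Roman's Ridge-with-small-regularization suggestion can be sketched on deliberately collinear data (the data here is synthetic, for illustration only): a tiny L2 penalty keeps the problem well-posed while barely biasing the fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X = np.hstack([X, X[:, :1]])  # third column duplicates the first: exactly collinear
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

# On collinear data the OLS coefficients are not unique; a tiny L2
# penalty makes the problem well-posed with negligible bias.
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1e-6).fit(X, y)

# Predictions agree closely even though the coefficient vectors
# need not be identical.
assert np.allclose(ols.predict(X), ridge.predict(X), atol=1e-2)
```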
On 3/27/20 6:20 PM, Gael Varoquaux wrote:
People use sklearn for inference. I'm not sure we should deprecate this use case even though it's not our primary motivation.

Also, there's an inconsistency here: LogisticRegression has an L2 penalty by default (to the annoyance of some), while LinearRegression does not. We have discussed the meaning of the different classes for linear models several times; they are certainly not consistent (Ridge, Lasso and LinearRegression are three separate classes for the squared loss, while all three penalties live in LogisticRegression for the log loss).

I think to many people, "use statsmodels" is not a satisfying answer.

I have seen people argue that linear regression or logistic regression should throw an error on collinear data, and I think that's not in the spirit of sklearn (even though we had such a warning in discriminant analysis until recently). But we should probably signal this more clearly. Our documentation doesn't really emphasize the prediction-vs-inference point enough, I think.

Btw, we could also make our linear regression more stable by using the minimum-norm solution via the SVD.
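The minimum-norm idea in the last point can be sketched with NumPy's SVD-based least squares (the rank-deficient matrix here is fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(30, 2))
A = np.hstack([A, A.sum(axis=1, keepdims=True)])  # rank-deficient: col 2 = col 0 + col 1
b = A @ np.array([1.0, 1.0, 0.0]) + rng.normal(scale=0.01, size=30)

# The normal-equations route would fail here, since A.T @ A is singular.
assert np.linalg.matrix_rank(A) == 2

# SVD-based least squares returns the minimum-norm solution, which is
# well-defined even for rank-deficient A.
coef, *_ = np.linalg.lstsq(A, b, rcond=None)
```

`np.linalg.pinv(A) @ b` gives the same minimum-norm solution; the point is that an SVD-based solver degrades gracefully on collinear data instead of being numerically unstable.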
Also see https://github.com/scikit-learn/scikit-learn/issues/14268 which discusses how to make things faster *and* more stable!

On 3/30/20 10:30 AM, Andreas Mueller wrote:
participants (3)
- Andreas Mueller
- Gael Varoquaux
- Roman Yurchak