[scikit-learn] Analysis of sklearn and other python libraries on github by MS team

Roman Yurchak rth.yurchak at gmail.com
Fri Mar 27 13:10:28 EDT 2020


Very interesting! A few comments:

 > From GH17, we managed to extract only 10.5k pipelines.  The 
relatively low frequency (with respect to the number of notebooks using 
SCIKIT-LEARN [..]) indicates a non-wide adoption of this specification. 
However, the number of pipelines in the GH19 corpus is 132k pipelines 
(i.e., an increase of 13× [..] since 2017).

It's nice to see that pipelines are indeed widely used.

 > Top-5 transformers [from imports] in GH19 are StandardScaler, 
CountVectorizer, TfidfTransformer, PolynomialFeatures, TfidfVectorizer 
(in this order).  Same are the results for GH17 with the difference that 
PCA is instead of TfidfVectorizer.

Hmm, I would have expected OneHotEncoder somewhere near the top and much 
less text processing. If CountVectorizer and TfidfTransformer really are 
used separately that often, then maybe deprecating TfidfVectorizer would 
be feasible: https://github.com/scikit-learn/scikit-learn/issues/14951 
Still, this ranking looks quite unexpected. I wonder if they have the 
full list and not just the top 5.
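
For context on the deprecation idea: with default parameters, 
TfidfVectorizer should give the same result as CountVectorizer followed 
by TfidfTransformer. A minimal sketch of that equivalence (the toy 
documents and variable names are made up for illustration and are not 
taken from the paper):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Toy corpus, purely illustrative.
docs = [
    "pipelines are widely used",
    "text processing with pipelines",
    "one hot encoding of categories",
]

# Two-step variant: raw token counts, then tf-idf reweighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
X_two_step = two_step.fit_transform(docs)

# One-step variant with default parameters.
X_one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two sparse matrices after densifying (fine for toy data).
same = np.allclose(X_two_step.toarray(), X_one_step.toarray())
print("identical output:", same)
```

This only checks the default settings; once parameters are customized, 
they have to be split correctly between the two steps to stay 
equivalent.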

 > Regarding learners, Top-5 in both GH17 and GH19 are 
LogisticRegression, MultinomialNB, SVC, LinearRegression, and 
RandomForestClassifier (in this order).

Maybe the LinearRegression docstring should more strongly suggest using 
Ridge with a small regularization in practice.
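
To illustrate why I would suggest this, here is a rough sketch on 
synthetic data with two nearly collinear features. The data, the seed 
and alpha=1e-3 are arbitrary choices for the example, not a 
recommendation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)

# Two almost identical features: an ill-conditioned design matrix.
x = rng.normal(size=200)
X = np.column_stack([x, x + 1e-6 * rng.normal(size=200)])

# The target depends on the shared signal with slope 3, plus noise.
y = 3.0 * x + 0.1 * rng.normal(size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1e-3).fit(X, y)  # small, arbitrary regularization

# Unregularized coefficients can blow up along the near-degenerate
# direction, while a small penalty keeps them bounded.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficients:  ", ols.coef_, "norm:", ols_norm)
print("Ridge coefficients:", ridge.coef_, "norm:", ridge_norm)
```

On data like this I would expect the Ridge coefficients to sum to 
roughly 3 and split about evenly between the two features, whereas the 
LinearRegression coefficients are typically much larger in magnitude 
and of opposite signs.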

-- 
Roman

On 27/03/2020 17:32, Andreas Mueller wrote:
> Hey all.
> There's a pretty cool paper by a team at MS that analyses public github 
> repos for their use of the sklearn and related libraries:
> https://arxiv.org/abs/1912.09536
> 
> Thought it might be of interest.
> 
> Cheers,
> Andy


