[scikit-learn] Analysis of sklearn and other python libraries on github by MS team
Roman Yurchak
rth.yurchak at gmail.com
Fri Mar 27 13:10:28 EDT 2020
Very interesting! A few comments:
> From GH17, we managed to extract only 10.5k pipelines. The
relatively low frequency (with respect to the number of notebooks using
SCIKIT-LEARN [..]) indicates a non-wide adoption of this specification.
However, the number of pipelines in the GH19 corpus is 132k pipelines
(i.e., an increase of 13× [..] since 2017).
It's nice to see that pipelines are indeed widely used.
> Top-5 transformers [from imports] in GH19 are StandardScaler,
CountVectorizer, TfidfTransformer, PolynomialFeatures, TfidfVectorizer
(in this order). Same are the results for GH17 with the difference that
PCA is instead of TfidfVectorizer.
Hmm, I would have expected OneHotEncoder somewhere near the top and much
less text processing. If CountVectorizer and TfidfTransformer really are
used separately this often, then maybe TfidfVectorizer could be
deprecated (see https://github.com/scikit-learn/scikit-learn/issues/14951).
Still, this ranking looks quite unexpected. I wonder if they have the
full list and not just the top 5.
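
For context on why the two could be considered redundant: with default
parameters, TfidfVectorizer should give the same matrix as
CountVectorizer followed by TfidfTransformer. A minimal sketch to check
this, where the toy documents are made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for this example only.
docs = [
    "pipelines are widely used",
    "text processing with pipelines",
    "one hot encoding is not text processing",
]

# Two-step variant: raw token counts, then TF-IDF reweighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
X_two_step = two_step.fit_transform(docs)

# One-step variant with default parameters.
X_one_step = TfidfVectorizer().fit_transform(docs)

# Compare the dense versions of both sparse matrices.
same = np.allclose(X_two_step.toarray(), X_one_step.toarray())
print("identical output:", same)
```

The equivalence only holds when both variants are given matching
parameters (tokenization, norm, idf smoothing, etc.).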
> Regarding learners, Top-5 in both GH17 and GH19 are
LogisticRegression, MultinomialNB, SVC, LinearRegression, and
RandomForestClassifier (in this order).
Maybe the LinearRegression docstring should more strongly suggest using
Ridge with a small amount of regularization in practice.
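
One situation where this matters is nearly collinear features. The
sketch below is only an illustration under that assumption: the
synthetic data, the noise levels, and alpha=1e-3 are arbitrary choices
for the example, not recommendations from the paper or the docs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)

# Two almost identical features (assumed setup for illustration).
x = rng.normal(size=50)
X = np.column_stack([x, x + 1e-8 * rng.normal(size=50)])

# The target depends on the shared signal, plus some noise.
y = 2 * x + 0.1 * rng.normal(size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1e-3).fit(X, y)  # small, arbitrary regularization

# Unregularized least squares can assign very large coefficients of
# opposite sign to the two features; Ridge keeps them moderate.
print("LinearRegression coef_:", ols.coef_)
print("Ridge coef_:           ", ridge.coef_)
```

On well-conditioned data the two estimators should give nearly the same
coefficients for such a small alpha, so the switch costs little.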
--
Roman
On 27/03/2020 17:32, Andreas Mueller wrote:
> Hey all.
> There's a pretty cool paper by a team at MS that analyses public github
> repos for their use of the sklearn and related libraries:
> https://arxiv.org/abs/1912.09536
>
> Thought it might be of interest.
>
> Cheers,
> Andy