[scikit-learn] Pipegraph example: KMeans + LDA
Manuel Castejón Limas
manuel.castejon at gmail.com
Mon Oct 29 11:08:01 EDT 2018
Long story short: thank you for your time, and sorry for the inaccuracies; a
few words selling a modular approach to your developments; and a request for
your opinion on parallelizing PipeGraph using dask.
Thank you, Andreas, for your patience in showing me the sklearn ways. I
admit that I'm still learning scikit-learn's capabilities, which is a moving
target as you all keep improving the library, as in this new release. Keep
up the good work with your developments and your teaching to the community.
In particular, I learned A LOT from your answer. Big thanks!
I'm inlining my comments:
> - KMeans is not a transformer but an estimator
> KMeans is a transformer in sklearn:
> (you can't get the labels to be the output which is what you're doing
> here, but it is a transformer)
My bad! I saw the predict method and did not check the source code. It is
true; from the code:
class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
The point was that, as you guessed, one cannot put a KMeans followed by an
LDA in a pipeline just like that, without additional effort.
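For what it's worth, a minimal sketch of that distinction (standard scikit-learn API; the data and parameters are made up for illustration): transform() yields distances to the cluster centers, while the labels an LDA would want as y come from predict():

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform() returns distances to the 3 cluster centers, not labels,
# so this is what a downstream step in a Pipeline would receive
print(km.transform(X).shape)   # (100, 3)

# the cluster labels themselves come from predict() / labels_
print(km.predict(X).shape)     # (100,)
```

So a plain Pipeline(KMeans, LDA) feeds LDA distance features as X, but there is no mechanism to route the cluster labels into LDA's y.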
> - LDA score function requires the y parameter, while its input does not
> come from a known set of labels, but from the previous KMeans
> - Moreover, the GridSearchCV.fit call would also require a 'y' parameter
> Not true if you provide a scoring that doesn't require y or if you don't
> specify scoring and the scoring method of the estimator doesn't require y.
> GridSearchCV.fit doesn't require y.
My bad again. I meant that, without the scoring function and the CV
iterator that you use below, GridSearchCV will call the scoring function
of the final step, i.e. LDA, and LDA's scoring function wants a y. But
please bear with me, I simply did not know the proper hacks. The test does
not lie; I'm quite a newbie, then :-)
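A sketch of the kind of hack I was missing: a custom scoring callable that ignores y (here silhouette, chosen just for illustration) lets GridSearchCV.fit run on X alone, with no labels anywhere:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# scorer with the (estimator, X, y) signature; y is ignored entirely,
# so GridSearchCV never needs labels
def silhouette_scorer(estimator, X, y=None):
    return silhouette_score(X, estimator.predict(X))

grid = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4, 5]},
    scoring=silhouette_scorer,
    cv=3,
)
grid.fit(X)  # no y needed
print(grid.best_params_)
```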
> Can you provide a citation for that? That seems to heavily depend on the
> clustering algorithms and the classifier.
> To me, stability scoring seems more natural:
Good to know, thank you for the reference. You are right about the
dependence: it's all about the nature of the clustering and the classifier.
But I was just providing a scenario, not necessarily advocating for this
strategy as the solution to the number-of-clusters question.
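Purely as an illustration of the stability idea (my own toy sketch, not the exact procedure from the reference): fit the clustering on two random subsamples, compare the two models' assignments on the full data with the adjusted Rand index, and average over a few pairs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
rng = np.random.RandomState(0)

def stability(X, n_clusters, n_pairs=5):
    """Mean ARI between clusterings fit on random half-subsamples."""
    scores = []
    for _ in range(n_pairs):
        a = rng.choice(len(X), size=len(X) // 2, replace=False)
        b = rng.choice(len(X), size=len(X) // 2, replace=False)
        km_a = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X[a])
        km_b = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X[b])
        # agreement of the two models' label assignments on the full data
        scores.append(adjusted_rand_score(km_a.predict(X), km_b.predict(X)))
    return float(np.mean(scores))

for k in (2, 3, 4, 5):
    print(k, round(stability(X, k), 3))
```

A stable k should keep its assignments roughly consistent across subsamples; an unstable one will not.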
> It's cool that this is possible, but I feel this is still not really a
> "killer application" in that this is not a very common pattern.
IMHO, the beauty of the example, if there is any :-D, was its simplicity
and brevity. I agree that it is not a killer application, just a possible
scenario.
> Though I acknowledge that your code only takes 4 lines, while mine takes 8
> (though if we'd add NoSplitCV to sklearn mine would also only take 4 lines).
> I think pipegraph is cool, not meaning to give you a hard time ;)
Thank you again for your time. The thing is that I believe PipeGraph can be
useful to you as a way of building your models in a modular fashion. I'm
going to work on a second example, implementing something similar to the
VotingClassifier class, to show you the approach.
The main weakness is the lack of parallelism in the inner workings of
PipeGraph, which was never a concern for me: as long as GridSearchCV could
parallelize the training, I was OK with that grain size. But now I reckon
that parallelization can be useful for you in terms of expressing your
models as a PipeGraph and getting parallelization for free, without having
to call joblib directly (thank you, joblib authors, for such a goodie). I
guess that providing a dask backend for PipeGraph would be nice. But let
me continue with this issue after sending the VotingClassifier example :-)
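A sketch of what I have in mind, assuming joblib's standard backend mechanism: once dask.distributed registers its joblib backend (via a dask.distributed.Client), the same with parallel_backend(...) block should hand the grid-search fits to dask. The snippet below uses the built-in "threading" backend so it runs without dask installed; the inertia-based scorer is just a placeholder:

```python
from joblib import parallel_backend
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

grid = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4]},
    # placeholder y-free scorer: lower inertia -> higher score
    scoring=lambda est, X, y=None: -est.inertia_,
    n_jobs=2,
)

# swap "threading" for "dask" once dask.distributed has registered its
# backend; the GridSearchCV code itself would not change at all
with parallel_backend("threading"):
    grid.fit(X)
print(grid.best_params_)
```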
Thanks, truly, I need to study hard!