[scikit-learn] Pipegraph example: KMeans + LDA

Andreas Mueller t3kcit at gmail.com
Sun Oct 28 22:13:36 EDT 2018

On 10/24/18 4:11 AM, Manuel Castejón Limas wrote:
> Dear all,
> as a way of improving the documentation of PipeGraph we intend to 
> provide more examples of its usage. It was a popular demand to show 
> application cases to motivate its usage, so here is a very simple 
> case with two steps: a KMeans followed by an LDA.
> https://mcasl.github.io/PipeGraph/auto_examples/plot_Finding_Number_of_clusters.html#sphx-glr-auto-examples-plot-finding-number-of-clusters-py
> This short example points out the following challenges:
> - KMeans is not a transformer but an estimator

KMeans is a transformer in sklearn (you can't get the labels to be the 
output, which is what you're doing here, but it is a transformer).
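Quick illustration on toy data (just to show what transform actually 
gives you):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
km = KMeans(n_clusters=3, random_state=0).fit(X)

print(km.transform(X).shape)  # (100, 3): distances to each cluster center
print(km.predict(X).shape)    # (100,): the labels, which transform won't give you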

> - LDA score function requires the y parameter, while its input does 
> not come from a known set of labels, but from the previous KMeans
> - Moreover, the GridSearchCV.fit call would also require a 'y' parameter

That's not true if you provide a scoring function that doesn't require 
y, or if you don't specify scoring and the estimator's score method 
doesn't require y.

GridSearchCV.fit doesn't require y.
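For example, KMeans.score (negative inertia) doesn't need y, so a plain 
grid search runs fine without one (toy sketch):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# default scoring falls back to KMeans.score, which ignores y
grid = GridSearchCV(KMeans(random_state=0), {'n_clusters': [2, 3, 4]}, cv=3)
grid.fit(X)  # no y passed

(Inertia only improves with more clusters, so that particular criterion 
won't pick a sensible number of clusters, but fit clearly works without y.)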

> - It would be nice to have access to the output of the KMeans step as 
> well.
> PipeGraph is capable of addressing these challenges.
> The rationale for this example lies in the 
> identification-reconstruction realm. In a scenario where the class 
> labels are unknown, we might want to associate the quality of the 
> clustering structure to the capability of a later model to be able to 
> reconstruct this structure. So the basic idea here is that if LDA is 
> capable of getting good results, it is because the information from the 
> KMeans was good enough for that purpose, hinting at the discovery of a 
> good structure.
Can you provide a citation for that? That seems to heavily depend on the 
clustering algorithms and the classifier.
To me, stability scoring seems more natural: https://arxiv.org/abs/1007.1075

This does seem interesting as well, though; I haven't thought about this 
before.

It's cool that this is possible, but I feel this is still not really a 
"killer application" in that this is not a very common pattern.

Also, you could replicate something similar in sklearn with:

import numpy as np
from sklearn.model_selection import cross_val_score

def estimator_scorer(testing_estimator):
    def my_scorer(estimator, X, y=None):
        # use the fitted clusterer's own labels as pseudo-targets
        y = estimator.predict(X)
        # score how well testing_estimator can recover those labels
        return np.mean(cross_val_score(testing_estimator, X, y))
    return my_scorer
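and then plug it in roughly like this (untested sketch; 
LinearDiscriminantAnalysis standing in for your LDA step, X being your 
unlabeled data):

from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(KMeans(random_state=0),
                      {'n_clusters': list(range(2, 10))},
                      scoring=estimator_scorer(LinearDiscriminantAnalysis()))
search.fit(X)  # no y needed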

Though using that, we'd be doing nested cross-validation on the test set...
That's a bit of an issue in the current GridSearchCV implementation :-/ 
There's an issue by Joel somewhere about implementing something that 
allows training without splitting, which is what you'd want here.
You could run the outer grid-search with a custom cross-validation 
iterator that returns all indices as both training and test set, doing 
only a single split, though:

from sklearn.utils.validation import _num_samples

class NoSplitCV(object):

    def split(self, X, y=None, groups=None):
        # a single "split" where train and test are both the full dataset
        indices = np.arange(_num_samples(X))
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1
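So the whole thing would look roughly like this (same untested caveat as 
above, reusing the names from the earlier sketch):

search = GridSearchCV(KMeans(random_state=0),
                      {'n_clusters': list(range(2, 10))},
                      scoring=estimator_scorer(LinearDiscriminantAnalysis()),
                      cv=NoSplitCV())
search.fit(X)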

Though I acknowledge that your code only takes 4 lines, while mine takes 
8 (though if we added NoSplitCV to sklearn, mine would also only take 4 
lines :P)

I think pipegraph is cool, not meaning to give you a hard time ;)
