[scikit-learn] Pipegraph example: KMeans + LDA
t3kcit at gmail.com
Sun Oct 28 22:13:36 EDT 2018
On 10/24/18 4:11 AM, Manuel Castejón Limas wrote:
> Dear all,
> as a way of improving the documentation of PipeGraph we intend to
> provide more examples of its usage. It was a popular demand to show
> application cases to motivate its usage, so here is a very simple
> case with two steps: a KMeans followed by an LDA.
> This short example points out the following challenges:
> - KMeans is not a transformer but an estimator
KMeans is a transformer in sklearn:
(you can't get the labels to be the output which is what you're doing
here, but it is a transformer)
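
To illustrate the distinction, here is a minimal sketch (the toy data from make_blobs and the parameter values are assumptions for illustration, not part of the original example). KMeans.transform returns the distance of each sample to each cluster center, which is what a Pipeline would pass downstream, while the labels only come out of predict:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data, assumed purely for illustration.
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform() gives distances to the cluster centers: shape (100, 3).
distances = km.transform(X)

# The cluster labels are only available through predict(): shape (100,).
labels = km.predict(X)

print(distances.shape, labels.shape)
```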
> - LDA score function requires the y parameter, while its input does
> not come from a known set of labels, but from the previous KMeans
> - Moreover, the GridSearchCV.fit call would also require a 'y' parameter
Not true if you provide a scoring that doesn't require y or if you don't
specify scoring and the scoring method of the estimator doesn't require y.
GridSearchCV.fit doesn't require y.
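
A small sketch of that, assuming toy data from make_blobs and an arbitrary parameter grid (neither is from the original example). With no scoring given, GridSearchCV falls back to KMeans.score, which does not need y:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV

# Toy data, assumed purely for illustration.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# No scoring argument: the estimator's own score method is used,
# and KMeans.score(X) works without labels.
search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4]},
    cv=3,
)
search.fit(X)  # no y passed

print(search.best_params_)
```

Note that the default KMeans score tends to favor more clusters, so this only shows that the call works, not that the default score is a sensible model-selection criterion.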
> - It would be nice to have access to the output of the KMeans step as
> PipeGraph is capable of addressing these challenges.
> The rationale for this example lies in the
> identification-reconstruction realm. In a scenario where the class
> labels are unknown, we might want to associate the quality of the
> clustering structure to the capability of a later model to be able to
> reconstruct this structure. So the basic idea here is that if LDA is
> capable of getting good results it was because the information of the
> KMeans was good enough for that purpose, hinting at the discovery of a
> good structure.
Can you provide a citation for that? That seems to heavily depend on the
clustering algorithms and the classifier.
To me, stability scoring seems more natural: https://arxiv.org/abs/1007.1075
This does seem interesting as well, though; I haven't thought about this.
It's cool that this is possible, but I feel this is still not really a
"killer application" in that this is not a very common pattern.
Also you could replicate something similar in sklearn with
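    import numpy as np
    from sklearn.model_selection import cross_val_score

    def my_scorer(estimator, X, y=None):
        y = estimator.predict(X)  # cluster labels act as the target
        return np.mean(cross_val_score(testing_estimator, X, y))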
Though using that we'd be doing nested cross-validation on the test set...
That's a bit of an issue in the current GridSearchCV implementation :-/
There's an issue by Joel somewhere
to implement something that allows training without splitting which is
what you'd want here.
You could run the outer grid-search with a custom cross-validation
iterator that returns all indices as training and test set and only does
a single split, though...
    class NoSplitCV:
        def split(self, X, y=None, groups=None):
            indices = np.arange(_num_samples(X))
            yield indices, indices
Though I acknowledge that your code only takes 4 lines, while mine takes
8 (though if we'd add NoSplitCV to sklearn mine would also only take 4).
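
Putting the scorer and the no-split iterator together, a rough self-contained sketch could look like this. The toy data, the parameter grid, the name lda_reconstruction_scorer, the use of LinearDiscriminantAnalysis as the testing estimator, and the get_n_splits method are all assumptions added for illustration. NoSplitCV is not part of sklearn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, cross_val_score


class NoSplitCV:
    """Hypothetical CV iterator: one split, all samples as train and test."""

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1


def lda_reconstruction_scorer(estimator, X, y=None):
    # Use the cluster labels as the target and measure how well a
    # cross-validated LDA can reconstruct them.
    labels = estimator.predict(X)
    lda = LinearDiscriminantAnalysis()
    return np.mean(cross_val_score(lda, X, labels, cv=3))


# Toy data, assumed purely for illustration.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4]},
    scoring=lda_reconstruction_scorer,
    cv=NoSplitCV(),
)
search.fit(X)  # no y needed: the scorer derives labels from KMeans

print(search.best_params_, search.best_score_)
```

Because the outer "split" uses all samples for both fitting and scoring, the only real cross-validation here is the inner one done by the scorer.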
I think pipegraph is cool, not meaning to give you a hard time ;)