How to avoid recomputing a transformer in a Pipeline?
Hello! I use a 2-step Pipeline with an expensive transformer followed by a classifier, and I run GridSearchCV over the classification parameters. In theory, GridSearchCV could notice that I'm not touching any parameters of the transformer and avoid redoing that work by keeping the transformed X, right? Currently, GridSearchCV does a clean re-run of all Pipeline steps. Can you recommend the easiest way to use GridSearchCV with a Pipeline while avoiding recomputation of the transformer steps whose parameters are not part of the grid search? I realize this may be tricky, but any pointers on doing this conveniently and in a way compatible with sklearn would be highly appreciated! (The scoring has to be done on the initial data, so I cannot simply transform everything beforehand by hand.) Regards, Anton PS: If that all makes sense, would this be a useful feature to include in sklearn?
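[Editor's note: for concreteness, the setup being described might look like the following sketch. PCA stands in for the expensive transformer; all data and names are illustrative.]

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(60, 8)
y = rng.randint(0, 2, 60)

# PCA stands in for the expensive transformer.
pipe = Pipeline([("reduce", PCA(n_components=3)),
                 ("clf", LogisticRegression())])

# Only classifier parameters are searched, yet every candidate and every
# CV fold refits the transformer from scratch.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
```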
I use joblib.Memory for this purpose. I think that including a meta-transformer that embeds a joblib.Memory would be a good addition to scikit-learn.
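[Editor's note: a minimal sketch of what such a meta-transformer could look like. The class name `CachedTransformer` and the caching details are hypothetical, not an existing scikit-learn API; this assumes a recent joblib.]

```python
import tempfile
import numpy as np
from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler

memory = Memory(location=tempfile.mkdtemp(), verbose=0)

def _fit_transformer(transformer, X, y):
    # Fit a clone so the wrapped estimator is never mutated in place,
    # which would change its hash and poison the cache.
    return clone(transformer).fit(X, y)

_cached_fit = memory.cache(_fit_transformer)

class CachedTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical meta-transformer memoizing fit with joblib.Memory."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer_ = _cached_fit(self.transformer, X, y)
        return self

    def transform(self, X):
        return self.transformer_.transform(X)

X = np.array([[0.0], [2.0]])
ct = CachedTransformer(StandardScaler()).fit(X)
Xt = ct.transform(X)
```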
On 11/28/2016 10:46 AM, Gael Varoquaux wrote:
I use joblib.Memory for this purpose. I think that including a meta-transformer that embeds a joblib.Memory would be a good addition to scikit-learn. To cache the result of "transform"? You still have to call "fit" multiple times, right? Or would you cache the return of "fit" as well as "transform"? Caching "fit" with joblib seems non-trivial.
Or would you cache the return of "fit" as well as "transform"?
Caching fit rather than transform. Fit is usually the costly step.
Caching "fit" with joblib seems non-trivial.
Why? Caching a function that takes the estimator and X and y should do it. The transformer would clone the estimator on fit, to avoid side effects that would trigger recomputes. It's a pattern that I use often; I've just never coded a good transformer for it. In my use cases it works very well, provided that everything is nicely seeded. Also, the persistence across sessions is a real time saver.
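[Editor's note: the pattern described above, sketched under the assumption of a recent joblib. Function and variable names are illustrative.]

```python
import tempfile
import numpy as np
from joblib import Memory
from sklearn.base import clone
from sklearn.decomposition import PCA

memory = Memory(location=tempfile.mkdtemp(), verbose=0)

@memory.cache
def cached_fit(estimator, X, y=None):
    # Clone before fitting: the caller's estimator keeps its pristine
    # parameters, so equal inputs always hash to the same cache entry.
    return clone(estimator).fit(X, y)

rng = np.random.RandomState(0)  # seed everything so inputs hash reproducibly
X = rng.rand(100, 10)

pca = cached_fit(PCA(n_components=3), X)        # computed and stored on disk
pca_again = cached_fit(PCA(n_components=3), X)  # served from the cache
```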
On 11/28/2016 12:15 PM, Gael Varoquaux wrote:
Or would you cache the return of "fit" as well as "transform"? Caching fit rather than transform. Fit is usually the costly step.
Caching "fit" with joblib seems non-trivial. Why? Caching a function that takes the estimator and X and y should do it. The transformer would clone the estimator on fit, to avoid side-effects that would trigger recomputes. I guess so. You'd handle parameters using an estimator_params dict in init and pass that to the caching function?
It's a pattern that I use often, I've just never coded a good transformer for it.
On my usecases, it works very well, provided that everything is nicely seeded. Also, the persistence across sessions is a real time saver. Yeah for sure :)
On Mon, Nov 28, 2016 at 01:46:08PM -0500, Andreas Mueller wrote:
I guess so. You'd handle parameters using an estimator_params dict in init and pass that to the caching function?
I'd try to set them on the estimator before passing it to the function, as we do in standard scikit-learn; joblib is clever enough to take that into account when the estimator is given as an argument of the function that is memoized. G
Actually, thinking a bit about this, the inconvenience with the pattern that I lay out below is that it adds an extra indirection in the parameter setting. One way to avoid this would be to have a subclass of the pipeline that includes memoizing. It would call a memoized version of fit. I think that it would be quite handy :). Should I open an issue on that? G On Mon, Nov 28, 2016 at 07:51:21PM +0100, Gael Varoquaux wrote:
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info
http://twitter.com/GaelVaroquaux
A few brief points of history:

- We have had PRs #3951 <https://github.com/scikit-learn/scikit-learn/pull/3951> and #2086 <https://github.com/scikit-learn/scikit-learn/pull/2086> that build memoising into Pipeline in one way or another.
- Andy and I have previously discussed alternative ways to set parameters to avoid the indirection issues created by wrappers. This can be achieved by setting the parameter space on the estimator itself, or by indicating parameters to *SearchCV shallowly with respect to an estimator instance, rather than using an indirected path. See #5082 <https://github.com/scikit-learn/scikit-learn/issues/5082>.
- The indirection affects parameter setting as well as retrieving model attributes. My remember branch <https://github.com/jnothman/scikit-learn/commit/76cace9f104a575116492bea1a23...> gets around both indirections by creating a remember_transform wrapper, but it does so by hacking clone (as per #5080 <https://github.com/scikit-learn/scikit-learn/issues/5080>) and doing some other magic.

On 29 November 2016 at 09:17, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
On Tue, Nov 29, 2016 at 10:13:00AM +1100, Joel Nothman wrote:
- We have had PRs #3951 <https://github.com/scikit-learn/scikit-learn/pull/3951> and #2086 <https://github.com/scikit-learn/scikit-learn/pull/2086> that build memoising into Pipeline in one way or another.
Sorry, I had in mind that this had been discussed, but I hadn't realized there were PRs. I think that #3951 is a good start. I would have comments on it, but maybe I should make them on the PR.
- Andy and I have previously discussed alternative ways to set parameters to avoid indirection issues created by wrappers.
I feel that these approaches are much more invasive. The nice thing about a memoized pipeline is that it is a fairly local change. I'll comment on #3951 in terms of this specific realization, but we can discuss here if we want to take it further. Gaël
But note that the issue of model memoising isn't limited to Pipeline. On 29 November 2016 at 18:11, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
Hey Anton. Yes, that would be great to have. There is no solution implemented in scikit-learn right now, but there are at least two approaches that I know of. This (ancient and probably now defunct) PR: https://github.com/scikit-learn/scikit-learn/pull/3951 And using dask: http://matthewrocklin.com/blog/work/2016/07/12/dask-learn-part-1 Andy

On 11/28/2016 10:24 AM, Anton Suchaneck wrote:
Participants (4)

- Andreas Mueller
- Anton Suchaneck
- Gael Varoquaux
- Joel Nothman