Get parameters of classes in a Pipeline within cross_validate
Hi scikit-learners,

I have a simple Pipeline with feature selection and an SVC classifier, and I use it in a cross-validation scheme with the cross_validate / cross_val_score functions. I need to extract the selected features for each fold of the CV and, more generally, get information about the fitted elements of the pipeline in each CV fold.

Is there a way to get this information (e.g. fs.get_support() or fs.scores_), or do I need to build my own cross_validate function?

Thank you,
Roberto

--
Ing. Roberto Guidotti, PhD.
PostDoc Fellow
Institute for Advanced Biomedical Technologies - ITAB
Department of Neuroscience and Imaging
University of Chieti "G. D'Annunzio"
Via dei Vestini, 33
66013 Chieti, Italy
tel: +39 0871 3556919
e-mail: r.guidotti@unich.it; rguidotti@acm.org
linkedin: http://it.linkedin.com/in/robertogui/
twitter: @robbisg
github: https://github.com/robbisg
Hi all,

We are currently trying to add a feature to the metric-learn package (https://github.com/metric-learn/metric-learn) that would allow cross-validating Weakly Supervised Metric Learners with scikit-learn's cross-validation routines.

Distance metric learning algorithms learn a distance metric between samples, using some supervised information about the similarity between training samples. Some metric learning algorithms are weakly supervised (Weakly Supervised Metric Learners), i.e. they do not train on labeled samples but, for instance, on labeled *pairs* of samples (the label telling whether the pair consists of similar or dissimilar samples). To cross-validate these algorithms, we build the train and test sets by splitting on the pairs. Indeed, one use case of metric learning is to classify unseen pairs as similar or dissimilar at test time (those pairs can involve samples already seen during training).

For that, we made a dataset representation that makes it easy to slice on pairs of samples: we mock a 3D array of pairs of samples, of shape (n_constraints, 2, n_features), where each row is a pair of samples. We do so with an object that we called ConstrainedDataset, which is more memory efficient than the array it mocks (in which samples would be duplicated across pairs).

Now we have a problem when running scikit-learn's *check_estimator* on these algorithms: it launches a series of tests in which the estimator takes regular arrays as input, whereas Weakly Supervised Metric Learners always learn on ConstrainedDatasets (or, more generally, on pairs, or on tuples for some other algorithms).

We therefore thought of two main options (which could be combined) to solve this problem:

- keeping as many of the tests yielded by check_estimator as pass in our setting, and modifying the others by replacing array inputs with ConstrainedDatasets
- wrapping a Weakly Supervised Metric Learner in a MockSklearnEstimator that would transform any input array into a ConstrainedDataset before passing it to the underlying Weakly Supervised Metric Learner

However, neither option is really satisfying: the first will create a lot of code, after which one can no longer see at a glance whether the estimator passes scikit-learn's check_estimator, and the second adds so much wrapping that we are not really testing the Weakly Supervised Metric Learner itself.

For more information, see this PR where the new feature is being implemented, including the constraints.ConstrainedDataset object, as well as a comment on what is problematic when using scikit-learn's check_estimator: https://github.com/metric-learn/metric-learn/pull/85#issuecomment-375659820

Any advice about how to design the weakly supervised algorithms, the data structure containing the pairs of samples, or how to use scikit-learn's check_estimator anyway would be appreciated!

Thanks!

Best regards,
William
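To make the data structure described in this message concrete, here is a minimal sketch of an object that mocks an array of shape (n_constraints, 2, n_features) while storing each sample only once. It is not metric-learn's ConstrainedDataset: the class name PairsView, its attributes, and the toarray method are hypothetical names chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold


class PairsView:
    """Hypothetical stand-in for the pair container described above.

    Stores the samples once, plus an index array saying which two
    samples form each pair.
    """

    def __init__(self, X, pair_indices):
        self.X = np.asarray(X)  # (n_samples, n_features)
        # (n_constraints, 2); atleast_2d keeps a single pair two-dimensional
        self.pair_indices = np.atleast_2d(np.asarray(pair_indices))

    def __len__(self):
        return len(self.pair_indices)

    @property
    def shape(self):
        # Shape of the 3D array this object mocks
        return (len(self.pair_indices), 2, self.X.shape[1])

    def __getitem__(self, item):
        # Indexing selects pairs; the sample array is shared, not copied
        return PairsView(self.X, self.pair_indices[item])

    def toarray(self):
        # Materialize the full 3D array (samples are duplicated here)
        return self.X[self.pair_indices]


X = np.arange(15).reshape(5, 3)  # 5 samples, 3 features
pairs = PairsView(X, [[0, 1], [1, 2], [2, 3], [3, 4]])

# Split on pairs rather than on samples, as the message describes
splits = list(KFold(n_splits=2).split(np.arange(len(pairs))))
train_idx, test_idx = splits[0]
train_pairs, test_pairs = pairs[train_idx], pairs[test_idx]
```

Note that a sample can appear in both the train and the test pairs, which matches the use case above, where unseen pairs may involve already seen samples.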
Hi Roberto,

One option could be to write a wrapper and serialize your pipeline in the wrapper's fit method. After serialization you can load the pipeline at any time and inspect whatever you want. I have coded an example in the following gist:

https://gist.github.com/chkoar/2993a6e3f6bae1887eabc3fa27bb06a6

Best,
Chris

On Thu, Mar 29, 2018 at 12:16 PM, Roberto Guidotti <robbenson18@gmail.com> wrote:
> [...]
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
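The wrapper idea suggested in this reply can be sketched roughly as follows. This is not the code from the linked gist; the class name DumpingWrapper, the dump_dir parameter, and the file naming are assumptions made for illustration. The wrapper fits a clone of the pipeline, pickles the fitted copy to disk, and delegates prediction to it.

```python
import pickle
import tempfile
import uuid
from pathlib import Path

from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class DumpingWrapper(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper that pickles the fitted estimator on every fit."""

    def __init__(self, estimator, dump_dir):
        self.estimator = estimator
        self.dump_dir = dump_dir

    def fit(self, X, y):
        # Fit a fresh clone so the estimator passed to __init__ stays unfitted
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = self.estimator_.classes_
        # A random file name avoids collisions between folds
        path = Path(self.dump_dir) / f"fit-{uuid.uuid4().hex}.pkl"
        with open(path, "wb") as fh:
            pickle.dump(self.estimator_, fh)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)


X, y = make_classification(n_samples=100, n_features=20, random_state=0)
pipe = Pipeline([("fs", SelectKBest(f_classif, k=5)), ("svc", SVC())])

with tempfile.TemporaryDirectory() as dump_dir:
    cross_validate(DumpingWrapper(pipe, dump_dir), X, y, cv=3)
    # Load every pipeline that was fitted during cross-validation
    fitted = [pickle.loads(p.read_bytes()) for p in Path(dump_dir).glob("*.pkl")]

# Boolean mask of selected features for each fitted pipeline
supports = [p.named_steps["fs"].get_support() for p in fitted]
```

With random file names the fold order is not recorded; if the fold identity matters, some extra bookkeeping (for example, storing a hash of the training data next to each pickle) would be needed.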
Hi Chris,

Cool! I will try it very soon!

Thank you,
Roberto

On 2 April 2018 at 01:47, Chris Aridas <chris@aridas.eu> wrote:
> [...]
This is implemented in the current development version:
https://github.com/scikit-learn/scikit-learn/pull/9686

On 03/29/2018 05:16 AM, Roberto Guidotti wrote:
> [...]
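As a rough illustration of the built-in route, the sketch below assumes that the feature referred to here is the return_estimator parameter of cross_validate, which is available in scikit-learn 0.20 and later. The synthetic dataset, the step names "fs" and "svc", and the choice of SelectKBest are illustrative assumptions, not taken from the original question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
pipe = Pipeline([("fs", SelectKBest(f_classif, k=5)), ("svc", SVC())])

# return_estimator=True adds the pipeline fitted on each fold to the results
cv_results = cross_validate(pipe, X, y, cv=3, return_estimator=True)

for fold, fitted_pipe in enumerate(cv_results["estimator"]):
    fs = fitted_pipe.named_steps["fs"]
    # Indices of the features selected on this fold, and the univariate scores
    print(fold, fs.get_support(indices=True), fs.scores_.round(2))
```

The fitted pipelines come back in fold order, so the selected features can be matched to the corresponding entries of cv_results["test_score"].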
participants (4)
- Andreas Mueller
- Chris Aridas
- Roberto Guidotti
- wdevazel