Dear all,

Kudos to scikit-learn! Having said that, Pipeline is killing me: it cannot transform anything other than X.

My current case study would need:
- Transformers able to handle both X and y, e.g. clustering X and y concatenated
- Pipeline able to change other params, e.g. sample_weight

Currently I'm augmenting X through every step with the extra information, which seems to work for my_pipe.fit_transform(X_train, y_train) but breaks on my_pipe.transform(X_test) for lack of the y parameter. OK, I can inherit from the Pipeline class and write a descendant that allows the y parameter, which is not ideal, but I guess it is an option. The gritty part comes when having to adapt every regressor at the end of the ladder in order to split the extra information from the raw data in X, and not being able to generate more than one subproduct from each preprocessing step.

My current research involves clustering the data and using that classification along with X in order to predict outliers, which generates sample_weight info, and I would love to use that in the final regressor. Currently there seems to be no option other than pasting that info onto X.

All in all, I'm stuck with this API limitation and I would love to learn some tricks from you if you could enlighten me.

Thanks in advance!
Manuel Castejón-Limas
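To make the failure mode above concrete, here is a minimal, hypothetical sketch (the class and names are invented for illustration, not from the thread) of the augment-X workaround and exactly where it breaks: fit_transform can use y, but the plain transform(X) that Pipeline calls on test data has no y to work with.

```python
import numpy as np

class AugmentXy:
    """Toy transformer: fit_transform appends a y-derived label column
    to X (a stand-in for clustering [X | y]), but transform(X) cannot
    rebuild that column because Pipeline never passes y to it."""

    def fit_transform(self, X, y):
        labels = (y > np.median(y)).astype(int)  # stand-in for clustering X and y
        return np.column_stack([X, labels])      # augmented X, as in the thread

    def transform(self, X):
        # No y available here: the extra column cannot be recomputed.
        raise TypeError("transform() cannot rebuild y-derived features")

X = np.arange(6).reshape(6, 1).astype(float)
y = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
X_aug = AugmentXy().fit_transform(X, y)  # works: X gains a label column
# AugmentXy().transform(X) would raise, mirroring my_pipe.transform(X_test)
```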
Hey Manuel,

In imbalanced-learn we have an extra type of estimator, named Samplers, which are able to modify X and y at the same time through new API methods, sample and fit_sample. We have also adopted a modified version of scikit-learn's Pipeline class that allows subsequent transformations using samplers and transformers. Although the package deals with imbalanced datasets, the aforementioned objects may help your pipeline.

Cheerz,
Chris

On Tue, Dec 19, 2017 at 2:44 PM, Manuel Castejón Limas <manuel.castejon@gmail.com> wrote:
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
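As an illustration of the Sampler idea described above (a hypothetical minimal object, not imbalanced-learn's actual implementation), the key point is that fit_sample returns a modified X and y together, which an ordinary Transformer cannot do:

```python
import numpy as np

class RandomUnderSampler:
    """Toy sampler: balances classes by randomly dropping majority rows.
    Unlike a Transformer, fit_sample returns BOTH X and y, resampled."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_sample(self, X, y):
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        # keep n_min randomly chosen rows from each class
        keep = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
            for c in classes
        ])
        return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)
X_res, y_res = RandomUnderSampler().fit_sample(X, y)
# y_res now holds 3 samples of each class
```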
Wow, that seems promising. I'll read the imbalanced-learn code with interest.

Thanks for the info!
Manuel

2017-12-19 14:15 GMT+01:00 Christos Aridas <ichkoar@gmail.com>:
I think that you could use imbalanced-learn for the issue that you have with y. You should be able to wrap your clustering inside the FunctionSampler (https://github.com/scikit-learn-contrib/imbalanced-learn/pull/342 - we are on the way to merging it).

On 19 December 2017 at 13:44, Manuel Castejón Limas <manuel.castejon@gmail.com> wrote:
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
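To give an idea of the FunctionSampler pattern from that PR, here is a minimal stand-in (the wrapper and the outlier-dropping function below are illustrative sketches; the real imbalanced-learn API may differ in names and signatures): an arbitrary func(X, y) -> (X, y) is wrapped so it can sit among sampler steps.

```python
import numpy as np

class FunctionSampler:
    """Minimal stand-in: wraps an arbitrary func(X, y) -> (X, y) so a
    custom resampling/filtering step can be dropped into a sampler pipeline."""

    def __init__(self, func):
        self.func = func

    def fit_sample(self, X, y):
        return self.func(X, y)

def drop_outliers(X, y, z_max=2.0):
    """Keep rows whose features lie within z_max standard deviations."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    mask = (z < z_max).all(axis=1)
    return X[mask], y[mask]

X = np.array([[0.0], [0.2], [-0.1], [0.1], [-0.2],
              [0.15], [-0.15], [0.05], [100.0]])  # last row is an outlier
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])
X_res, y_res = FunctionSampler(drop_outliers).fit_sample(X, y)
# the outlier row is removed from X and y together
```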
Eager to learn! Diving into the code right now!

Thanks for the tip!
Manuel

2017-12-19 14:18 GMT+01:00 Guillaume Lemaître <g.lemaitre58@gmail.com>:
At a glance, and perhaps not knowing imbalanced-learn well enough, I have some doubts that it will provide an immediate solution for all your needs. At the end of the day, Pipeline keeps its scope relatively tight, but it should not be so hard to implement something for your own needs if your case does not fit what Pipeline supports.

On 20 December 2017 at 00:34, Manuel Castejón Limas <manuel.castejon@gmail.com> wrote:
Just a quick ping to share that I've kept playing with this PipeGraph toy. The example below reflects its current state.

* As you can see, scikit-learn models can be used as steps in the nodes of the graph just by saying so, for example:

      'Gaussian_Mixture': {'step': GaussianMixture,
                           'kargs': {'n_components': 3},
                           'connections': {'X': ('Concatenate_Xy', 'Xy')},
                           'use_for': ['fit'],
                           },

* Custom steps need succinct declarations with very little code
* Graph description is nice to read, in my humble opinion
* Optional 'fit' and/or 'run' roles
* TO-DO: using the memory option to cache, and making it compatible with GridSearchCV. I was too busy playing with template methods in order to simplify its use.

I have convinced some nice colleagues at my university to team up with me and write some nice documentation.

Best wishes
Manolo

import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LinearRegression

# work in progress library: https://github.com/mcasl/PAELLA/
from pipeGraph import (PipeGraph, FirstStep, LastStep, CustomStep)
from paella import Paella

URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_no..."
data = pd.read_csv(URL, usecols=['V1', 'V2'])
X, y = data[['V1']], data[['V2']]

class CustomConcatenationStep(CustomStep):
    def _post_fit(self):
        self.output['Xy'] = pd.concat(self.input, axis=1)

class CustomCombinationStep(CustomStep):
    def _post_fit(self):
        self.output['classification'] = np.where(self.input['dominant'] < 0,
                                                 self.input['dominant'],
                                                 self.input['other'])

class CustomPaellaStep(CustomStep):
    def _pre_fit(self):
        self.sklearn_object = Paella(**self.kargs)

    def _fit(self):
        self.sklearn_object.fit(**self.input)

    def _post_fit(self):
        self.output['prediction'] = self.sklearn_object.transform(self.input['X'],
                                                                  self.input['y'])

graph_description = {
    'First': {'step': FirstStep,
              'connections': {'X': X, 'y': y},
              'use_for': ['fit', 'run'],
              },
    'Concatenate_Xy': {'step': CustomConcatenationStep,
                       'connections': {'df1': ('First', 'X'),
                                       'df2': ('First', 'y')},
                       'use_for': ['fit'],
                       },
    'Gaussian_Mixture': {'step': GaussianMixture,
                         'kargs': {'n_components': 3},
                         'connections': {'X': ('Concatenate_Xy', 'Xy')},
                         'use_for': ['fit'],
                         },
    'Dbscan': {'step': DBSCAN,
               'kargs': {'eps': 0.05},
               'connections': {'X': ('Concatenate_Xy', 'Xy')},
               'use_for': ['fit'],
               },
    'Combine_Clustering': {'step': CustomCombinationStep,
                           'connections': {'dominant': ('Dbscan', 'prediction'),
                                           'other': ('Gaussian_Mixture', 'prediction')},
                           'use_for': ['fit'],
                           },
    'Paella': {'step': CustomPaellaStep,
               'kargs': {'noise_label': -1,
                         'max_it': 20,
                         'regular_size': 400,
                         'minimum_size': 100,
                         'width_r': 0.99,
                         'n_neighbors': 5,
                         'power': 30,
                         'random_state': None},
               'connections': {'X': ('First', 'X'),
                               'y': ('First', 'y'),
                               'classification': ('Combine_Clustering', 'classification')},
               'use_for': ['fit'],
               },
    'Regressor': {'step': LinearRegression,
                  'kargs': {},
                  'connections': {'X': ('First', 'X'),
                                  'y': ('First', 'y'),
                                  'sample_weight': ('Paella', 'prediction')},
                  'use_for': ['fit', 'run'],
                  },
    'Last': {'step': LastStep,
             'connections': {'prediction': ('Regressor', 'prediction'),
                             },
             'use_for': ['fit', 'run'],
             },
}

pipegraph = PipeGraph(graph_description)

pipegraph.fit()
#Fitting: First
#Fitting: Concatenate_Xy
#Fitting: Dbscan
#Fitting: Gaussian_Mixture
#Fitting: Combine_Clustering
#Fitting: Paella
#0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 ,
#Fitting: Regressor
#Fitting: Last

pipegraph.run()
#Running: First
#Running: Regressor
#Running: Last

2017-12-19 13:44 GMT+01:00 Manuel Castejón Limas <manuel.castejon@gmail.com>:
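For readers curious how a graph description of this shape can be executed at all, here is a minimal toy sketch (invented for illustration; this is not PipeGraph's actual implementation): each 'connections' entry is either a literal input or a (step_name, output_name) reference, and steps are run lazily in dependency order.

```python
# Toy executor for a PipeGraph-style description.  Each step is a plain
# function returning a dict of named outputs; 'connections' maps each
# input name to a literal value or to a (step_name, output_name) tuple.

def run_graph(description):
    done, outputs = set(), {}

    def resolve(conn):
        if isinstance(conn, tuple):      # reference to another step's output
            step_name, out_name = conn
            run(step_name)               # make sure the dependency ran first
            return outputs[step_name][out_name]
        return conn                      # literal input

    def run(name):
        if name in done:
            return
        node = description[name]
        inputs = {k: resolve(v) for k, v in node['connections'].items()}
        outputs[name] = node['step'](**inputs)
        done.add(name)

    for name in description:
        run(name)
    return outputs

# Hypothetical three-node graph: load -> square -> total
description = {
    'load':   {'step': lambda: {'X': [1, 2, 3]}, 'connections': {}},
    'square': {'step': lambda X: {'X2': [v * v for v in X]},
               'connections': {'X': ('load', 'X')}},
    'total':  {'step': lambda X2: {'sum': sum(X2)},
               'connections': {'X2': ('square', 'X2')}},
}
results = run_graph(description)
# results['total']['sum'] == 14
```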
participants (4)
- Christos Aridas
- Guillaume Lemaître
- Joel Nothman
- Manuel Castejón Limas