Dear all,

After some playing with the concept, we have developed a module implementing the functionality of Pipeline in more general contexts, as first introduced in a former thread ( https://mail.python.org/pipermail/scikit-learn/2018-January/002158.html ).

In order to expand the possibilities of Pipeline to non-linearly sequential workflows, a graph-like structure has been deployed while keeping as much as possible the already known syntax we all love and honor:

    X = pd.DataFrame(dict(X=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
    y = 2 * X
    sc = MinMaxScaler()
    lm = LinearRegression()
    steps = [('scaler', sc), ('linear_model', lm)]
    connections = {'scaler': dict(X='X'),
                   'linear_model': dict(X=('scaler', 'predict'), y='y')}
    pgraph = PipeGraph(steps=steps,
                       connections=connections,
                       use_for_fit='all',
                       use_for_predict='all')

As you can see, the biggest difference for the final user is the dictionary describing the connections.

Another major contribution, aimed at developers wanting to extend scikit-learn, is a collection of adapters for scikit-learn models that gives them a common API irrespective of whether they originally implemented predict, transform, or fit_predict as an atomic operation without predict. These adapters accept as many positional or keyword parameters in their fit and predict methods as needed, through *pargs and **kwargs.

As general as PipeGraph is, it cannot work under the restrictions imposed by GridSearchCV on the input parameters, namely X and y, since PipeGraph can accept as many input signals as needed. Thus, an ad hoc GridSearchCV version is also needed, and we will provide a basic initial version later on.

We still need to write the documentation, and we will propose it as a contrib project in a few days.

Best wishes,
Manuel Castejón-Limas
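To make the connections dictionary concrete, here is a minimal pure-Python sketch of how such a dictionary could be resolved into the inputs of each step. This is illustrative only, not PipeGraph's actual internals; the helper `resolve_inputs` and the stand-in "models" are hypothetical. The convention follows the example above: a plain string names an external input to fit/predict, while a (step_name, output_name) tuple names the output of an earlier step.

```python
# Hypothetical sketch (not the actual PipeGraph code): resolving a
# connections dict into the keyword arguments of one step.

def resolve_inputs(connections, step, external, outputs):
    """Gather the keyword arguments for one step.

    external: dict of data passed to fit/predict, e.g. {'X': ..., 'y': ...}
    outputs:  dict keyed by (step_name, output_name) with results so far
    """
    kwargs = {}
    for internal_name, source in connections[step].items():
        if isinstance(source, tuple):          # output of an earlier step
            kwargs[internal_name] = outputs[source]
        else:                                  # external input, looked up by name
            kwargs[internal_name] = external[source]
    return kwargs

# Mimic the two-step example above with stand-in "models".
connections = {'scaler': {'X': 'X'},
               'linear_model': {'X': ('scaler', 'predict'), 'y': 'y'}}
external = {'X': [0, 1, 2], 'y': [0, 2, 4]}
outputs = {}

# 'scaler' stand-in: scale each value by 0.5
scaler_in = resolve_inputs(connections, 'scaler', external, outputs)
outputs[('scaler', 'predict')] = [v * 0.5 for v in scaler_in['X']]

# 'linear_model' stand-in: receives the scaler output as X, the external y as y
lm_in = resolve_inputs(connections, 'linear_model', external, outputs)
print(lm_in['X'])  # [0.0, 0.5, 1.0]
print(lm_in['y'])  # [0, 2, 4]
```

Each step only declares where its inputs come from; a runner walking the graph can then feed every step without the rigid X/y signature of a linear Pipeline.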
Cool! We have been talking for a while about how to pass other things around grid search and other meta-estimators. This injection approach looks pretty neat as a way to express it. Will need to mull on it.

On 8 Feb 2018 2:51 am, "Manuel Castejón Limas" <manuel.castejon@gmail.com> wrote:
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
Thanks Manuel, that looks pretty cool. Do you have a write-up about it? I don't entirely understand the connections setup.
Docs are coming soon. In the meantime, imagine a first step containing a TrainTestSplit class with a similar behaviour to train_test_split but capable of producing results by using fit and predict (this is a goodie). The inputs would be X, y, z, ..., and the outputs the same names plus the _train and _test suffixes.

A second step could be a MinMaxScaler taking only X_train. A third step, a linear model using the output from MinMaxScaler as X. This would be written:

    connections['split'] = {'A': 'X', 'B': 'y'}

meaning that the 'split' step will use the X and y from the fit or predict call, calling them A and B internally. If you use, for instance,

    my_pipegraph.fit(X=myX, y=myY)

this step will produce A_train with a piece of myX. You can use that later:

    connections['scaler'] = {'X': ('split', 'A_train')}

expressing that the output A_train from the split step will be used as input X for the scaler. The output from this step is called 'predict'. Finally, for the third step:

    connections['linear_model'] = {'X': ('scaler', 'predict'),
                                   'y': ('split', 'B_train')}

Notice that when we refer to an external input variable we don't use a tuple. So the syntax is something like:

    connections[step_label] = {internal_variable: (input_step, variable_there)}

Docs are coming anyway. Travis CI, Circle CI and AppVeyor have been successfully activated at GitHub.com/mcasl/PipeGraph. Sorry if you find typos, I am replying from my smartphone.

Best
Manuel

On 7 Feb 2018, 11:32 p.m., "Andreas Mueller" <t3kcit@gmail.com> wrote:
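Since each tuple in the connections dict names a dependency on an earlier step, a runner can derive the execution order by topological sort. A small illustrative sketch, assuming the three-step example above; the helper `step_order` is hypothetical, not part of PipeGraph's published API:

```python
# Hypothetical sketch: deriving an execution order from a connections dict.
# A tuple value means "depends on that step's output"; a string means an
# external input and creates no dependency.

def step_order(connections):
    """Return step names so every step comes after the steps it reads from."""
    deps = {step: {src[0] for src in sources.values() if isinstance(src, tuple)}
            for step, sources in connections.items()}
    order = []
    while deps:
        # Steps whose dependencies are all already scheduled are ready to run.
        ready = [s for s, d in deps.items() if d <= set(order)]
        if not ready:
            raise ValueError("cycle in connections")
        for s in sorted(ready):  # sort for a deterministic order
            order.append(s)
            del deps[s]
    return order

connections = {'split': {'A': 'X', 'B': 'y'},
               'scaler': {'X': ('split', 'A_train')},
               'linear_model': {'X': ('scaler', 'predict'),
                                'y': ('split', 'B_train')}}
print(step_order(connections))  # ['split', 'scaler', 'linear_model']
```

The same sort also catches accidental cycles in the connections, which a linear Pipeline can never have but a hand-written graph can.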
While we keep working on the docs and figures, here is a little example you all can already run:

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier
    from pipegraph.pipeGraph import PipeGraphClassifier, Concatenator

    iris = load_iris()
    X = iris.data
    y = iris.target

    scaler = MinMaxScaler()
    gaussian_nb = GaussianNB()
    svc = SVC()
    mlp = MLPClassifier()
    concatenator = Concatenator()

    steps = [('scaler', scaler),
             ('gaussian_nb', gaussian_nb),
             ('svc', svc),
             ('concat', concatenator),
             ('mlp', mlp)]

    connections = {
        'scaler': {'X': 'X'},
        'gaussian_nb': {'X': ('scaler', 'predict'), 'y': 'y'},
        'svc': {'X': ('scaler', 'predict'), 'y': 'y'},
        'concat': {'X1': ('scaler', 'predict'),
                   'X2': ('gaussian_nb', 'predict'),
                   'X3': ('svc', 'predict')},
        'mlp': {'X': ('concat', 'predict'), 'y': 'y'}
    }

    param_grid = {'svc__C': [0.1, 0.5, 1.0],
                  'mlp__hidden_layer_sizes': [(3,), (6,), (9,)],
                  'mlp__max_iter': [5000, 10000]}

    pgraph = PipeGraphClassifier(steps=steps, connections=connections)
    grid_search_classifier = GridSearchCV(estimator=pgraph,
                                          param_grid=param_grid,
                                          refit=True)
    grid_search_classifier.fit(X, y)
    y_pred = grid_search_classifier.predict(X)
    grid_search_classifier.best_estimator_.get_params()

'predict' is the default output name. One of these days we will simplify the notation to simply the name of the node in case of default output names.

Best wishes
Manuel

2018-02-07 23:29 GMT+01:00 Andreas Mueller <t3kcit@gmail.com>:
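In the example above, the Concatenator step glues the scaled features and the predictions of the two classifiers into one feature matrix for the MLP. A minimal pure-Python stand-in can show the idea; this `SimpleConcatenator` is illustrative only and is not the actual pipegraph Concatenator, which presumably operates on arrays or data frames:

```python
# Hypothetical stand-in for a Concatenator-style step: join its inputs
# column-wise so a downstream model sees a single feature matrix.

class SimpleConcatenator:
    def fit(self, **inputs):
        # Stateless step: nothing to learn, fit is a no-op.
        return self

    def predict(self, **inputs):
        # Treat each input as a list of rows; scalar entries become one column.
        columns = []
        for name in sorted(inputs):  # sort names for a deterministic order
            value = inputs[name]
            rows = [v if isinstance(v, (list, tuple)) else [v] for v in value]
            columns.append(rows)
        # Concatenate the i-th row of every input side by side.
        return [sum((col[i] for col in columns), [])
                for i in range(len(columns[0]))]

concat = SimpleConcatenator()
out = concat.predict(X1=[[0.1], [0.2]],   # e.g. a scaled feature column
                     X2=[0, 1],           # e.g. classes predicted by one model
                     X3=[1, 1])           # e.g. classes predicted by another
print(out)  # [[0.1, 0, 1], [0.2, 1, 1]]
```

This is essentially a stacking layout: the MLP at the end learns from the raw (scaled) features plus the other models' predictions.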
Very cool! Thanks for all the great work.

Andrew

J. Andrew Howe, PhD
www.andrewhowe.com
http://orcid.org/0000-0002-3553-1990
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me

On Wed, Feb 7, 2018 at 6:49 PM, Manuel Castejón Limas <manuel.castejon@gmail.com> wrote:
participants (4)
- Andreas Mueller
- Andrew Howe
- Joel Nothman
- Manuel Castejón Limas