[scikit-learn] Any plans on generalizing Pipeline and transformers?

Manuel Castejón Limas manuel.castejon at gmail.com
Tue Dec 26 05:47:47 EST 2017


I'm elaborating on the graph idea: a dictionary describes the graph, the
networkx package stores it and runs the steps in topological order, and some
wrappers adapt the scikit-learn models to the nodes.

I'm currently thinking of putting some more effort into a contrib project.

It could be something inspired by this example.

Manolo

#-------------------------------------------------



graph_description = {
              'First':
                  {'operation': First_Step,
                   'input': {'X':X, 'y':y}},

              'Concatenate_Xy':
                  {'operation': ConcatenateData_Step,
                   'input': [('First', 'X'),
                             ('First', 'y')]},

              'Gaussian_Mixture':
                  {'operation': Gaussian_Mixture_Step,
                   'input': [('Concatenate_Xy', 'data')]},

              'Dbscan':
                  {'operation': Dbscan_Step,
                   'input': [('Concatenate_Xy', 'data')]},

              'CombineClustering':
                  {'operation': CombineClustering_Step,
                   'input': [('Dbscan', 'classification'),
                             ('Gaussian_Mixture', 'classification')]},

              'Paella':
                  {'operation': Paella_Step,
                   'input': [('First', 'X'),
                             ('First', 'y'),
                             ('Concatenate_Xy', 'data'),
                             ('CombineClustering', 'classification')]},

              'Regressor':
                  {'operation': Regressor_Step,
                   'input': [('First', 'X'),
                             ('First', 'y'),
                             ('Paella', 'sample_weight')]},

              'Last':
                  {'operation': Last_Step,
                   'input': [('Regressor', 'regressor')]},

             }

#%%
import networkx as nx  # provides the DiGraph container and topological sort

def create_graph(description):
    cg = nx.DiGraph()
    cg.add_nodes_from(description)
    for current_name, info in description.items():
        # Attach the wrapped operation and its input specification to the node.
        current_node = cg.nodes[current_name]  # cg.node in networkx < 2.0
        current_node['operation'] = info['operation'](graph=cg,
                                                      node_name=current_name)
        current_node['input'] = info['input']
        if current_name != 'First':
            # One edge per distinct ascendant named in the input list.
            for ascendant in set(name for name, attribute in info['input']):
                cg.add_edge(ascendant, current_name)

    return cg
#%%
cg = create_graph(graph_description)

node_pos = {'First'            : ( 0, 0),
            'Concatenate_Xy'   : ( 2, 4),
            'Gaussian_Mixture' : ( 6, 8),
            'Dbscan'           : ( 6, 6),
            'CombineClustering': ( 8, 7),
            'Paella'           : (10, 2),
            'Regressor'        : (12, 0),
            'Last'             : (16, 0)
            }

nx.draw(cg, pos=node_pos, with_labels=True)

#%%

print("=========================")
for name in nx.topological_sort(cg):
    print("Running: ", name)
    cg.nodes[name]['operation'].fit()  # cg.node in networkx < 2.0

print("=========================")

########################





2017-12-22 12:09 GMT+01:00 Manuel Castejón Limas <manuel.castejon at gmail.com>:

> I'm currently thinking of a computational graph which can then be wrapped
> as a pipeline-like object ... I'll try to make a toy example solving my
> problem.
>
> On 20 Dec 2017 at 16:33, "Manuel Castejón Limas" <manuel.castejon at gmail.com>
> wrote:
>
>> Thank you all for your interest!
>>
>> In order to clarify the use case, allow me to synthesize the spirit of
>> what I'd like to put into the pipeline with this sequence of steps:
>>
>> #%%
>> import pandas as pd
>> import numpy as np
>> import matplotlib.pyplot as plt
>>
>> from sklearn.cluster import DBSCAN
>> from sklearn.mixture import GaussianMixture
>> from sklearn.model_selection import train_test_split
>>
>> np.random.seed(seed=42)
>>
>> """
>> Data preparation
>> """
>>
>> URL = ("https://raw.githubusercontent.com/mcasl/PAELLA/master/data/"
>>        "sin_60_percent_noise.csv")
>> data = pd.read_csv(URL, usecols=['V1','V2'])
>> X, y = data[['V1']], data[['V2']]
>>
>> (data_train, data_test,
>>  X_train, X_test,
>>  y_train, y_test) = train_test_split(data, X, y)
>>
>> """
>> Parameters setup
>> """
>>
>> dbscan__eps = 0.06
>>
>> mclust__n_components = 3
>>
>> paella__noise_label = -1
>> # Note: no trailing commas here; they would turn these values into tuples.
>> paella__max_it = 20
>> paella__regular_size = 400
>> paella__minimum_size = 100
>> paella__width_r = 0.99
>> paella__n_neighbors = 5
>> paella__power = 30
>> paella__random_state = None
>>
>> #%%
>> """
>> DBSCAN clustering to detect noise suspects (label == -1)
>> """
>>
>> dbscan_input = data_train
>>
>> dbscan_clustering = DBSCAN(eps = dbscan__eps)
>>
>> dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
>>
>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>> c=np.int64(dbscan_output == -1))
>>
>> #%%
>> """
>> GaussianMixture fitted with filtered data_train in order to help locate
>> the ellipsoids
>> but predict is applied to the whole data_train set.
>> """
>>
>> mclust_input = data_train[dbscan_output != -1]  # filter out noise suspects
>>
>> mclust_clustering = GaussianMixture(n_components = mclust__n_components)
>> mclust_clustering.fit(mclust_input)
>>
>> mclust_output = mclust_clustering.predict(data_train)
>>
>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>> c=mclust_output)
>>
>> #%%
>> """
>> mclust and dbscan results are combined.
>> """
>>
>> clustering_output = mclust_output.copy()
>> clustering_output[dbscan_output == -1] = -1
>>
>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>> c=clustering_output)
>>
>> #%%
>> """
>> Old-good Paella paper: https://link.springer.c
>> om/article/10.1023/B:DAMI.0000031630.50685.7c
>>
>> The Paella algorithm calculates sample_weight to be used by the final
>> step regressor
>> (Yes, it is an outlier detection algorithm but we are focusing now on
>> this interesting collateral result). I am currently aggressively changing
>> the code in order to make it fit somehow with the pipeline
>> """
>>
>> from paella import Paella
>>
>> # pd.concat has no inplace parameter, and clustering_output aligns with
>> # data_train rather than with the whole data set:
>> clustering_labels = pd.Series(clustering_output, index=data_train.index,
>>                               name='classification')
>> paella_input = pd.concat([data_train, clustering_labels], axis=1)
>>
>> paella_run = Paella(noise_label = paella__noise_label,
>>                     max_it = paella__max_it,
>>                     regular_size = paella__regular_size,
>>                     minimum_size = paella__minimum_size,
>>                     width_r = paella__width_r,
>>                     n_neighbors = paella__n_neighbors,
>>                     power = paella__power,
>>                     random_state = paella__random_state)
>>
>> paella_output = paella_run.fit_predict(paella_input, y_train)
>> # paella_output is a vector with sample_weight
>>
>> #%%
>> """
>> Here we fit a regressor using sample_weight=paella_output
>> """
>> from sklearn.linear_model import LinearRegression
>>
>> regressor_input = X_train
>> lm = LinearRegression()
>> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
>> regressor_output = lm.predict(X_train)
>>
>> #...
>>
>> In this example we can see that:
>> - A particular step might need results that were not necessarily produced
>> by the immediately preceding step.
>> - The X parameter is not sequentially transformed; sometimes we need to
>> go back to the output of an earlier step.
>> - y is sometimes the target and sometimes not: for the regressor it is
>> indeed, but for the Paella algorithm the prediction is expressed as a
>> vector of sample weights.
>>
>> All in all, the conclusion is that the chain of processes is not as
>> linear as imposed by the current API. I guess that all these difficulties
>> could be solved by:
>> - Passing a dictionary through the different steps, containing the
>> partial results that the following steps will need (see the sketch after
>> this list).
>> - As a Christmas gift :-) , a reference to the pipeline itself inserted
>> in that dictionary could provide access to the internal state of the
>> previous steps, should it be needed.
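>>
>> A minimal sketch of that dict-passing idea, assuming each step is a
>> callable that reads the entries it needs from the dictionary and returns
>> the entries it adds (DictPipeline and the step signature are assumptions,
>> not an existing scikit-learn API):
>>
>> class DictPipeline:
>>     def __init__(self, steps):
>>         self.steps = steps  # list of (name, callable) pairs
>>
>>     def fit(self, X, y):
>>         # Partial results travel in a dictionary; the pipeline itself is
>>         # included so a step can inspect the state of previous steps.
>>         results = {'X': X, 'y': y, 'pipeline': self}
>>         for name, step in self.steps:
>>             results.update(step(results))
>>         return results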
>>
>> Another interesting case study with similar needs would be a regressor
>> using a previous clustering step in order to fit one model per cluster.
>> In such a case, the clustering results would be needed during the
>> fitting, as sketched below.
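>>
>> A rough sketch of that idea, under an arbitrary choice of clusterer and
>> regressor (ClusteredRegressor is a hypothetical name, not an existing
>> estimator):
>>
>> import numpy as np
>> from sklearn.cluster import KMeans
>> from sklearn.linear_model import LinearRegression
>>
>> class ClusteredRegressor:
>>     def __init__(self, n_clusters=3):
>>         self.clusterer = KMeans(n_clusters=n_clusters)
>>         self.models = {}
>>
>>     def fit(self, X, y):
>>         # The clustering results are needed during fitting:
>>         # one regressor is fitted per cluster.
>>         labels = self.clusterer.fit_predict(X)
>>         for label in np.unique(labels):
>>             mask = labels == label
>>             self.models[label] = LinearRegression().fit(X[mask], y[mask])
>>         return self
>>
>>     def predict(self, X):
>>         labels = self.clusterer.predict(X)
>>         y_pred = np.empty(len(X), dtype=float)
>>         for label, model in self.models.items():
>>             mask = labels == label
>>             if mask.any():
>>                 y_pred[mask] = np.ravel(model.predict(X[mask]))
>>         return y_pred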
>>
>>
>> Thanks for your interest!
>> Manolo
>>
>>

