<p dir="ltr">This start to look as the dask project. Do you know it?</p>
<br><div class="gmail_quote"><div dir="ltr">Le mar. 26 déc. 2017 05:49, Manuel Castejón Limas <<a href="mailto:manuel.castejon@gmail.com">manuel.castejon@gmail.com</a>> a écrit :<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>I'm elaborating on the graph idea. A dictionary to describe the graph, the networkx package to support the graph and run it in topological order; and some wrappers for scikit-learn models. </div><div><br></div><div>I'm currently thinking on putting some more efforts into a contrib project.</div><div><br></div><div><div>It could be something inspired by this example.</div></div><div><br></div><div>Manolo</div><div><br></div><div>#-------------------------------------------------</div><div><br></div><div><br></div><div><br></div><div>graph_description = {</div><div> 'First': </div><div> {'operation': First_Step,</div><div> 'input': {'X':X, 'y':y}},</div><div> </div><div> 'Concatenate_Xy':</div><div> {'operation': ConcatenateData_Step,</div><div> 'input': [('First', 'X'),</div><div> ('First', 'y')]},</div><div> </div><div> 'Gaussian_Mixture':</div><div> {'operation': Gaussian_Mixture_Step,</div><div> 'input': [('Concatenate_Xy', 'data')]},</div><div> </div><div> 'Dbscan':</div><div> {'operation': Dbscan_Step,</div><div> 'input': [('Concatenate_Xy', 'data')]},</div><div> </div><div> 'CombineClustering':</div><div> {'operation': CombineClustering_Step,</div><div> 'input': [('Dbscan', 'classification'),</div><div> ('Gaussian_Mixture', 'classification')]},</div><div> </div><div> 'Paella':</div><div> {'operation': Paella_Step,</div><div> 'input': [('First', 'X'),</div><div> ('First', 'y'),</div><div> ('Concatenate_Xy', 'data'),</div><div> ('CombineClustering', 'classification')]},</div><div><br></div><div> 'Regressor':</div><div> {'operation': Regressor_Step,</div><div> 'input': [('First', 'X'),</div><div> ('First', 'y'),</div><div> ('Paella', 'sample_weight')]},</div><div> </div><div> 'Last':</div><div> {'operation': Last_Step,</div><div> 'input': [('Regressor', 'regressor')]},</div><div> </div><div> }</div><div><br></div><div>#%%</div><div>def create_graph(description):</div><div> cg = nx.DiGraph()</div><div> cg.add_nodes_from(description)</div><div> for current_name, info in description.items():</div><div> current_node = cg.node[current_name]</div><div> current_node['operation'] = info['operation']( graph = cg, node_name = current_name )</div><div> current_node['input'] = info['input']</div><div> if current_name != 'First':</div><div> for ascendant in set( name for name, attribute in info['input'] ):</div><div> cg.add_edge(ascendant, current_name)</div><div><br></div><div> return cg</div><div>#%%</div><div>cg = create_graph(graph_description)</div><div><br></div><div>node_pos = {'First' : ( 0, 0),</div><div> 'Concatenate_Xy' : ( 2, 4),</div><div> 'Gaussian_Mixture' : ( 6, 8), </div><div> 'Dbscan' : ( 6, 6),</div><div> 'CombineClustering': ( 8, 7),</div><div> 'Paella' : (10, 2),</div><div> 'Regressor' : (12, 0),</div><div> 'Last' : (16, 0)</div><div> }</div><div><br></div><div>nx.draw(cg, pos=node_pos, with_labels=True)</div><div><br></div><div>#%%</div><div><br></div><div>print("=========================")</div><div>for name in nx.topological_sort(cg):</div><div> print("Running: ", name)</div><div> cg.node[name]['operation'].fit()</div><div><br></div><div>print("=========================")</div><div><br></div><div>########################</div><div><br></div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">2017-12-22 12:09 GMT+01:00 Manuel Castejón Limas <span dir="ltr"><<a href="mailto:manuel.castejon@gmail.com" target="_blank">manuel.castejon@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto">I'm currently thinking on a computational graph which can then be wrapped as a pipeline like object ... I'll try yo make a toy example solving my problem. </div><div class="gmail_extra"><br><div class="gmail_quote">El 20 dic. 2017 16:33, "Manuel Castejón Limas" <<a href="mailto:manuel.castejon@gmail.com" target="_blank">manuel.castejon@gmail.com</a>> escribió:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div dir="auto"><div class="gmail_extra"><blockquote type="cite" style="font-size:12.8px"><div dir="ltr">Thank you all for your interest!</div><div dir="ltr"><div><br></div><div>In order to clarify the case allow me to try<span style="font-size:small"> to synthesize the spirit of what I'd like to put into the pipeline using this sequence of steps:</span></div><div><br></div><div><font face="monospace, monospace">#%%</font></div><div><div><font face="monospace, monospace">import pandas as pd<br></font></div><div><font face="monospace, monospace">import numpy as np</font></div><div><font face="monospace, monospace">import matplotlib.pyplot as plt</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">from sklearn.cluster import DBSCAN </font></div><div><font face="monospace, monospace">from sklearn.mixture import GaussianMixture</font></div><div><font face="monospace, monospace">from sklearn.model_selection import train_test_split</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">np.random.seed(seed=42)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace">Data preparation</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">URL = "<a href="https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv" target="_blank">https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv</a>"</font></div><div><font face="monospace, monospace">data = pd.read_csv(URL, usecols=['V1','V2'])</font></div><div><font face="monospace, monospace">X, y = data[['V1']], data[['V2']]</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">(data_train, data_test,</font></div><div><font face="monospace, monospace"> X_train, X_test,</font></div><div><font face="monospace, monospace"> y_train, y_test) = train_test_split(data, X, y)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace">Parameters setup</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">dbscan__eps = 0.06</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">mclust__n_components = 3</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">paella__noise_label = -1</font></div><div><font face="monospace, monospace">paella__max_it = 20,</font></div><div><font face="monospace, monospace">paella__regular_size = 400, </font></div><div><font face="monospace, monospace">paella__minimum_size = 100, </font></div><div><font face="monospace, monospace">paella__width_r = 0.99,</font></div><div><font face="monospace, monospace">paella__n_neighbors = 5,</font></div><div><font face="monospace, monospace">paella__power = 30,</font></div><div><font face="monospace, monospace">paella__random_state = None</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">#%%</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace">DBSCAN clustering to detect noise suspects (label == -1)</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">dbscan_input = data_train</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">dbscan_clustering = DBSCAN(eps = dbscan__eps)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">dbscan_output = dbscan_clustering.fit_predict(dbscan_input)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, c=np.int64(dbscan_output == -1))</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">#%%</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace">GaussianMixture fitted with filtered data_train in order to help locate the ellipsoids</font></div><div><font face="monospace, monospace">but predict is applied to the whole data_train set.</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">mclust_input = data_train[ dbscan_output != 1]</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">mclust_clustering = GaussianMixture(n_components = mclust__n_components)</font></div><div><font face="monospace, monospace">mclust_clustering.fit(mclust_input)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">mclust_output = mclust_clustering.predict(data_train)</font></div><div><font face="monospace, monospace"> </font></div><div><font face="monospace, monospace">plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, c=mclust_output)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">#%%</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace">mclust and dbscan results are combined.</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">clustering_output = mclust_output.copy() </font></div><div><font face="monospace, monospace">clustering_output[dbscan_output == -1] = -1</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, c=clustering_output)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">#%%</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace">Old-good Paella paper: <a href="https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c" target="_blank">https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c</a></font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">The Paella algorithm calculates sample_weight to be used by the final step regressor</font></div><div><font face="monospace, monospace">(Yes, it is an outlier detection algorithm but we are focusing now on this interesting collateral result). I am currently aggressively changing the code in order to make it fit somehow with the pipeline</font></div><div><span style="font-family:monospace,monospace;font-size:small">"""</span><br></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">from paella import Paella</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">paella_input = pd.concat([data, clustering_output], axis=1, inplace=False)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">paella_run = Paella(noise_label = paella__noise_label,</font></div><div><font face="monospace, monospace"> max_it = paella__max_it,</font></div><div><font face="monospace, monospace"> regular_size = paella__regular_size, </font></div><div><font face="monospace, monospace"> minimum_size = paella__minimum_size, </font></div><div><font face="monospace, monospace"> width_r = paella__width_r,</font></div><div><font face="monospace, monospace"> n_neighbors = paella__n_neighbors,</font></div><div><font face="monospace, monospace"> power = paella__power,</font></div><div><font face="monospace, monospace"> random_state = paella__random_state)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">paella_output = paella_run.fit_predict(paella_input, y_train)</font></div><div><font face="monospace, monospace"># paella_output is a vector with sample_weight</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">#%%</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace">Here we fit a regressor using sample_weight=paella_output</font></div><div><font face="monospace, monospace">"""</font></div><div><font face="monospace, monospace">from sklearn.linear_model import LinearRegression</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">regressor_input=X_train</font></div><div><font face="monospace, monospace">lm = LinearRegression()</font></div><div><font face="monospace, monospace">lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)</font></div><div><font face="monospace, monospace">regressor_output = lm.predict(X_train)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">#...</font></div><div><br></div></div><div>In this example we can see that:</div><div>- A particular step might need results produced not necessarily from the immediately previous step.</div><div>- The X parameter is not sequentially transformed. Sometimes we might need to skip to a previous step </div><div>- y sometimes is the target, sometimes is not. For the regressor it is indeed, but for the paella algorithm the prediction is expressed as a vector representing sample_weights.</div><div><br></div><div>All in all the conclusion is that the chain of processes is not as linear as imposed by the current API. I guess that all these difficulties could be solved by:</div><div>- Passing a dictionary through the different steps containing the partial results that the following steps will need.</div><div>- As a christmas gift :-) , a reference to the pipeline itself inserted in that dictionary could provide access to the internal status of the previous steps should it be needed.</div><div><br></div><div>Another interesting study case with similar needs would be a regressor using a previous clustering step in order to fit one model per cluster. In such case, the clustering results would be needed during the fitting.</div><div><br></div><div><br></div><div>Thanks for your interest!</div><div>Manolo<br></div></div></blockquote></div></div>
</div>
</blockquote></div></div>
</blockquote></div><br></div>
_______________________________________________<br>
scikit-learn mailing list<br>
<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a><br>
<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a><br>
</blockquote></div>