[scikit-learn] Any plans on generalizing Pipeline and transformers?

Manuel Castejón Limas manuel.castejon at gmail.com
Wed Dec 20 10:33:19 EST 2017


Thank you all for your interest!

To clarify the use case, allow me to try to distill the spirit of what I'd
like to put into a pipeline with the following sequence of steps:

#%%
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

np.random.seed(seed=42)

"""
Data preparation
"""

URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_
percent_noise.csv"
data = pd.read_csv(URL, usecols=['V1','V2'])
X, y = data[['V1']], data[['V2']]

(data_train, data_test,
 X_train, X_test,
 y_train, y_test) = train_test_split(data, X, y)

"""
Parameters setup
"""

dbscan__eps = 0.06

mclust__n_components = 3

paella__noise_label = -1
paella__max_it = 20
paella__regular_size = 400
paella__minimum_size = 100
paella__width_r = 0.99
paella__n_neighbors = 5
paella__power = 30
paella__random_state = None

#%%
"""
DBSCAN clustering to detect noise suspects (label == -1)
"""

dbscan_input = data_train

dbscan_clustering = DBSCAN(eps=dbscan__eps)

dbscan_output = dbscan_clustering.fit_predict(dbscan_input)

plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
c=np.int64(dbscan_output == -1))

#%%
"""
GaussianMixture is fitted with the filtered data_train in order to help
locate the ellipsoids, but predict is applied to the whole data_train set.
"""

mclust_input = data_train[dbscan_output != -1]  # keep only the non-noise samples

mclust_clustering = GaussianMixture(n_components=mclust__n_components)
mclust_clustering.fit(mclust_input)

mclust_output = mclust_clustering.predict(data_train)

plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
c=mclust_output)

#%%
"""
mclust and dbscan results are combined.
"""

clustering_output = mclust_output.copy()
clustering_output[dbscan_output == -1] = -1  # noise suspects keep the -1 label

plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
c=clustering_output)

#%%
"""
The good old Paella paper:
https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c

The Paella algorithm calculates the sample_weight values to be used by the
final-step regressor (yes, it is an outlier detection algorithm, but we are
focusing now on this interesting collateral result). I am currently
aggressively changing the code in order to make it fit somehow into the
pipeline.
"""

from paella import Paella

# clustering_output is a numpy array (and pd.concat takes no inplace
# argument), so wrap it in a Series aligned with data_train's index;
# the column name is arbitrary here.
clustering_labels = pd.Series(clustering_output, index=data_train.index,
                              name='classification')
paella_input = pd.concat([data_train, clustering_labels], axis=1)

paella_run = Paella(noise_label=paella__noise_label,
                    max_it=paella__max_it,
                    regular_size=paella__regular_size,
                    minimum_size=paella__minimum_size,
                    width_r=paella__width_r,
                    n_neighbors=paella__n_neighbors,
                    power=paella__power,
                    random_state=paella__random_state)

paella_output = paella_run.fit_predict(paella_input, y_train)
# paella_output is a vector of sample_weight values

#%%
"""
Here we fit a regressor using sample_weight=paella_output
"""
from sklearn.linear_model import LinearRegression

regressor_input = X_train
lm = LinearRegression()
lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
regressor_output = lm.predict(X_train)

#...

In this example we can see that:
- A particular step might need results produced by steps other than the
immediately preceding one.
- The X parameter is not sequentially transformed; sometimes we need to
reach back to the output of an earlier step.
- y is sometimes the target and sometimes not. For the regressor it is
indeed the target, but for the Paella algorithm the prediction is a vector
representing sample_weight values.

All in all, the conclusion is that the chain of processes is not as linear
as the current API imposes. I guess that all these difficulties could be
solved by:
- Passing a dictionary through the different steps containing the partial
results that the following steps will need (see the sketch below).
- As a Christmas gift :-) , a reference to the pipeline itself inserted in
that dictionary could provide access to the internal state of the previous
steps, should it be needed.
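
To make the idea concrete, here is a minimal sketch of such a
dictionary-passing pipeline. Everything in it is hypothetical and not part
of scikit-learn: the ResultsPipeline class, the convention that each step
is a function reading its inputs from a shared results dictionary and
returning new entries, and the 'pipeline' key giving steps access to the
pipeline itself.

class ResultsPipeline:
    """Hypothetical pipeline threading a dict of partial results
    through its steps instead of a single, sequentially transformed X."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, step_function) pairs

    def fit(self, **initial_results):
        results = dict(initial_results)
        results['pipeline'] = self  # the Christmas gift :-)
        for name, step in self.steps:
            # each step reads any earlier partial result it needs
            # and contributes new entries to the shared dictionary
            results.update(step(results))
        self.results_ = results
        return self

def dbscan_step(results):
    labels = DBSCAN(eps=dbscan__eps).fit_predict(results['data_train'])
    return {'dbscan_output': labels}

def mclust_step(results):
    data_train = results['data_train']
    mask = results['dbscan_output'] != -1  # reaching back to the DBSCAN step
    gm = GaussianMixture(n_components=mclust__n_components).fit(data_train[mask])
    return {'mclust_output': gm.predict(data_train)}

pipe = ResultsPipeline(steps=[('dbscan', dbscan_step),
                              ('mclust', mclust_step)])
pipe.fit(data_train=data_train)

Because every step sees the whole dictionary, any step can consume the
results of any earlier step, not just the immediately preceding one, and
the remaining Paella and regressor steps would fit the same mold.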

Another interesting case study with similar needs would be a regressor
that uses a previous clustering step in order to fit one model per cluster.
In that case, the clustering results would be needed again during fitting.
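
For what it's worth, here is a hedged sketch of that case study with plain
scikit-learn pieces. The ClusteredRegression class is hypothetical, and
KMeans stands in for whatever clustering step one would actually use; the
point is that the labels computed during fit are a partial result that the
fitting of the per-cluster regressors needs again.

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

class ClusteredRegression(BaseEstimator, RegressorMixin):
    """Hypothetical estimator: cluster X first, then fit one
    linear model per cluster and route predictions accordingly."""

    def __init__(self, n_clusters=3):
        self.n_clusters = n_clusters

    def fit(self, X, y):
        self.clusterer_ = KMeans(n_clusters=self.n_clusters).fit(X)
        labels = self.clusterer_.labels_  # partial result needed again below
        self.models_ = {k: LinearRegression().fit(X[labels == k],
                                                  np.ravel(y)[labels == k])
                        for k in range(self.n_clusters)}
        return self

    def predict(self, X):
        labels = self.clusterer_.predict(X)
        y_pred = np.empty(len(X))
        for k, model in self.models_.items():
            mask = labels == k
            if mask.any():
                y_pred[mask] = model.predict(X[mask])
        return y_pred

model = ClusteredRegression(n_clusters=3).fit(X_train, y_train)

Here the labels live inside the estimator; with the dictionary-passing
pipeline, the clustering could instead be its own step and
ClusteredRegression would simply fetch the labels from the shared results.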


Thanks for your interest!
Manolo