[scikit-learn] Any plans on generalizing Pipeline and transformers?

Joel Nothman joel.nothman at gmail.com
Tue Dec 19 19:09:37 EST 2017


At a glance, and perhaps not knowing imbalanced-learn well enough, I have
some doubts that it will provide an immediate solution for all your needs.

At the end of the day, the Pipeline keeps its scope relatively tight, but
it should not be so hard to implement something for your own needs if your
case does not fit what Pipeline supports.
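
A minimal sketch of what such a custom estimator could look like, assuming the goal described in the quoted message below: cluster X and y together, turn the result into sample weights, and pass those weights to the final regressor. The class name ClusterWeightedRegressor and the distance-based weighting rule are hypothetical and chosen only for illustration; they are not part of scikit-learn or of Pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression


class ClusterWeightedRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical meta-estimator: cluster [X, y], derive weights, fit a regressor."""

    def __init__(self, clusterer=None, regressor=None):
        self.clusterer = clusterer
        self.regressor = regressor

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Fall back to simple defaults when no components are supplied.
        clusterer = (KMeans(n_clusters=2, n_init=10, random_state=0)
                     if self.clusterer is None else clone(self.clusterer))
        regressor = (LinearRegression()
                     if self.regressor is None else clone(self.regressor))
        # Cluster X and y concatenated; y is only needed at fit time.
        Xy = np.column_stack([X, y])
        self.clusterer_ = clusterer.fit(Xy)
        # Distance of each sample to its nearest cluster centre.
        dist = self.clusterer_.transform(Xy).min(axis=1)
        # Illustrative rule (an assumption): far-away points get smaller weights.
        self.sample_weight_ = 1.0 / (1.0 + dist)
        self.regressor_ = regressor.fit(X, y, sample_weight=self.sample_weight_)
        return self

    def predict(self, X):
        # Prediction uses X alone, so no y is required for test data.
        return self.regressor_.predict(np.asarray(X, dtype=float))


# Usage on synthetic data.
rng = np.random.RandomState(0)
X_train = rng.normal(size=(60, 2))
y_train = X_train @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=60)
model = ClusterWeightedRegressor().fit(X_train, y_train)
pred = model.predict(X_train[:5])
```

Because the clustering and the weights live inside fit, nothing extra has to be pasted onto X, and predict never needs y. The clusterer passed in is assumed to expose a transform method returning distances to the cluster centres, as KMeans does.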

On 20 December 2017 at 00:34, Manuel Castejón Limas <
manuel.castejon at gmail.com> wrote:

> Eager to learn! Diving into the code right now!
>
> Thanks for the tip!
> Manuel
>
> 2017-12-19 14:18 GMT+01:00 Guillaume Lemaître <g.lemaitre58 at gmail.com>:
>
>> I think that you could use imbalanced-learn to address the issue that
>> you have with y.
>> You should be able to wrap your clustering inside the FunctionSampler
>> (https://github.com/scikit-learn-contrib/imbalanced-learn/pull/342 - we
>> are about to merge it).
>>
>> On 19 December 2017 at 13:44, Manuel Castejón Limas <
>> manuel.castejon at gmail.com> wrote:
>>
>>> Dear all,
>>>
>>> Kudos to scikit-learn! Having said that, Pipeline is killing me, since
>>> it cannot transform anything other than X.
>>>
>>> My current case study would need:
>>> - Transformers being able to handle both X and y, e.g. clustering X and
>>> y concatenated
>>> - Pipeline being able to change other params, e.g. sample_weight
>>>
>>> Currently, I'm augmenting X at every step with the extra information,
>>> which seems to work OK for my_pipe.fit_transform(X_train, y_train) but
>>> breaks on my_pipe.transform(X_test) for lack of the y parameter. I could
>>> subclass Pipeline to allow the y parameter, which is not ideal but is an
>>> option. The tricky part is having to adapt every regressor at the end of
>>> the pipeline so that it splits the extra information from the raw data in
>>> X, and not being able to generate more than one output from each
>>> preprocessing step.
>>>
>>> My current research involves clustering the data and using the resulting
>>> classification along with X to predict outliers, which yields
>>> sample_weight information that I would love to use in the final regressor.
>>> Currently there seems to be no option other than appending that
>>> information to X.
>>>
>>> All in all, I'm stuck with this API limitation and I would love to learn
>>> some tricks from you if you could enlighten me.
>>>
>>> Thanks in advance!
>>>
>>> Manuel Castejón-Limas
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
>>