[scikit-learn] [Vote] SLEP006: Routing sample-aligned metadata

Wed Feb 17 08:08:43 EST 2021

With thanks to Alex, Adrin and Christian, we have a proposal to implement
what we used to call "sample props" that should be expressive enough for us
to resolve tens of issues and PRs, but will be largely unobtrusive for most
current users.

Core developers, please cast your vote in this PR
<https://github.com/scikit-learn/enhancement_proposals/pull/52> after
considering the proposal here
<https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep006/proposal.html>,
which has a partial implementation in #16079
<https://github.com/scikit-learn/scikit-learn/pull/16079>.

In brief, the problem we are trying to solve:

Scikit-learn has limited support for information pertaining to each sample
(henceforth “sample properties”) to be passed through an estimation
pipeline. The user can, for instance, pass fit parameters to all members of
a FeatureUnion, or to a specified member of a Pipeline using dunder (__)
prefixing:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> pipe = Pipeline([('clf', LogisticRegression())])
>>> pipe.fit([[1, 2], [3, 4]], [5, 6],
...          clf__sample_weight=[.5, .7])

Several other meta-estimators, such as GridSearchCV, support forwarding
these fit parameters to their base estimator when fitting. Yet a number of
important use cases are currently not supported.

Features we currently do not support and wish to include:

   - passing sample properties (e.g. sample_weight
   <https://scikit-learn.org/stable/glossary.html#term-sample_weight>) to a
   scorer used in cross-validation
   - passing sample properties (e.g. groups
   <https://scikit-learn.org/stable/glossary.html#term-groups>) to a CV
   splitter in nested cross-validation
   - passing sample properties (e.g. sample_weight
   <https://scikit-learn.org/stable/glossary.html#term-sample_weight>) to
   some scorers and not others in a multi-metric cross-validation setup

Solution: Each consumer requests

A meta-estimator passes along to its children only what they request. A
meta-estimator needs to request, on behalf of its children, any metadata
that descendant consumers request.
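
To make the routing rule concrete, here is a rough toy sketch in plain
Python. It is not the implementation in #16079: ToyConsumer,
ToyMetaEstimator and the `requested` attribute are hypothetical names made
up for illustration, and silently dropping unrequested metadata is an
assumption of the sketch, not something the proposal prescribes.

```python
class ToyConsumer:
    """Hypothetical consumer: declares which metadata its fit() wants."""

    def __init__(self, requested=()):
        self.requested = set(requested)
        self.received = None

    def fit(self, X, y, **metadata):
        # Record what was actually routed here, for inspection.
        self.received = dict(metadata)
        return self


class ToyMetaEstimator:
    """Hypothetical meta-estimator that routes metadata to its children."""

    def __init__(self, children):
        self.children = children

    @property
    def requested(self):
        # Request, on behalf of the children, anything a descendant requests.
        names = set()
        for child in self.children:
            names |= child.requested
        return names

    def fit(self, X, y, **metadata):
        for child in self.children:
            # Pass along only the metadata this child asked for.
            # (Assumption: unrequested metadata is dropped silently.)
            routed = {k: v for k, v in metadata.items() if k in child.requested}
            child.fit(X, y, **routed)
        return self


weighted = ToyConsumer(requested={"sample_weight"})
unweighted = ToyConsumer()
inner = ToyMetaEstimator([weighted, unweighted])
outer = ToyMetaEstimator([inner])
outer.fit([[1, 2], [3, 4]], [5, 6], sample_weight=[0.5, 0.7], groups=[0, 1])
```

Here only `weighted` ends up receiving sample_weight; `unweighted` receives
nothing, and `groups` goes nowhere because no consumer requested it.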

Each object that could receive metadata should have a method called
get_metadata_request() which returns a dict that specifies which metadata
is consumed by each of its methods (keys of this dictionary are therefore
method names, e.g. fit
<https://scikit-learn.org/stable/glossary.html#term-fit>, transform
<https://scikit-learn.org/stable/glossary.html#term-transform> etc.).
Estimators supporting weighted fitting may return {} by default, but have a
method called request_sample_weight which allows the user to specify the
requested sample_weight
<https://scikit-learn.org/stable/glossary.html#term-sample_weight> in each
of its methods. make_scorer accepts request_metadata as a keyword parameter
through which the user can specify what metadata is requested.
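
As a minimal sketch of what such a request could look like, assuming a
hypothetical ToyWeightedEstimator class: the method names
get_metadata_request and request_sample_weight come from the text above,
but the `fit=` argument and the exact layout of the inner dict are my
assumptions, so please check the SLEP for the real shapes.

```python
class ToyWeightedEstimator:
    """Hypothetical estimator; not scikit-learn's implementation."""

    def __init__(self):
        # Maps method name -> {metadata name -> key the user will pass}.
        self._requests = {}

    def request_sample_weight(self, fit=None):
        # Assumption: `fit` names the user-supplied key that should be
        # routed to fit() as sample_weight.
        if fit is not None:
            self._requests.setdefault("fit", {})["sample_weight"] = fit
        return self

    def get_metadata_request(self):
        # Keys of the returned dict are method names, e.g. "fit".
        return {method: dict(params) for method, params in self._requests.items()}


est = ToyWeightedEstimator()
default_request = est.get_metadata_request()   # nothing requested by default
est.request_sample_weight(fit="my_weights")
explicit_request = est.get_metadata_request()
```

The point of the design is that nothing is requested until the user opts
in, which is what keeps the change unobtrusive for current users.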

Regards,

Joel