[scikit-learn] [Vote] SLEP006: Routing sample-aligned metadata

Joel Nothman joel.nothman at gmail.com
Sat Feb 27 04:42:35 EST 2021


Hi all,

Just a reminder that we are ten days into the month-long voting period,
with one vote on record. Core devs, please find time to consider this
proposal. Thanks to Andy's suggestion, we have added an example of the new
API to the opening section:


This SLEP proposes an API where users can request certain metadata to be
passed to its consumer by the meta-estimator it is wrapped in.

The following example illustrates the new request_metadata parameter for
making scorers, the request_sample_weight estimator method, the metadata
parameter replacing fit_params in cross_validate, and the automatic passing
of groups <https://scikit-learn.org/stable/glossary.html#term-groups> to
GroupKFold to enable nested grouped cross validation. Here, the user
requests that the sample_weight
<https://scikit-learn.org/stable/glossary.html#term-sample_weight> metadata
key should be passed to a customised accuracy scorer (although a predefined
‘weighted_accuracy’ scorer could be introduced), and to the
LogisticRegressionCV. GroupKFold requests groups
<https://scikit-learn.org/stable/glossary.html#term-groups> by default.

>>> from sklearn.metrics import accuracy_score, make_scorer
>>> from sklearn.model_selection import cross_validate, GroupKFold
>>> from sklearn.linear_model import LogisticRegressionCV
>>> weighted_acc = make_scorer(accuracy_score,
...                            request_metadata=['sample_weight'])
>>> group_cv = GroupKFold()
>>> lr = LogisticRegressionCV(
...    cv=group_cv,
...    scoring=weighted_acc,
... ).request_sample_weight(fit=True)
>>> cross_validate(lr, X, y, cv=group_cv,
...                metadata={'sample_weight': my_weights,
...                          'groups': my_groups},
...                scoring=weighted_acc)


On Thu, 18 Feb 2021 at 00:08, Joel Nothman <joel.nothman at gmail.com> wrote:

> With thanks to Alex, Adrin and Christian, we have a proposal to implement
> what we used to call "sample props" that should be expressive enough for us
> to resolve tens of issues and PRs, but will be largely unobtrusive for most
> current users.
>
> Core developers, please cast your vote in this PR
> <https://github.com/scikit-learn/enhancement_proposals/pull/52> after
> considering the proposal here
> <https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep006/proposal.html>,
> which has a partial implementation in #16079
> <https://github.com/scikit-learn/scikit-learn/pull/16079>.
>
>
> In brief, the problem we are trying to solve:
>
> Scikit-learn has limited support for information pertaining to each sample
> (henceforth “sample properties”) to be passed through an estimation
> pipeline. The user can, for instance, pass fit parameters to all members of
> a FeatureUnion, or to a specified member of a Pipeline using dunder (__)
> prefixing:
>
> >>> from sklearn.pipeline import Pipeline
> >>> from sklearn.linear_model import LogisticRegression
> >>> pipe = Pipeline([('clf', LogisticRegression())])
> >>> pipe.fit([[1, 2], [3, 4]], [5, 6],
> ...          clf__sample_weight=[.5, .7])
>
> Several other meta-estimators, such as GridSearchCV, support forwarding
> these fit parameters to their base estimator when fitting. Yet a number of
> important use cases are currently not supported.
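
The dunder convention described above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's actual implementation: split_fit_params is a hypothetical helper, and the step names are made up for the example.

```python
from collections import defaultdict


def split_fit_params(step_names, fit_params):
    """Group ``step__param`` keyword arguments by pipeline step name.

    Hypothetical helper for illustration only.
    """
    per_step = defaultdict(dict)
    for key, value in fit_params.items():
        # Split on the first double underscore: "clf__sample_weight"
        # becomes step "clf" and parameter "sample_weight".
        step, sep, param = key.partition("__")
        if not sep or step not in step_names:
            raise ValueError(f"Cannot route fit parameter {key!r}")
        per_step[step][param] = value
    return dict(per_step)


routed = split_fit_params(["scale", "clf"],
                          {"clf__sample_weight": [0.5, 0.7]})
print(routed)
```

The caller has to know the name of the step that consumes each parameter, which is one reason this scheme does not extend naturally to scorers or CV splitters.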
>
> Features we currently do not support and wish to include:
>
>    - passing sample properties (e.g. sample_weight
>    <https://scikit-learn.org/stable/glossary.html#term-sample_weight>) to
>    a scorer used in cross-validation
>    - passing sample properties (e.g. groups
>    <https://scikit-learn.org/stable/glossary.html#term-groups>) to a CV
>    splitter in nested cross validation
>    - passing sample properties (e.g. sample_weight
>    <https://scikit-learn.org/stable/glossary.html#term-sample_weight>) to
>    some scorers and not others in a multi-metric cross-validation setup
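
As a rough illustration of the third use case, the sketch below passes sample_weight to one scorer and not to another. It is plain Python under assumed names: the scorers registry and the score_all function are hypothetical and are not part of scikit-learn or of the proposal.

```python
def accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of correct predictions, optionally weighted per sample."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    hits = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return hits / sum(sample_weight)


# Hypothetical registry: each scorer lists the metadata keys it requests.
scorers = {
    "accuracy": (accuracy, []),
    "weighted_accuracy": (accuracy, ["sample_weight"]),
}


def score_all(y_true, y_pred, metadata):
    """Call every scorer, passing each one only the metadata it requested."""
    results = {}
    for name, (func, requested) in scorers.items():
        kwargs = {key: metadata[key] for key in requested}
        results[name] = func(y_true, y_pred, **kwargs)
    return results


scores = score_all(
    y_true=[1, 0, 1, 1],
    y_pred=[1, 0, 0, 1],
    metadata={"sample_weight": [1, 1, 2, 1], "groups": [0, 0, 1, 1]},
)
print(scores)
```

The misclassified third sample carries weight 2, so the weighted score comes out lower than the unweighted one, while the unused groups key is ignored by both scorers.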
>
> Solution: Each consumer requests
>
> A meta-estimator passes along to its children only what they request. A
> meta-estimator needs to request, on behalf of its children, any metadata
> that descendant consumers request.
>
> Each object that could receive metadata should have a method called
> get_metadata_request() which returns a dict that specifies which metadata
> is consumed by each of its methods (keys of this dictionary are therefore
> method names, e.g. fit
> <https://scikit-learn.org/stable/glossary.html#term-fit>, transform
> <https://scikit-learn.org/stable/glossary.html#term-transform> etc.).
> Estimators supporting weighted fitting may return {} by default, but have
> a method called request_sample_weight which allows the user to specify
> the requested sample_weight
> <https://scikit-learn.org/stable/glossary.html#term-sample_weight> in
> each of its methods. make_scorer accepts request_metadata as a keyword
> parameter through which the user can specify what metadata is requested.
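
The request protocol described above might look roughly like the following toy sketch. The method names get_metadata_request and request_sample_weight come from the proposal; the classes ToyWeightedMean and ToyMetaEstimator, the list-of-keys value format, and the metadata argument to fit are assumptions made for illustration and may differ from the partial implementation.

```python
class ToyWeightedMean:
    """Hypothetical consumer: fits a (weighted) mean of y."""

    def __init__(self):
        self._request = {}  # no metadata requested by default

    def request_sample_weight(self, fit=True):
        # Record the request and return self so calls can be chained.
        if fit:
            self._request["fit"] = ["sample_weight"]
        else:
            self._request.pop("fit", None)
        return self

    def get_metadata_request(self):
        # Keys are method names; values list the metadata each one consumes.
        return {method: list(keys) for method, keys in self._request.items()}

    def fit(self, X, y, sample_weight=None):
        if sample_weight is None:
            sample_weight = [1.0] * len(y)
        total = sum(sample_weight)
        self.mean_ = sum(w * t for w, t in zip(sample_weight, y)) / total
        return self


class ToyMetaEstimator:
    """Hypothetical meta-estimator that forwards only requested metadata."""

    def __init__(self, child):
        self.child = child

    def get_metadata_request(self):
        # Request, on the child's behalf, whatever the child requests.
        return self.child.get_metadata_request()

    def fit(self, X, y, metadata=None):
        metadata = metadata or {}
        requested = self.child.get_metadata_request().get("fit", [])
        # A requested key that is missing from metadata raises KeyError.
        kwargs = {key: metadata[key] for key in requested}
        self.child.fit(X, y, **kwargs)
        return self


X = [[0], [1], [2]]
y = [1.0, 2.0, 6.0]
metadata = {"sample_weight": [1.0, 1.0, 2.0], "groups": [0, 0, 1]}

plain = ToyMetaEstimator(ToyWeightedMean()).fit(X, y, metadata=metadata)
weighted = ToyMetaEstimator(
    ToyWeightedMean().request_sample_weight(fit=True)
).fit(X, y, metadata=metadata)
print(plain.child.mean_, weighted.child.mean_)
```

Both meta-estimators receive the same metadata dict, but only the child that requested sample_weight has it forwarded to fit; the unrequested groups key is never passed to either child.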
>
> Regards,
>
> Joel
>