[scikit-learn] Problem with check_estimator for distance metric learning
wdevazel
william.de-vazelhes at inria.fr
Thu Mar 29 07:15:42 EDT 2018
Hi all,
We are currently trying to add to the metric-learn package
(https://github.com/metric-learn/metric-learn) a feature that would
make it possible to cross-validate Weakly Supervised Metric Learners
using scikit-learn's cross-validation routines.
Distance Metric Learning algorithms learn distance metrics between
samples, using some supervised information about similarity between
training samples. Some Metric Learning algorithms are weakly supervised
(Weakly Supervised Metric Learners), i.e. they do not train on labeled
samples, but for instance on labeled *pairs* of samples (the label
telling whether the pair is of similar or dissimilar samples).
To cross-validate these algorithms, we build a train set and a test set
by splitting the pairs. Indeed, one use case of metric learning is to
classify unseen pairs as similar or dissimilar at test time (those pairs
can involve samples already seen during training). For that, we made a
dataset representation that makes it easy to slice on pairs of samples:
we mock a 3D array containing pairs of samples, which would be of shape
(n_constraints, 2, n_features) (each row is a pair of samples). We do
so with an object that we called ConstrainedDataset, which is more
memory efficient than the array it mimics (because in that array a
sample appearing in several pairs would be duplicated).
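To make the idea concrete, here is a minimal sketch of such a pair container. It is not the actual constraints.ConstrainedDataset code from the PR: the class name PairsDataset, its attributes, and the toarray method are hypothetical and chosen only for illustration. It assumes pairs are stored as rows of indices into a shared sample matrix, and that pair labels are 1 (similar) or -1 (dissimilar).

```python
import numpy as np


class PairsDataset:
    """Hypothetical stand-in for ConstrainedDataset (not the metric-learn code).

    Stores each sample once in `X` and represents pairs as rows of indices
    into `X`, so it behaves like an array of shape
    (n_constraints, 2, n_features) without duplicating samples.
    """

    def __init__(self, X, pair_indices):
        self.X = np.asarray(X)
        self.pair_indices = np.asarray(pair_indices, dtype=int)
        if self.pair_indices.ndim != 2 or self.pair_indices.shape[1] != 2:
            raise ValueError("pair_indices must have shape (n_constraints, 2)")

    @property
    def shape(self):
        # Shape of the 3D array this object mimics.
        return (len(self.pair_indices), 2, self.X.shape[1])

    def __len__(self):
        return len(self.pair_indices)

    def __getitem__(self, item):
        # Indexing selects pairs; the sample matrix is shared, not copied.
        idx = self.pair_indices[item]
        return PairsDataset(self.X, np.atleast_2d(idx))

    def toarray(self):
        # Materialize the (n_constraints, 2, n_features) array on demand.
        return self.X[self.pair_indices]


X = np.arange(12, dtype=float).reshape(4, 3)      # 4 samples, 3 features
pairs = PairsDataset(X, [[0, 1], [1, 2], [2, 3], [0, 3]])
pair_labels = np.array([1, 1, -1, -1])            # assumed label convention

# Split on pairs with index arrays, as a cross-validation splitter would.
train_idx, test_idx = np.array([0, 1, 2]), np.array([3])
pairs_train, pairs_test = pairs[train_idx], pairs[test_idx]
```

Because the object is indexed by arrays of pair indices, the train/test index arrays produced by a splitter such as scikit-learn's KFold could in principle be applied to it directly. Sample 0 appears in both the train and the test pairs here, which matches the remark that test pairs can involve already seen samples.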
Now we have a problem when running scikit-learn's *check_estimator* on
these algorithms, because it launches a series of tests where the
estimator takes as input regular arrays, whereas Weakly Supervised
Metric Learners always learn on ConstrainedDatasets (or more generally
on pairs, or tuples for some other algorithms).
We therefore thought of two main possibilities (that could be combined)
to solve this problem:
- keeping as many as possible of the tests yielded by check_estimator
that pass in our setting, and modifying the others by replacing array
inputs with ConstrainedDatasets
- wrapping a Weakly Supervised Metric Learner into a
MockSklearnEstimator that would transform any array as input into a
ConstrainedDataset before passing it to the underlying Weakly Supervised
Metric Learner
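The second option could look roughly like the sketch below. Everything in it is an assumption made for illustration: ToyPairsLearner is a made-up learner (not a metric-learn algorithm), the rule used to turn a regular (X, y) input into pairs (each sample paired with the next one, labeled 1 if their classes match and -1 otherwise) is arbitrary, and a plain 3D array stands in for the ConstrainedDataset.

```python
import numpy as np


class ToyPairsLearner:
    """Hypothetical weakly supervised learner: fits on pairs, not samples."""

    def fit(self, pairs, pair_labels):
        # pairs: array of shape (n_constraints, 2, n_features)
        diffs = pairs[:, 0, :] - pairs[:, 1, :]
        similar = pair_labels == 1
        if similar.any():
            # Toy rule: features that vary little within similar pairs
            # receive a large weight.
            spread = (diffs[similar] ** 2).mean(axis=0)
        else:
            spread = np.ones(diffs.shape[1])
        self.weights_ = 1.0 / (spread + 1e-12)
        return self


class MockSklearnEstimator:
    """Wrapper that accepts regular arrays and feeds pairs to the learner."""

    def __init__(self, learner):
        self.learner = learner

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        # Assumed pairing rule: pair each sample with the next one.
        first = np.arange(len(X) - 1)
        second = first + 1
        pairs = np.stack([X[first], X[second]], axis=1)
        pair_labels = np.where(y[first] == y[second], 1, -1)
        self.learner.fit(pairs, pair_labels)
        return self


X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])
est = MockSklearnEstimator(ToyPairsLearner()).fit(X, y)
```

Note that most of the logic exercised by a regular fit(X, y) call lives in the wrapper's pair-construction step rather than in the wrapped learner.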
However, these options are not really satisfying: the first one will
create a lot of code, and afterwards one cannot see at a glance whether
the estimator passes scikit-learn's check_estimator; the second adds so
much wrapping that we are not even really testing the Weakly Supervised
Metric Learner.
For more information, see this PR where the new feature is being
implemented, including the constraints.ConstrainedDataset object, as
well as a comment on what is problematic when using scikit-learn's
check_estimator:
https://github.com/metric-learn/metric-learn/pull/85#issuecomment-375659820
Any advice about how to design the weakly supervised algorithms, the
data structure containing the pairs of samples, or how to still use
scikit-learn's check_estimator would be appreciated!
Thanks!
Best regards,
William