[scikit-learn] Problem with check_estimator for distance metric learning

Thu Mar 29 07:15:42 EDT 2018

Hi all,

We are currently trying to add a feature to the metric-learn package 
(https://github.com/metric-learn/metric-learn) that would make it 
possible to cross-validate Weakly Supervised Metric Learners using 
scikit-learn's cross-validation routines.

Distance Metric Learning algorithms learn distance metrics between 
samples, using some supervised information about similarity between 
training samples. Some Metric Learning algorithms are weakly supervised 
(Weakly Supervised Metric Learners), i.e. they do not train on labeled 
samples but, for instance, on labeled *pairs* of samples (the label 
tells whether the two samples in the pair are similar or dissimilar).

To cross-validate these algorithms, we build the train and test sets by 
splitting on the pairs. Indeed, one use case of metric learning is to 
classify unseen pairs as similar or dissimilar at test time (those pairs 
can involve already seen samples). For that, we made a dataset 
representation that makes it easy to slice on pairs of samples: we mock 
a 3D array of pairs of samples, of shape 
(n_constraints, 2, n_features) (each row is a pair of samples). We do 
so with an object that we called ConstrainedDataset, which is more 
memory efficient than the array itself (in which each sample would be 
duplicated once for every pair it appears in).
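
As a rough illustration of that idea (this is not the actual 
ConstrainedDataset from the PR; the class name PairsView and its 
attributes are hypothetical), such an object could keep the samples 
once, plus an array of index pairs, and only slice the index pairs:

```python
import numpy as np


class PairsView:
    """Hypothetical stand-in for a ConstrainedDataset-like object.

    Stores the samples once, plus an integer array of shape
    (n_constraints, 2) saying which samples form each pair.
    """

    def __init__(self, X, pair_indices):
        self.X = np.asarray(X)
        self.pair_indices = np.atleast_2d(np.asarray(pair_indices))

    @property
    def shape(self):
        # Shape of the 3D array that this object mocks.
        n_constraints, pair_size = self.pair_indices.shape
        return (n_constraints, pair_size, self.X.shape[1])

    def __len__(self):
        return self.pair_indices.shape[0]

    def __getitem__(self, item):
        # Indexing selects pairs, never single samples; X is shared.
        return PairsView(self.X, self.pair_indices[item])

    def toarray(self):
        # Materialize the (n_constraints, 2, n_features) array on demand.
        return self.X[self.pair_indices]


X = np.arange(12.0).reshape(4, 3)
pairs = PairsView(X, [[0, 1], [1, 2], [2, 3], [0, 3]])
# Index arrays like this one are what cross-validation splitters yield.
train = pairs[np.array([0, 2])]
```

Here train still refers to the same X as pairs; only the small array 
of index pairs is copied when slicing.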

Now we have a problem when running scikit-learn's *check_estimator* on 
these algorithms, because it launches a series of tests where the 
estimator takes as input regular arrays, whereas Weakly Supervised 
Metric Learners always learn on ConstrainedDatasets (or more generally 
on pairs, or tuples for some other algorithms).

We therefore thought of two main possibilities (that could be combined) 
to solve this problem:
- keeping as many of the tests yielded by check_estimator as pass in 
our setting, and modifying the others by replacing array inputs 
with ConstrainedDatasets
- wrapping a Weakly Supervised Metric Learner in a 
MockSklearnEstimator that would transform any input array into a 
ConstrainedDataset before passing it to the underlying Weakly Supervised 
Metric Learner
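
To make the second possibility concrete, here is a minimal sketch 
under stated assumptions. The names ArrayToPairsWrapper, 
pairs_from_labels and RecordingPairLearner are hypothetical, and the 
rule used to build pairs (each sample paired with the next one, 
labeled +1 if their classes match and -1 otherwise) is only one 
arbitrary choice among many. It also builds a plain 3D array rather 
than a ConstrainedDataset, to keep the example short:

```python
import numpy as np


def pairs_from_labels(X, y):
    """Pair each sample with the next one (an arbitrary, assumed rule).

    Returns an array of shape (n_samples - 1, 2, n_features) and pair
    labels: +1 if both samples share a class, -1 otherwise.
    """
    X, y = np.asarray(X), np.asarray(y)
    idx = np.arange(len(X) - 1)
    pair_indices = np.column_stack([idx, idx + 1])
    pair_labels = np.where(y[idx] == y[idx + 1], 1, -1)
    return X[pair_indices], pair_labels


class ArrayToPairsWrapper:
    """Hypothetical wrapper: accepts (X, y) arrays, trains on pairs."""

    def __init__(self, pair_learner):
        self.pair_learner = pair_learner

    def fit(self, X, y):
        pairs, pair_labels = pairs_from_labels(X, y)
        self.pair_learner.fit(pairs, pair_labels)
        return self


class RecordingPairLearner:
    """Toy learner that only records the input it received."""

    def fit(self, pairs, pair_labels):
        self.pairs_shape_ = pairs.shape
        self.pair_labels_ = pair_labels
        return self


X = np.random.RandomState(0).rand(5, 3)
y = [0, 0, 1, 1, 0]
wrapped = ArrayToPairsWrapper(RecordingPairLearner()).fit(X, y)
```

Note that any check run against such a wrapper exercises the 
array-to-pairs conversion as much as the pair learner itself.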

However, neither option is really satisfying: the first would create 
a lot of code, and afterwards one could not see at a glance whether the 
estimator passes scikit-learn's check_estimator; the second adds so 
much wrapping that we would not really be testing the Weakly Supervised 
Metric Learner itself.

For more information, see this PR where the new feature is being 
implemented, including the constraints.ConstrainedDataset object, as 
well as a comment on what is problematic when using scikit-learn's 
check_estimator:
https://github.com/metric-learn/metric-learn/pull/85#issuecomment-375659820

Any advice on how to design the weakly supervised algorithms or the 
data structure containing the pairs of samples, or on how to use 
scikit-learn's check_estimator anyway, would be appreciated!

Thanks!

Best regards,

William