[scikit-learn] Why is subset invariance necessary for transform()?

Charles Pehlivanian pehlivaniancharles at gmail.com
Mon Jan 20 19:49:12 EST 2020


Not all data transformers have a transform method. For those that do,
subset invariance is assumed, as expressed
in check_methods_subset_invariance(): it must be the case that
T.transform(X)[i] == T.transform(X[i:i+1]) for every row i. This holds for
classic projections (PCA, kernel PCA, etc.), but not for some manifold
learning transformers (MDS, SpectralEmbedding, etc.). For those, the
optimal placement of the data in the embedding space is a constrained
optimization that may take the centroid of the dataset into account,
among other things.
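As a minimal sketch of the property (the data here is illustrative, not from the check itself), a projection like PCA transforms each row the same way whether it arrives alone or in a batch:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(20, 5)

T = PCA(n_components=2).fit(X)
batch = T.transform(X)

# Transform each row separately; for a projection like PCA this must
# match the batch transform row for row (subset invariance).
rows = np.vstack([T.transform(X[i:i + 1]) for i in range(X.shape[0])])

assert np.allclose(batch, rows)
```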

The manifold learners could expose "batch" out-of-sample (oos) transform()
methods, but those aren't implemented and wouldn't pass that test. Instead,
the ones that do implement transform(), such as LocallyLinearEmbedding, use
a pointwise version, essentially replacing a batch fit with a suboptimal
greedy one [for LocallyLinearEmbedding]:

    for i in range(X.shape[0]):
        X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i])
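To see why a batch method would violate the check (an illustrative sketch, not code from scikit-learn): embedding a subset with MDS does not reproduce the corresponding rows of the full embedding, because the placement is optimized jointly over whatever dataset was fit:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
X = rng.randn(30, 4)

# Embed the full dataset, then embed only the first five rows.
full = MDS(n_components=2, random_state=0).fit_transform(X)
part = MDS(n_components=2, random_state=0).fit_transform(X[:5])

# The subset embedding depends on the whole dataset it was fit on,
# so it does not match the corresponding rows of the full embedding.
assert not np.allclose(full[:5], part)
```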

Where to implement the batch transform() methods for MDS,
SpectralEmbedding, LocallyLinearEmbedding, etc?

Another verb? Both batch and pointwise versions? The latter is easy to
implement once the batch version exists. Relax the test conditions?
transform() is necessary for out-of-sample prediction, and therefore for
cross-validation. The batch versions should be preferred, although as it
stands, the pointwise versions are.
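The cross-validation point can be sketched concretely (dataset and hyperparameters below are illustrative): a pipeline's transformer must embed held-out folds it never saw during fit, which only works for manifold learners that implement transform() at all:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Works because LocallyLinearEmbedding implements a (pointwise)
# transform(); MDS or SpectralEmbedding in the same slot would fail,
# since they have no way to embed the held-out fold.
pipe = make_pipeline(
    LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0),
    KNeighborsClassifier(),
)
scores = cross_val_score(pipe, X, y, cv=3)
print(scores.mean())
```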

Thanks
Charles Pehlivanian

