<div dir="auto">I think allowing subset invariance to not hold is making stronger assumptions than we usually do about what it means to have a "test set". Having a transformation like this that relies on test set statistics implies that the test set is more than just selected samples, but rather that a large collection of samples is available at one time, and that it is in some sense sufficient or complete (no more samples are available that would give a better fit). So in a predictive modelling context you might have to set up your cross validation splits with this in mind. <div dir="auto"><br></div><div dir="auto">In terms of API, the subset invariance constraint allows us to assume that the transformation can be distributed or parallelized over samples. I'm not sure whether we have exploited that assumption within scikit-learn or whether related projects do so.</div><div dir="auto"><div dir="auto"><br></div><div dir="auto">I see the benefit of using such transformations in a prediction Pipeline, and really appreciate this challenge to our assumptions of what "transform" means.</div><div dir="auto"><br></div><div dir="auto">Joel</div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue., 21 Jan. 2020, 11:50 am Charles Pehlivanian, <<a href="mailto:pehlivaniancharles@gmail.com">pehlivaniancharles@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Not all data transformers have a transform method. For those that do, subset invariance is assumed as expressed in check_methods_subset_invariance(). It must be the case that T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for classic projections - PCA, kernel PCA, etc., but not for some manifold learning transformers - MDS, SpectralEmbedding, etc. For those, an optimal placement of the data in space is a constrained optimization, may take into account the centroid of the dataset etc. <div><br></div><div>The manifold learners have "batch" oos transform() methods that aren't implemented, and wouldn't pass that test. Instead, those that do - LocallyLinearEmbedding - use a pointwise version, essentially replacing a batch fit with a suboptimal greedy one [for LocallyLinearEmbedding]:</div><div><br></div><div> for i in range(X.shape[0]):<br> X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i])<br></div><div><br></div><div>Where to implement the batch transform() methods for MDS, SpectralEmbedding, LocallyLinearEmbedding, etc? </div><div><br></div><div>Another verb? Both batch and pointwise versions? The latter is easy to implement once the batch version exists. Relax the test conditions? transform() is necessary for oos testing, so necessary for cross validation. The batch versions should be preferred, although as it stands, the pointwise versions are. </div><div><br></div><div>Thanks</div><div>Charles Pehlivanian</div></div>
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn