[scikit-learn] Why is subset invariance necessary for transform()?

Andreas Mueller t3kcit at gmail.com
Tue Jan 21 20:33:28 EST 2020



On 1/21/20 8:23 PM, Charles Pehlivanian wrote:
> I understand - I'm kind of conflating the idea of a data sample with a test set. My view assumes there is a sample space of samples, which might require rethinking the cross-validation setup...
> I also think that part of it relies on the notion of online vs. offline algorithms. For offline fits, a batch (non-subset-invariant) transform is preferred. For a transformer that can only be used in an online sense, or is primarily used that way, keep the invariant.

> I see 3 options here - all I can say is that I don't vote for the first
> + No transform method on the manifold learners, so no cross-validation
This is what I thought we usually do. It looks like you said we are
doing a greedy transform.
I'm not sure I follow that. In particular, for spectral embedding, for
example, there is a pretty way to describe the transform, and that's what
we're doing. You could also look at doing transductive learning, but
that's not really the standard formulation, is it?

> + Pointwise, distributable, subset-invariant, suboptimal greedy transform
> + Non-distributable, non-subset-invariant, optimal batch transform
Can you give an example of that?
> -Charles
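To make the distinction between the last two options concrete, here is a minimal sketch. It is a toy, not an example from the thread and not a scikit-learn estimator: `TrainMeanCenterer`, `BatchMeanCenterer` and `is_subset_invariant` are hypothetical names chosen for illustration. Centering by the batch mean merely stands in for any transform that uses statistics of the data passed to transform(). It says nothing about the "optimal" batch embeddings discussed for MDS or SpectralEmbedding.

```python
import numpy as np


class TrainMeanCenterer:
    """Hypothetical pointwise transform: uses only statistics learned in fit()."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        # Each row is shifted by the same stored vector, independent of other rows.
        return X - self.mean_


class BatchMeanCenterer:
    """Hypothetical batch transform: uses statistics of the batch being transformed."""

    def fit(self, X):
        return self

    def transform(self, X):
        # The output for one row depends on which other rows are in the batch.
        return X - X.mean(axis=0)


def is_subset_invariant(T, X):
    """Compare T.transform(X)[i] with T.transform(X[i:i+1]) for every row i."""
    full = T.transform(X)
    rows = np.vstack([T.transform(X[i:i + 1]) for i in range(X.shape[0])])
    return bool(np.allclose(full, rows))


rng = np.random.RandomState(0)
X_train = rng.normal(size=(50, 3))
X_test = rng.normal(loc=1.0, size=(10, 3))

pointwise_ok = is_subset_invariant(TrainMeanCenterer().fit(X_train), X_test)
batch_ok = is_subset_invariant(BatchMeanCenterer().fit(X_train), X_test)
print("pointwise transform subset-invariant:", pointwise_ok)
print("batch transform subset-invariant:", batch_ok)
```

The batch variant fails the row-by-row comparison because a single-row batch is centered on itself, which is the kind of behaviour the subset invariance check is meant to catch.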
> On Mon., Jan. 20, 21:24:52 2020 <joel.nothman at gmail.com> wrote:
> I think allowing subset invariance to not hold is making stronger
> assumptions than we usually do about what it means to have a "test set".
> Having a transformation like this that relies on test set statistics
> implies that the test set is more than just selected samples, but rather
> that a large collection of samples is available at one time, and that it is
> in some sense sufficient or complete (no more samples are available that
> would give a better fit). So in a predictive modelling context you might
> have to set up your cross validation splits with this in mind.
>
> In terms of API, the subset invariance constraint allows us to assume that
> the transformation can be distributed or parallelized over samples. I'm not
> sure whether we have exploited that assumption within scikit-learn or
> whether related projects do so.
>
> I see the benefit of using such transformations in a prediction Pipeline,
> and really appreciate this challenge to our assumptions of what "transform"
> means.
>
> Joel
>
> On Tue., 21 Jan. 2020, 11:50 am Charles Pehlivanian,
> <pehlivaniancharles at gmail.com> wrote:
>
> > Not all data transformers have a transform method. For those that do,
> > subset invariance is assumed as expressed
> > in check_methods_subset_invariance(). It must be the case that
> > T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for classic
> > projections - PCA, kernel PCA, etc., but not for some manifold learning
> > transformers - MDS, SpectralEmbedding, etc. For those, an optimal placement
> > of the data in space is a constrained optimization, may take into account
> > the centroid of the dataset etc.
> >
> > The manifold learners have "batch" oos transform() methods that aren't
> > implemented, and wouldn't pass that test. Instead, those that do -
> > LocallyLinearEmbedding - use a pointwise version, essentially replacing a
> > batch fit with a suboptimal greedy one [for LocallyLinearEmbedding]:
> >
> > for i in range(X.shape[0]):
> >     X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i])
> >
> > Where to implement the batch transform() methods for MDS,
> > SpectralEmbedding, LocallyLinearEmbedding, etc?
> >
> > Another verb? Both batch and pointwise versions? The latter is easy to
> > implement once the batch version exists. Relax the test conditions?
> > transform() is necessary for oos testing, so necessary for cross
> > validation. The batch versions should be preferred, although as it stands,
> > the pointwise versions are.
> >
> > Thanks
> > Charles Pehlivanian
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
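
The condition quoted above, T.transform(X)[i] == T.transform(X[i:i+1]), and the remark that a subset-invariant transform can be distributed over samples can both be tried directly on PCA. This is only a sketch on random data (the shapes, seed, and chunk count are arbitrary choices). It is not the body of check_methods_subset_invariance(), just the same idea written out by hand.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(20, 5))

pca = PCA(n_components=2).fit(X_train)

# Transform the whole test set in one call.
full = pca.transform(X_test)

# Transform one sample at a time, as in T.transform(X[i:i+1]).
pointwise = np.vstack(
    [pca.transform(X_test[i:i + 1]) for i in range(X_test.shape[0])]
)

# Transform in independent chunks, as a distributed job over samples might.
chunked = np.vstack(
    [pca.transform(chunk) for chunk in np.array_split(X_test, 4)]
)

subset_invariant = bool(
    np.allclose(full, pointwise) and np.allclose(full, chunked)
)
print("PCA transform is subset-invariant on this data:", subset_invariant)
```

PCA passes because its transform only applies the mean and components stored during fit, so no statistic of the test batch enters the result.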
