Why is subset invariance necessary for transform()?
On 1/21/20 8:23 PM, Charles Pehlivanian wrote:
I understand - I'm conflating the idea of a data sample with the test set; my view assumes there is a sample space of samples, which might require rethinking the cross-validation setup...

I also think that part of it relies on the distinction between online and offline algorithms. For offline fits, a batch transform (non-subset-invariant) is preferred. For a transformer that can only be used in an online sense, or is primarily used that way, keep the invariant.
I see 3 options here - all I can say is that I don't vote for the first:
+ No transform method on the manifold learners, so no cross-validation
This is what I thought we usually do. It sounds like you said we are doing a greedy transform; I'm not sure I follow that. For spectral embedding in particular there is a clean way to describe the transform, and that's what we're doing. You could also look at doing transductive learning, but that's not really the standard formulation, is it?
+ Pointwise, distributable, subset-invariant, suboptimal greedy transform
+ Non-distributable, non-subset-invariant, optimal batch transform
Can you give an example of that?
-Charles
On Mon, Jan 20, 2020 at 21:24, Joel Nothman <joel.nothman at gmail.com> wrote:
I think allowing subset invariance to not hold is making stronger assumptions than we usually do about what it means to have a "test set". Having a transformation like this that relies on test set statistics implies that the test set is more than just selected samples, but rather that a large collection of samples is available at one time, and that it is in some sense sufficient or complete (no more samples are available that would give a better fit). So in a predictive modelling context you might have to set up your cross-validation splits with this in mind.
In terms of API, the subset invariance constraint allows us to assume that the transformation can be distributed or parallelized over samples. I'm not sure whether we have exploited that assumption within scikit-learn or whether related projects do so.
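Joel's API point can be made concrete. This is only an illustrative sketch (scikit-learn does not ship this helper): because a subset-invariant transform treats each sample independently, the test set can be split into chunks, transformed in parallel, and stacked back together.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train, X_test = rng.rand(60, 6), rng.rand(40, 6)
est = PCA(n_components=3).fit(X_train)

# Subset invariance means chunk-wise transforms can be computed
# independently (here in parallel via joblib) and concatenated.
chunks = np.array_split(X_test, 4)
parts = Parallel(n_jobs=2)(delayed(est.transform)(chunk) for chunk in chunks)
assert np.allclose(np.vstack(parts), est.transform(X_test))
```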
I see the benefit of using such transformations in a prediction Pipeline, and really appreciate this challenge to our assumptions of what "transform" means.
Joel
On Tue, 21 Jan 2020, 11:50 am Charles Pehlivanian <pehlivaniancharles at gmail.com> wrote:
> Not all data transformers have a transform method. For those that do, subset invariance is assumed, as expressed in check_methods_subset_invariance(). It must be the case that T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for classic projections - PCA, kernel PCA, etc. - but not for some manifold learning transformers - MDS, SpectralEmbedding, etc. For those, an optimal placement of the data in space is a constrained optimization that may take into account the centroid of the dataset, etc.
>
> The manifold learners have "batch" out-of-sample transform() methods that aren't implemented and wouldn't pass that test. Instead, those that do have one - LocallyLinearEmbedding - use a pointwise version, essentially replacing a batch fit with a suboptimal greedy one [for LocallyLinearEmbedding]:
>
>     for i in range(X.shape[0]):
>         X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i])
>
> Where to implement the batch transform() methods for MDS, SpectralEmbedding, LocallyLinearEmbedding, etc.? Another verb? Both batch and pointwise versions? The latter is easy to implement once the batch version exists. Relax the test conditions? transform() is necessary for out-of-sample testing, so necessary for cross-validation. The batch versions should be preferred, although as it stands, the pointwise versions are.
>
> Thanks,
> Charles Pehlivanian
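The property under discussion - the one check_methods_subset_invariance() enforces - can be stated in a few lines. A minimal sketch using PCA, which satisfies it:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(30, 5)
est = PCA(n_components=2).fit(X)

# Transforming a subset of rows must equal taking that subset of the
# full transform - this is what check_methods_subset_invariance() asserts.
mask = rng.rand(30) > 0.5
assert np.allclose(est.transform(X)[mask], est.transform(X[mask]))
```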
To summarize - for MDS and SpectralEmbedding, it looks like there is no transform method that will satisfy both

1. fit(X).transform(X) == fit_transform(X)
2. transform(X)[i:i+1] == transform(X[i:i+1])

That's because the current fit_transform doesn't factor nicely into those two steps: the last step returns a subset of eigenvectors of a modified Gram matrix. For PCA, kernel PCA, and LLE, fit_transform is something like: center the data, do U, S, V = SVD, project the data onto a submatrix of V. The last step is a matrix multiplication - the last step in the transform methods there is np.dot(...) - so it factors nicely.

There could be a transform_batch method for MDS which would satisfy 1., and transform could then call transform_batch row-wise to satisfy 2., but no single method will satisfy both. I don't know if there is appetite for the separation and the modification of unit tests involved.

Charles

On Tue, Jan 21, 2020 at 9:19 PM Charles Pehlivanian <pehlivaniancharles@gmail.com> wrote:
> This is what I thought we usually do. It looks like you said we are doing a greedy transform. I'm not sure I follow that. In particular for spectral embedding for example there is a pretty way to describe the transform and that's what we're doing. You could also look at doing transductive learning but that's not really the standard formulation, is it?
Batch transform becomes greedy if one does:

    for x_i in X:
        X_new_i = self.transform(x_i)
I said that LLE uses a greedy algorithm. The algorithm implemented is pointwise. It may be that that's the only approach (in which case it's not greedy), but I don't think so - it looks like all of the spectral embedding, LLE, and MDS transforms have batch versions. So I probably shouldn't call it greedy; taking a *true* batch transform and enclosing it in a loop like that is what I'm calling greedy. I'm honestly not sure whether the LLE case qualifies.
Spectral embedding - agreed, the method you refer to is implemented in fit_transform(). How do we apply it to out-of-sample points?
> + Non-distributable, non-subset-invariant, optimal batch transform
> Can you give an example of that?
Most of the manifold learners can be expressed as solutions to eigenvalue/eigenvector problems. For an MDS batch transform, form a new constrained double-centered distance matrix and solve a constrained least-squares problem that mimics the SVD solution to the eigenvalue problem. They're all like this - least-squares estimates for some constrained eigenvalue problem. The question is whether you want to solve the full problem, or solve for each point, adding one row and optimizing each time... though that would be subset-invariant.
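A sketch of what such an out-of-sample MDS placement can look like. This is not scikit-learn code: the function names and the triangulation formula (in the style of landmark MDS) are my own illustration. Applied one row at a time it is subset-invariant; the batch alternative Charles describes would instead re-solve the embedding jointly for all new rows.

```python
import numpy as np

def classical_mds(D2, n_components=2):
    """Classical MDS on a matrix of squared pairwise distances."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J                      # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]
    lam, V = vals[order], vecs[:, order]
    return V * np.sqrt(lam), lam, V, D2.mean(axis=0)

def mds_transform_oos(d2_new, lam, V, col_mean):
    """Place one out-of-sample point given its squared distances to the
    training points (distance triangulation, as in landmark MDS)."""
    return -0.5 * (V / np.sqrt(lam)).T @ (d2_new - col_mean)

# Sanity check: re-placing a training point recovers its embedding.
rng = np.random.RandomState(0)
P = rng.rand(12, 2)
D2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
Y, lam, V, m = classical_mds(D2, n_components=2)
assert np.allclose(mds_transform_oos(D2[:, 0], lam, V, m), Y[0])
```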
For this offline/batch approach to an out-of-sample transform, the only way I see to make it pass the tests is to enclose it in a loop, as above.
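The factoring claim earlier in the thread is easy to check: for PCA, fit followed by transform reproduces fit_transform, while MDS (in the scikit-learn releases around this thread) exposes no transform method at all. A sketch:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
X = rng.rand(30, 4)

# Property 1: fit(X).transform(X) == fit_transform(X) holds for PCA ...
assert np.allclose(PCA(n_components=2).fit(X).transform(X),
                   PCA(n_components=2).fit_transform(X))

# ... while MDS offers no transform method to test at all.
assert not hasattr(MDS(), "transform")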
On Tue, Jan 21, 2020 at 8:35 PM Andreas Mueller <t3kcit@gmail.com> wrote:
Hi,

I don't see what stop words are used by CountVectorizer with stop_words='english':

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction...

Is there a way to figure it out? Thanks.

--
Regards,
Peng
Hi Peng,

check out https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_ext...

Best,
Sebastian
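The list Sebastian points to can also be inspected at runtime rather than in the source - a quick sketch:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

# stop_words='english' selects the built-in frozenset:
vec = CountVectorizer(stop_words="english")
assert vec.get_stop_words() == ENGLISH_STOP_WORDS
assert "the" in ENGLISH_STOP_WORDS
assert len(ENGLISH_STOP_WORDS) > 100  # a few hundred entries
```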
On Jan 27, 2020, at 2:30 PM, Peng Yu <pengyu.ut@gmail.com> wrote:
See also https://www.aclweb.org/anthology/W18-2502/ for a critique of this and other stop word lists.
Hi Peng,

I believe the set of English stop words used across all token vectorizers can be found in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_ext....

Cheers,
Jon

On Mon, Jan 27, 2020 at 3:33 PM Peng Yu <pengyu.ut@gmail.com> wrote:
Hi,

https://github.com/scikit-learn/scikit-learn/blob/b194674c42d54b26137a456c51...

Regards,
Christian

Peng Yu <pengyu.ut@gmail.com> wrote on Mon, 27 Jan 2020, 21:31:
participants (7)
- Andreas Mueller
- Charles Pehlivanian
- Christian Braune
- Joel Nothman
- Jonathan Cusick
- Peng Yu
- Sebastian Raschka