Latent Semantic Analysis (LSA) and TruncatedSVD
Hi all,

I have a question about using the TruncatedSVD method for performing Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply applying TruncatedSVD to a tf-idf matrix is sufficient (cf. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.Trunc...), but I'm wondering about that.

As far as I understand, for LSA one computes a truncated SVD decomposition of the tf-idf matrix X (n_features x n_samples), X ≈ U @ Sigma @ V.T, and then for a document vector d the projection is computed as d_proj = d.T @ U @ Sigma⁻¹ (source: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf). However, TruncatedSVD.fit_transform only computes d_proj = d.T @ U and, what's more, does not store the singular values (Sigma) internally, so the normalization cannot easily be applied afterwards. (The above notations are transposed with respect to those in the scikit-learn docs.)

For instance, I have tried reproducing the LSA decomposition from the literature, and I do not get the expected results unless I perform an additional normalization by the Sigma matrix: https://gist.github.com/rth/3af30c60bece7db4207821a6dddc5e8d

Am I missing something here? Thank you,

-- Roman
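For concreteness, here is a small sketch of the normalization in question, with a made-up toy corpus. In scikit-learn's row-sample notation X ≈ U @ Sigma @ V.T, fit_transform returns U @ Sigma, and the extra step divides the singular values out. (Note: the `singular_values_` attribute used below was only added in scikit-learn 0.19, after this thread; before that, Sigma had to be recomputed, e.g. with scipy.sparse.linalg.svds.)

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus, made up for illustration.
docs = ["ship boat ocean", "ocean voyage ship", "boat trip ocean", "voyage trip"]
X = TfidfVectorizer().fit_transform(docs)     # shape (n_samples, n_features)

# fit_transform returns U @ Sigma -- the projection *without* Sigma^-1.
svd = TruncatedSVD(n_components=2, algorithm="arpack")
X_lsa = svd.fit_transform(X)

# The IR-book projection divides the singular values out.
X_lsa_normed = X_lsa / svd.singular_values_

# Sanity check: after this normalization the columns are orthonormal,
# i.e. we have recovered the left singular vectors U.
assert np.allclose(X_lsa_normed.T @ X_lsa_normed, np.eye(2), atol=1e-8)
```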
Looks like they apply whitening, which is not implemented in TruncatedSVD. I guess we could add that option. It's equivalent to using a StandardScaler after the TruncatedSVD. Can you try and see if that reproduces the results?

On 08/26/2016 10:09 AM, Roman Yurchak wrote:
Hi all,
I have a question about using the TruncatedSVD method for performing Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply applying TruncatedSVD to a tf-idf matrix is sufficient (cf. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.Trunc...), but I'm wondering about that.
As far as I understand, for LSA one computes a truncated SVD decomposition of the tf-idf matrix X (n_features x n_samples), X ≈ U @ Sigma @ V.T, and then for a document vector d the projection is computed as d_proj = d.T @ U @ Sigma⁻¹ (source: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf). However, TruncatedSVD.fit_transform only computes d_proj = d.T @ U and, what's more, does not store the singular values (Sigma) internally, so the normalization cannot easily be applied afterwards. (The above notations are transposed with respect to those in the scikit-learn docs.)
For instance, I have tried reproducing LSA decomposition from literature and I'm not getting the expected results unless I perform an additional normalization by the Sigma matrix: https://gist.github.com/rth/3af30c60bece7db4207821a6dddc5e8d
I was wondering if I am missing something here? Thank you,
I am not sure this is exactly the same, because we do not center the data in the TruncatedSVD case (as opposed to the real PCA case, where whitening is the same as calling StandardScaler).

Having an option to normalize the transformed data by Sigma seems like a good idea, but we should probably not call that whitening.

-- Olivier
BTW Roman, the examples in your gist would make a great non-regression test for this new feature. Please feel free to submit a PR. -- Olivier
If you do "with_mean=False" it should be the same, right?

On 08/27/2016 12:20 PM, Olivier Grisel wrote:
I am not sure this is exactly the same because we do not center the data in the TruncatedSVD case (as opposed to the real PCA case where whitening is the same as calling StandardScaler).
Having an option to normalize the transformed data by sigma seems like a good idea but we should probably not call that whitening.
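A small numerical check of this point (constructed here for illustration, not taken from the thread): even with with_mean=False, StandardScaler still divides each column by its standard deviation computed *about the column mean*. On U @ Sigma (which is what fit_transform returns), that reproduces the Sigma⁻¹ normalization only up to per-column factors that depend on the column means of U, so the two are not exactly equivalent.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hand-built orthonormal columns: u1 has a nonzero mean, u2 has zero mean,
# mimicking the U factor of a truncated SVD.
u1 = np.array([2.0, 1.0, 1.0, 0.0]) / np.sqrt(6)
u2 = np.array([1.0, -1.0, -1.0, 1.0]) / 2
U = np.column_stack([u1, u2])
sigma = np.array([3.0, 2.0])
X_lsa = U * sigma                      # stands in for TruncatedSVD.fit_transform output

by_sigma = X_lsa / sigma               # exact Sigma^-1 normalization -> recovers U
scaled = StandardScaler(with_mean=False).fit_transform(X_lsa)

# The per-column scale factor is sigma_i / std_i, where std_i is computed
# about the column mean: sqrt(n) = 2 for the zero-mean column, but a
# mean-dependent value (here 2*sqrt(3)) for the other.
factors = scaled[0] / by_sigma[0]
print(factors)                         # ≈ [3.4641, 2.0]
```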
Thank you for all your responses! In LSA, what is equivalent, I think, is:

- to apply an L2 normalization (not the StandardScaler) after the LSA and then compute the cosine similarity between document vectors simply as a dot product, or
- not to apply the L2 normalization and call the `cosine_similarity` function instead.

I have applied this normalization to the previous example, and it indeed produces equivalent results (i.e. it does not solve the problem). Opening an issue on this for further discussion: https://github.com/scikit-learn/scikit-learn/issues/7283

Thanks for your feedback!

-- Roman

On 28/08/16 18:20, Andy wrote:
If you do "with_mean=False" it should be the same, right?
On 08/27/2016 12:20 PM, Olivier Grisel wrote:
I am not sure this is exactly the same because we do not center the data in the TruncatedSVD case (as opposed to the real PCA case where whitening is the same as calling StandardScaler).
Having an option to normalize the transformed data by sigma seems like a good idea but we should probably not call that whitening.
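The two routes Roman describes can be checked numerically (an illustrative sketch; the random vectors here just stand in for LSA document vectors):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X_lsa = rng.standard_normal((5, 3))   # stand-in for LSA document vectors

# Route 1: L2-normalize the rows, then cosine similarity is a plain dot product.
X_unit = normalize(X_lsa)
S_dot = X_unit @ X_unit.T

# Route 2: skip the normalization and call cosine_similarity directly.
S_cos = cosine_similarity(X_lsa)

assert np.allclose(S_dot, S_cos)
```

Note that both routes only make the similarities scale-invariant per document; neither introduces the per-component Sigma normalization discussed earlier, which is why it is a separate question.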
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
participants (4)
- Andreas Mueller
- Andy
- Olivier Grisel
- Roman Yurchak