[scikit-learn] Latent Semantic Analysis (LSA) and TruncatedSVD

Roman Yurchak rth.yurchak at gmail.com
Fri Aug 26 10:09:15 EDT 2016


Hi all,

I have a question about using the TruncatedSVD method for performing
Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply
applying TruncatedSVD to a tf-idf matrix is sufficient (cf.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html),
but I'm not sure that is the whole story.
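
For reference, here is a minimal sketch of the usage I read the docs as
suggesting (the toy corpus and n_components are made up for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    # made-up toy corpus, for illustration only
    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs make good pets"]

    # tf-idf matrix, shape (n_samples, n_features) in scikit-learn's convention
    X = TfidfVectorizer().fit_transform(docs)

    # per the docs, this alone would already be the LSA projection
    svd = TruncatedSVD(n_components=2, random_state=0)
    X_lsa = svd.fit_transform(X)   # shape (n_samples, n_components)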

As far as I understand, for LSA one computes a truncated SVD
decomposition of the tf-idf matrix X (n_features x n_samples),
      X ≈ U @ Sigma @ V.T
and then, for a document vector d, the projection is computed as
      d_proj = d.T @ U @ Sigma⁻¹
(source: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf)
However, TruncatedSVD.fit_transform only computes
      d_proj = d.T @ U
and, what's more, does not store the singular values (Sigma)
internally, so the rescaling by Sigma⁻¹ cannot easily be applied
afterwards.
(The notation above is transposed with respect to that in the
scikit-learn docs.)
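
To make this concrete, here is a minimal sketch with the same kind of
made-up toy corpus; since Sigma is not stored on the estimator, I
recover it from the column norms of the training projection (each
column is a unit-norm singular vector scaled by its singular value):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs make good pets"]
    X = TfidfVectorizer().fit_transform(docs)

    svd = TruncatedSVD(n_components=2, random_state=0)
    X_proj = svd.fit_transform(X)      # rows are d.T @ U, without Sigma^-1

    # on the training data, the column norms of X_proj recover Sigma
    sigma = np.linalg.norm(X_proj, axis=0)
    X_proj_lsi = X_proj / sigma        # d.T @ U @ Sigma^-1, as in the IR book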

For instance, I have tried reproducing an LSA decomposition from the
literature, and I do not get the expected results unless I perform an
additional normalization by the Sigma matrix:
https://gist.github.com/rth/3af30c60bece7db4207821a6dddc5e8d
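
The same discrepancy can be seen with a self-contained check along
these lines (again a made-up corpus, with scipy's svds as the reference
decomposition; the signs and column order of an SVD are not unique,
hence the reordering and np.abs):

    import numpy as np
    from scipy.sparse.linalg import svds
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs make good pets"]
    X = TfidfVectorizer().fit_transform(docs)

    # reference: direct SVD of X.T (n_features x n_samples), as in the IR book
    U, s, Vt = svds(X.T, k=2)
    order = np.argsort(s)[::-1]        # svds returns singular values ascending
    U, s = U[:, order], s[order]
    d_proj_ref = X.dot(U) / s          # d.T @ U @ Sigma^-1, one row per document

    X_proj = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
    sigma = np.linalg.norm(X_proj, axis=0)

    print(np.allclose(np.abs(X_proj / sigma), np.abs(d_proj_ref)))  # True
    print(np.allclose(np.abs(X_proj), np.abs(d_proj_ref)))          # False here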

Am I missing something here?
Thank you,
-- 
Roman

