[scikit-learn] Latent Semantic Analysis (LSA) and TruncatedSVD

Fri Aug 26 10:55:41 EDT 2016

Looks like they apply whitening, which is not implemented in TruncatedSVD.
I guess we could add that option. It should be roughly equivalent to using a
StandardScaler after the TruncatedSVD.
Can you try and see if that reproduces the results?
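
Something along these lines might be a starting point for the comparison.
It is only a rough sketch: the toy corpus and the variable names are made
up for illustration, and Sigma is recovered from the column norms of the
transformed data rather than read from the estimator. Note that
StandardScaler also centers each column and scales to unit variance
rather than unit norm, so the two results should not be expected to match
exactly.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy corpus, made up for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
    "investors watch the stock markets",
]

# tf-idf matrix in the scikit-learn layout: (n_samples, n_features).
X = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, algorithm="arpack", random_state=0)
X_svd = svd.fit_transform(X)  # U @ Sigma in the scikit-learn convention

# The columns of U @ Sigma have norms equal to the singular values,
# so Sigma can be recovered from the transformed data.
sigma = np.linalg.norm(X_svd, axis=0)

# LSA projection with the extra Sigma^-1 factor: X @ V @ Sigma^-1 (= U).
X_lsa = X_svd / sigma

# Suggested alternative: rescale the SVD output with StandardScaler.
X_scaled = StandardScaler().fit_transform(X_svd)

print(np.round(X_lsa.T @ X_lsa, 6))       # should be close to the identity
print(np.round(X_scaled.std(axis=0), 6))  # per-column std after scaling
```

If the literature results only show up with X_lsa and not with X_scaled,
that would suggest the centering makes a difference here.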

On 08/26/2016 10:09 AM, Roman Yurchak wrote:
> Hi all,
>
> I have a question about using the TruncatedSVD method for performing
> Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply
> applying TruncatedSVD to a tf-idf matrix is sufficient (cf.
> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html),
> but I'm wondering about that.
>
> As far as I understand, for LSA one computes a truncated SVD
> decomposition of the tf-idf matrix X (n_features x n_samples),
>        X ≈ U @ Sigma @ V.T
> and then for a document vector d, the projection is computed as,
>        d_proj = d.T @ U @ Sigma⁻¹
> (source: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf)
> However, TruncatedSVD.fit_transform only computes,
>        d_proj = d.T @ U
> and, what's more, does not store the singular values (Sigma) internally,
> so the normalization cannot easily be applied afterwards.
> (The above notation is transposed with respect to that in the
> scikit-learn docs.)
>
> For instance, I have tried reproducing an LSA decomposition from the
> literature, and I'm not getting the expected results unless I perform an
> additional normalization by the Sigma matrix:
> https://gist.github.com/rth/3af30c60bece7db4207821a6dddc5e8d
>
> I was wondering if I am missing something here?
> Thank you,