[scikit-learn] Latent Dirichlet Allocation transformation of data with pre-determined topic_word distribution

Wed Dec 7 10:22:21 EST 2016

Hello,

I am running Latent Dirichlet Allocation 100 times on bootstrapped versions
of a dataset, gathering up the topic_word matrix from each run
(components_), and merging it into a final cleaner topic_word matrix.
Because I am bootstrapping documents, not every document is in every run
and so it isn't clear how to get a final merged doc_topic distribution. I
was wondering if there is any way to run the LatentDirichletAllocation
transform method with a pre-determined components_ matrix. I tried this in
a few ways, none of which worked.

from sklearn.decomposition import LatentDirichletAllocation as skLDA
mod = skLDA(n_topics=7, learning_method='batch', doc_topic_prior=.1,
            topic_word_prior=.1, evaluate_every=1)
mod.components_ = median_beta # my collapsed estimates of this matrix
topic_usage = mod.transform(word_matrix)

crashes with:

AttributeError: 'LatentDirichletAllocation' object has no attribute
'exp_dirichlet_component_'

I try to correct this with:

    mod.components_ = median_beta
    mod.exp_dirichlet_component_ = np.exp(
        _dirichlet_expectation_2d(mod.components_))
    mod._init_latent_vars(mod.components_.shape[1])

Now transform runs to completion, but the results look nothing like what I
would expect after looking at multiple LDA runs. Note that this kind of
functionality is available for NMF, where you can run:

    (W, H, niter) = non_negative_factorization(
        wordmatrix, H=median_beta,
        n_components=median_beta.shape[0], update_H=False)
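
For reference, here is a minimal self-contained sketch of the NMF call above.
The data are synthetic stand-ins (the shapes, the random median_beta and the
random word_matrix are assumptions for illustration, not my real data). It
only shows that update_H=False leaves H untouched and solves for W alone,
which is the behaviour I am hoping to get from LDA's transform:

```python
import numpy as np
from sklearn.decomposition import non_negative_factorization

rng = np.random.RandomState(0)

# Hypothetical sizes and data: 7 topics, 30 words, 40 documents.
n_topics, n_words, n_docs = 7, 30, 40
median_beta = rng.rand(n_topics, n_words)      # fixed topic-word matrix (H)
true_usage = rng.rand(n_docs, n_topics)        # made-up document loadings
word_matrix = true_usage.dot(median_beta)      # non-negative doc-word matrix

# update_H=False keeps H fixed and estimates only W.
W, H, n_iter = non_negative_factorization(
    word_matrix, H=median_beta, n_components=median_beta.shape[0],
    update_H=False, max_iter=500, random_state=0)

# Row-normalise W so each document's topic weights sum to 1.
# The floor on the row sums only guards against division by zero.
row_sums = np.maximum(W.sum(axis=1, keepdims=True), np.finfo(float).tiny)
topic_usage = W / row_sums
```

Row-normalising W is just one convenient way to compare it with a doc_topic
distribution; NMF loadings are not probabilities, so treat this as a rough
analogue rather than an equivalent of the LDA posterior.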

Thanks for any insight or help you can provide.

Best,
Dylan