<div dir="ltr">Hi Markus,<div><br></div><div><div><font color="#212121">I find that in current LDA implementation we included</font></div>"E[log p(beta | eta) - log q (beta | lambda)]" in the approx bound function and use it to calculate perplexity.</div><div><div>But this part was not included in the likelihood function in Blei's C implementation.</div><div><br></div><div>Maybe this caused some difference.</div><div>(I am not sure which one is correct. will need some time to compare the difference.)</div></div><div><br></div><div><div>Best,</div><div>Chyi-Kwei</div></div><div><br></div><div>reference code:</div><div>sklearn <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/online_lda.py#L707-L709">https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/online_lda.py#L707-L709</a></div><div><div><br></div><div>original onlineldavb <a href="https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py#L384-L388">https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py#L384-L388</a></div><div><br></div><div>Blei's C implementation <a href="https://github.com/blei-lab/lda-c/blob/master/lda-inference.c#L94-L127">https://github.com/blei-lab/lda-c/blob/master/lda-inference.c#L94-L127</a></div></div><div><br></div><div><br></div><div class="gmail_quote"><div dir="ltr">On Wed, Oct 4, 2017 at 7:56 AM Markus Konrad <<a href="mailto:markus.konrad@wzb.eu" target="_blank">markus.konrad@wzb.eu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi there,<br>

I'm trying to find the optimal number of topics for topic modeling with
Latent Dirichlet Allocation. I implemented a 5-fold cross-validation method
similar to the one described and implemented in R in [1]: I split the full
data into five equal-sized chunks, and for each fold (`cur_fold`), 4 of the 5
chunks are used for training and 1 for validation using the `perplexity()`
method on the held-out data set:

```
from sklearn.decomposition import LatentDirichletAllocation

# split the document-term matrix into training and validation rows for this fold
dtm_train = data[split_folds != cur_fold, :]
dtm_valid = data[split_folds == cur_fold, :]

# fit LDA on the 4 training chunks ...
lda_instance = LatentDirichletAllocation(**params)
lda_instance.fit(dtm_train)

# ... and evaluate perplexity on the held-out chunk
perpl = lda_instance.perplexity(dtm_valid)
```

This is done for a set of parameters, basically for a varying number of
topics (n_components).
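
The full loop looks roughly like this (just a sketch, not my exact script;
`topic_range` and the fold assignment are placeholders, and `data` is the full
document-term matrix):

```
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

n_folds = 5
topic_range = [10, 25, 50, 100, 150]   # candidate values for n_components (placeholder)

# assign each document to one of 5 roughly equal-sized folds
split_folds = np.arange(data.shape[0]) % n_folds
np.random.shuffle(split_folds)

results = []  # (n_components, fold, held-out perplexity)
for n_topics in topic_range:
    for cur_fold in range(n_folds):
        dtm_train = data[split_folds != cur_fold, :]
        dtm_valid = data[split_folds == cur_fold, :]

        lda_instance = LatentDirichletAllocation(n_components=n_topics,
                                                 learning_method='batch')
        lda_instance.fit(dtm_train)
        perpl = lda_instance.perplexity(dtm_valid)
        results.append((n_topics, cur_fold, perpl))
```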

I tried this out with a number of different data sets, for example with the
"Associated Press" data mentioned in [1], which is the sample data for David
M. Blei's LDA C implementation [2]. Using the same data, I would expect to get
results similar to those in [1], which found that a model with ~100 topics
fits the AP data best. However, in my experiments the perplexity grows
exponentially with the number of topics, so the "best" model is always the one
with the fewest topics. The same happens with other data sets, and also when I
calculate the perplexity on the full training data alone (i.e. without
cross-validation on held-out data).

Does anyone have an idea why these results are not consistent with those from
[1]? Is perplexity() not the right method to use when evaluating held-out
data? Could it be a problem that some columns of the training data's
term-frequency matrix are all zero?
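
For reference, this is roughly how I could check for (and drop) such all-zero
columns before fitting; it assumes `dtm_train`/`dtm_valid` are the sparse
matrices from the snippet above, and the `_nz` names are just placeholders:

```
import numpy as np

# count how many vocabulary terms never occur in the training split
train_term_counts = np.asarray(dtm_train.sum(axis=0)).ravel()
nonzero_terms = train_term_counts > 0
print("all-zero columns in dtm_train:", np.sum(~nonzero_terms))

# optionally restrict both splits to terms that occur in the training data
dtm_train_nz = dtm_train[:, nonzero_terms]
dtm_valid_nz = dtm_valid[:, nonzero_terms]
```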

Best,
Markus

[1] http://ellisp.github.io/blog/2017/01/05/topic-model-cv
[2] https://web.archive.org/web/20160930175144/http://www.cs.princeton.edu/~blei/lda-c/index.html
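
Sketch mentioned above, for reference: this is my rough reading of the
Hoffman et al. online LDA formulation that sklearn and onlineldavb follow
(please double-check against the linked code). The bound that gets turned
into a perplexity is, per corpus:

```
\mathcal{L} \;=\; \sum_d \mathbb{E}_q\!\left[\log p(w_d, \theta_d, z_d \mid \alpha, \beta) - \log q(\theta_d, z_d)\right]
            \;+\; \mathbb{E}_q\!\left[\log p(\beta \mid \eta) - \log q(\beta \mid \lambda)\right]

\text{perplexity}(w) \;=\; \exp\!\left( -\, \mathcal{L} \,/\, \textstyle\sum_d N_d \right)
```

where N_d is the number of tokens in document d. The second E_q[...] term is
the topic part that sklearn's approximate bound includes and that lda-c's
per-document likelihood does not, which could explain why the numbers differ.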