<div dir="ltr">Hi Markus,<div><br></div><div><div><font color="#212121">I find that in current LDA implementation we included</font></div>"E[log p(beta | eta) - log q (beta | lambda)]" in the approx bound function and use it to calculate perplexity.</div><div><div>But this part was not included in the likelihood function in Blei's C implementation.</div><div><br></div><div>Maybe this caused some difference.</div><div>(I am not sure which one is correct. will need some time to compare the difference.)</div></div><div><br></div><div><div>Best,</div><div>Chyi-Kwei</div></div><div><br></div><div>reference code:</div><div>sklearn <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/online_lda.py#L707-L709">https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/online_lda.py#L707-L709</a></div><div><div><br></div><div>original onlineldavb <a href="https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py#L384-L388">https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py#L384-L388</a></div><div><br></div><div>Blei's C implementation <a href="https://github.com/blei-lab/lda-c/blob/master/lda-inference.c#L94-L127">https://github.com/blei-lab/lda-c/blob/master/lda-inference.c#L94-L127</a></div></div><div><br></div><div><br></div><div class="gmail_quote"><div dir="ltr">On Wed, Oct 4, 2017 at 7:56 AM Markus Konrad <<a href="mailto:markus.konrad@wzb.eu" target="_blank">markus.konrad@wzb.eu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi there,<br>

I'm trying to find the optimal number of topics for topic modeling with
Latent Dirichlet Allocation. I implemented a 5-fold cross-validation method
similar to the one described and implemented in R in [1]: I split the full
data into five equal-sized chunks, and for each fold (`cur_fold`), 4 of the 5
chunks are used for training and 1 for validation using the `perplexity()`
method on the held-out data set:

```
from sklearn.decomposition import LatentDirichletAllocation

# split the document-term matrix into training and validation rows for this fold
dtm_train = data[split_folds != cur_fold, :]
dtm_valid = data[split_folds == cur_fold, :]

# fit LDA on the 4 training chunks ...
lda_instance = LatentDirichletAllocation(**params)
lda_instance.fit(dtm_train)

# ... and evaluate perplexity on the held-out chunk
perpl = lda_instance.perplexity(dtm_valid)
```

This is done for a set of parameters, basically for a varying number of
topics (n_components).
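
The full loop looks roughly like this (just a sketch, not my exact script;
`topic_range` and the fold assignment are placeholders, and `data` is the full
document-term matrix):

```
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

n_folds = 5
topic_range = [10, 25, 50, 100, 150]   # candidate values for n_components (placeholder)

# assign each document to one of 5 roughly equal-sized folds
split_folds = np.arange(data.shape[0]) % n_folds
np.random.shuffle(split_folds)

results = []  # (n_components, fold, held-out perplexity)
for n_topics in topic_range:
    for cur_fold in range(n_folds):
        dtm_train = data[split_folds != cur_fold, :]
        dtm_valid = data[split_folds == cur_fold, :]

        lda_instance = LatentDirichletAllocation(n_components=n_topics,
                                                 learning_method='batch')
        lda_instance.fit(dtm_train)
        perpl = lda_instance.perplexity(dtm_valid)
        results.append((n_topics, cur_fold, perpl))
```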

I tried this out with a number of different data sets, for example with the
"Associated Press" data mentioned in [1], which is the sample data for David
M. Blei's LDA C implementation [2]. Using the same data, I would expect to get
results similar to those in [1], which found that a model with ~100 topics
fits the AP data best. However, in my experiments the perplexity grows
exponentially with the number of topics, so the "best" model is always the one
with the fewest topics. The same happens with other data sets, and also when I
calculate the perplexity on the full training data alone (i.e. without
cross-validation on held-out data).

Does anyone have an idea why these results are not consistent with those from
[1]? Is perplexity() not the right method to use when evaluating held-out
data? Could it be a problem that some columns of the training data's
term-frequency matrix are all zero?
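
For reference, this is roughly how I could check for (and drop) such all-zero
columns before fitting; it assumes `dtm_train`/`dtm_valid` are the sparse
matrices from the snippet above, and the `_nz` names are just placeholders:

```
import numpy as np

# count how many vocabulary terms never occur in the training split
train_term_counts = np.asarray(dtm_train.sum(axis=0)).ravel()
nonzero_terms = train_term_counts > 0
print("all-zero columns in dtm_train:", np.sum(~nonzero_terms))

# optionally restrict both splits to terms that occur in the training data
dtm_train_nz = dtm_train[:, nonzero_terms]
dtm_valid_nz = dtm_valid[:, nonzero_terms]
```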

Best,
Markus

[1] http://ellisp.github.io/blog/2017/01/05/topic-model-cv
[2] https://web.archive.org/web/20160930175144/http://www.cs.princeton.edu/~blei/lda-c/index.html
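
Sketch mentioned above, for reference: this is my rough reading of the
Hoffman et al. online LDA formulation that sklearn and onlineldavb follow
(please double-check against the linked code). The bound that gets turned
into a perplexity is, per corpus:

```
\mathcal{L} \;=\; \sum_d \mathbb{E}_q\!\left[\log p(w_d, \theta_d, z_d \mid \alpha, \beta) - \log q(\theta_d, z_d)\right]
            \;+\; \mathbb{E}_q\!\left[\log p(\beta \mid \eta) - \log q(\beta \mid \lambda)\right]

\text{perplexity}(w) \;=\; \exp\!\left( -\, \mathcal{L} \,/\, \textstyle\sum_d N_d \right)
```

where N_d is the number of tokens in document d. The second E_q[...] term is
the topic part that sklearn's approximate bound includes and that lda-c's
per-document likelihood does not, which could explain why the numbers differ.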