[scikit-learn] Using perplexity from LatentDirichletAllocation for cross validation of Topic Models

chyi-kwei yau chyikwei.yau at gmail.com
Fri Oct 6 12:38:36 EDT 2017


Hi Markus,

I found that in the current LDA implementation we include
"E[log p(beta | eta) - log q(beta | lambda)]" in the approximate bound
function and use it to calculate perplexity.
However, this part is not included in the likelihood function in Blei's C
implementation.

Maybe this causes some of the difference.
(I am not sure which one is correct; I will need some time to compare
them.)
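
For context, sklearn turns that approximate bound into perplexity roughly
like this (a simplified sketch, not the exact library code):

```
import numpy as np

def perplexity_from_bound(bound, X):
    # sketch: perplexity = exp(-bound / total word count), so any extra
    # term in the approximate bound (such as the beta/lambda term above)
    # shifts the reported perplexity directly
    word_count = X.sum()
    return np.exp(-bound / word_count)
```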

Best,
Chyi-Kwei

reference code:
sklearn
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/online_lda.py#L707-L709

original onlineldavb
https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py#L384-L388

Blei's C implementation
https://github.com/blei-lab/lda-c/blob/master/lda-inference.c#L94-L127
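
For illustration, the term in question has this form in the onlineldavb
code above (a rough sketch of those lines; `lam` is the variational
topic-word parameter matrix and `eta` the scalar symmetric topic prior):

```
import numpy as np
from scipy.special import gammaln, psi

def beta_prior_term(lam, eta):
    # E[log p(beta | eta) - log q(beta | lambda)] for Dirichlet
    # variational parameters lam of shape (n_topics, n_words)
    n_words = lam.shape[1]
    # E[log beta] under the variational posterior q(beta | lambda)
    e_log_beta = psi(lam) - psi(lam.sum(axis=1, keepdims=True))
    score = np.sum((eta - lam) * e_log_beta)
    score += np.sum(gammaln(lam) - gammaln(eta))
    score += np.sum(gammaln(eta * n_words) - gammaln(lam.sum(axis=1)))
    return score
```

Note that the magnitude of this term scales with the size of lam
(n_topics x n_words), so including or excluding it matters more as the
number of topics grows.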


On Wed, Oct 4, 2017 at 7:56 AM Markus Konrad <markus.konrad at wzb.eu> wrote:

> Hi there,
>
> I'm trying to find the optimal number of topics for topic modeling with
> Latent Dirichlet Allocation. I implemented a 5-fold cross-validation
> method similar to the one described and implemented in R here [1]. I
> basically split the full data into 5 equal-sized chunks. Then for each
> fold (`cur_fold`), 4 of the 5 chunks are used for training and 1 for
> validation using the `perplexity()` method on the held-out data set:
>
> ```
> from sklearn.decomposition import LatentDirichletAllocation
>
> # data: document-term matrix; split_folds: array assigning a fold
> # number to each document; params: dict of LDA parameters
> dtm_train = data[split_folds != cur_fold, :]
> dtm_valid = data[split_folds == cur_fold, :]
>
> lda_instance = LatentDirichletAllocation(**params)
> lda_instance.fit(dtm_train)
>
> # perplexity on the held-out fold (lower is better)
> perpl = lda_instance.perplexity(dtm_valid)
> ```
>
> This is done for a set of parameters, mainly for a varying number of
> topics (n_components); a sketch of the full loop follows below.
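>
> A minimal sketch of that loop (the KFold splitting and the topic-number
> grid here are my own illustration, not the exact original setup):
>
> ```
> import numpy as np
> from sklearn.decomposition import LatentDirichletAllocation
> from sklearn.model_selection import KFold
>
> # data: document-term matrix of shape (n_docs, n_vocab)
> for n_topics in (20, 50, 100, 200):
>     fold_perplexities = []
>     for train_idx, valid_idx in KFold(n_splits=5, shuffle=True).split(data):
>         lda = LatentDirichletAllocation(n_components=n_topics)
>         lda.fit(data[train_idx, :])
>         fold_perplexities.append(lda.perplexity(data[valid_idx, :]))
>     print(n_topics, np.mean(fold_perplexities))
> ```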
>
> I tried this out with a number of different data sets, for example with
> the "Associated Press" data mentioned in [1], which is the sample data
> for David M. Blei's LDA C implementation [2].
> Using the same data, I would expect that I get similar results as in
> [1], which found that a model with ~100 topics fits the AP data best.
> However, my experiments consistently show the perplexity growing
> exponentially with the number of topics. The "best" model is always the
> one with the lowest number of topics. The same happens with other data
> sets, and also when calculating the perplexity on the full training data
> alone (i.e., with no cross-validation on held-out data).
>
> Does anyone have an idea why these results are not consistent with those
> from [1]? Is the perplexity() method not the correct one to use when
> evaluating held-out data? Could it be a problem that some of the columns
> of the training-data term-frequency matrix are all-zero?
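>
> (For reference, a quick check for such all-zero columns; `dtm_train` as
> in the snippet above:)
>
> ```
> import numpy as np
> # count vocabulary terms that never occur in the training folds
> zero_cols = np.asarray(dtm_train.sum(axis=0)).ravel() == 0
> print(zero_cols.sum(), "all-zero columns in dtm_train")
> ```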
>
> Best,
> Markus
>
>
> [1] http://ellisp.github.io/blog/2017/01/05/topic-model-cv
> [2]
>
> https://web.archive.org/web/20160930175144/http://www.cs.princeton.edu/~blei/lda-c/index.html
>