[scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

Andreas Mueller t3kcit at gmail.com
Mon Sep 18 12:59:44 EDT 2017


For very few documents, Gibbs sampling is likely to work better. Or
rather, Gibbs sampling usually works better given enough runtime, and
for so few documents, runtime is not an issue.
The length of the documents doesn't matter, only the size of the vocabulary.
Also, hyperparameter choices might need to be different for Gibbs
sampling vs. variational inference.
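
To make the Gibbs sampling suggestion concrete, here is a minimal sketch of
collapsed Gibbs sampling for LDA in pure Python. It is not scikit-learn's
implementation (which uses variational Bayes) and not the code of any
package mentioned in this thread. The function name `gibbs_lda`, its
arguments, and the toy corpus are made up for illustration; `alpha` and
`eta` are assumed to play the roles of `doc_topic_prior` and
`topic_word_prior`.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01,
              n_iter=200, seed=1):
    """Collapsed Gibbs sampling for LDA (illustrative sketch only).

    docs: list of documents, each a list of integer word ids
          in range(vocab_size).
    Returns (doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta)
                    / (topic_total[k] + vocab_size * eta)
                    for k in range(n_topics)
                ]

                # Resample the topic and put the token back.
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


# Toy corpus of word ids (made up): two "pure" documents and one mixed one.
docs = [[0, 1, 0, 1, 0, 1], [2, 3, 2, 3, 2, 3], [0, 1, 1, 0, 2, 3]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4)
for row in doc_topic:
    print(row)
```

Each sweep visits every token once, so this toy version is only practical
for small inputs. For real corpora, an existing Gibbs-based package is the
better choice; as far as I know, the lda package linked as [2] further down
in this thread uses collapsed Gibbs sampling.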

On 09/18/2017 12:26 PM, Markus Konrad wrote:
> Hi Chyi-Kwei,
>
> thanks for digging into this. I made similar observations with Gensim
> when using only a small number of (big) documents. Gensim also uses the
> Online Variational Bayes approach (Hoffman et al.). So could it be that
> the Hoffman et al. method is problematic in such scenarios? I found that
> Gibbs sampling based implementations provide much more informative
> topics in this case.
>
> If this were the case, then if I sliced the documents in some way (say,
> every N paragraphs become a "document"), I should get better results
> with scikit-learn and Gensim, right? I think I'll try this out tomorrow.
>
> Best,
> Markus
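
The slicing idea above could be sketched roughly as follows. The helper
name `split_into_chunks`, the default of 20 paragraphs per chunk, and the
assumption that paragraphs are separated by blank lines are all
illustrative choices, not something taken from the thread.

```python
import re


def split_into_chunks(text, paragraphs_per_chunk=20):
    """Split raw text into pseudo-documents of N paragraphs each."""
    # Treat one or more blank lines as a paragraph separator.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + paragraphs_per_chunk])
        for i in range(0, len(paragraphs), paragraphs_per_chunk)
    ]


# Tiny made-up example: five one-word "paragraphs", two per chunk.
sample = "one\n\ntwo\n\nthree\n\nfour\n\nfive"
chunks = split_into_chunks(sample, paragraphs_per_chunk=2)
print(len(chunks))
```

With the Gutenberg texts, `data_samples` would then be the concatenation of
the chunk lists for all file ids. Note that `min_df` and `max_df` in
`CountVectorizer` are counted over documents, so the resulting vocabulary
changes once the documents are chunks rather than whole books.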
>
>
>
>> Date: Sun, 17 Sep 2017 23:52:51 +0000
>> From: chyi-kwei yau <chyikwei.yau at gmail.com>
>> To: Scikit-learn mailing list <scikit-learn at python.org>
>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find
>> 	topics in NLTK Gutenberg corpus?
>>
>> Hi Markus,
>>
>> I tried your code and found that the issue might be that there are only
>> 18 docs in the Gutenberg corpus.
>> If you print out the transformed doc topic distribution, you will see that
>> a lot of topics are not used.
>> And since there are no words assigned to those topics, their weights will be
>> equal to the `topic_word_prior` parameter.
>>
>> You can print out the transformed doc topic distributions like this:
>> -------------
>> >>> import numpy as np
>> >>> doc_distr = lda.fit_transform(tf)
>> >>> for d in doc_distr:
>> ...     print(np.where(d > 0.001)[0])
>> ...
>> [17 27]
>> [17 27]
>> [17 27 28]
>> [14]
>> [ 2  4 28]
>> [ 2  4 15 21 27 28]
>> [1]
>> [ 1  2 17 21 27 28]
>> [ 2 15 17 22 28]
>> [ 2 17 21 22 27 28]
>> [ 2 15 17 28]
>> [ 2 17 21 27 28]
>> [ 2 14 15 17 21 22 27 28]
>> [15 22]
>> [ 8 11]
>> [8]
>> [ 8 24]
>> [ 2 14 15 22]
>>
>> and my full test scripts are here:
>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
>>
>> Best,
>> Chyi-Kwei
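
The same diagnosis can also be made from the topic side. The sketch below
flags topics whose word weights all still sit at the prior, assuming (as
described above) that unused topics keep weights equal or very close to
`topic_word_prior`. The helper name `unused_topics`, the tolerance, and the
toy matrix are made up for illustration; with a fitted model one would pass
`lda.components_` instead.

```python
import math


def unused_topics(components, topic_word_prior, rel_tol=1e-3):
    """Return indices of topics whose word weights all equal the prior."""
    return [
        k for k, row in enumerate(components)
        if all(math.isclose(float(w), topic_word_prior, rel_tol=rel_tol)
               for w in row)
    ]


# Toy stand-in for lda.components_ (values made up): topics 0 and 2 are empty.
toy_components = [
    [0.01, 0.01, 0.01],
    [1081.61, 866.01, 0.01],
    [0.01, 0.01, 0.01],
]
empty = unused_topics(toy_components, topic_word_prior=0.01)
print(empty)
```

If most of the requested topics show up in this list, the model has
effectively used far fewer topics than `n_components`.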
>>
>>
>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.konrad at wzb.eu> wrote:
>>
>>> Hi there,
>>>
>>> I'm trying out sklearn's latent Dirichlet allocation implementation for
>>> topic modeling. The code from the official example [1] works just fine and
>>> the extracted topics look reasonable. However, when I try other corpora,
>>> for example the Gutenberg corpus from NLTK, most of the extracted topics
>>> are garbage. See this example output, when trying to get 30 topics:
>>>
>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01)
>>> fatiguing (0.01)
>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane
>>> (301.83)
>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01)
>>> fatiguing (0.01)
>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother
>>> (55.27)
>>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles
>>> (166.21)
>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01)
>>> fatiguing (0.01)
>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01)
>>> fatiguing (0.01)
>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01)
>>> fatiguing (0.01)
>>> ...
>>>
>>> Many topics tend to have the same weights, all equal to the
>>> `topic_word_prior` parameter.
>>>
>>> This is my script:
>>>
>>> import nltk
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.decomposition import LatentDirichletAllocation
>>>
>>> def print_top_words(model, feature_names, n_top_words):
>>>     for topic_idx, topic in enumerate(model.components_):
>>>         message = "Topic #%d: " % topic_idx
>>>         message += " ".join(
>>>             [feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
>>>              for i in topic.argsort()[:-n_top_words - 1:-1]])
>>>         print(message)
>>>
>>>
>>> data_samples = [nltk.corpus.gutenberg.raw(f_id)
>>>                 for f_id in nltk.corpus.gutenberg.fileids()]
>>>
>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
>>>                                  stop_words='english')
>>> tf = tf_vectorizer.fit_transform(data_samples)
>>>
>>> lda = LatentDirichletAllocation(n_components=30,
>>>                                  learning_method='batch',
>>>                                  n_jobs=-1,  # all CPUs
>>>                                  verbose=1,
>>>                                  evaluate_every=10,
>>>                                  max_iter=1000,
>>>                                  doc_topic_prior=0.1,
>>>                                  topic_word_prior=0.01,
>>>                                  random_state=1)
>>>
>>> lda.fit(tf)
>>> tf_feature_names = tf_vectorizer.get_feature_names()
>>> print_top_words(lda, tf_feature_names, 5)
>>>
>>> Is there a problem in how I set up the LatentDirichletAllocation instance
>>> or pass the data? I tried out different parameter settings, but none of
>>> them provided good results for that corpus. I also tried out alternative
>>> implementations (like the lda package [2]) and those were able to find
>>> reasonable topics.
>>>
>>> Best,
>>> Markus
>>>
>>>
>>> [1]
>>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
>>> [2] http://pythonhosted.org/lda/
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


