[scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

Andreas Mueller t3kcit at gmail.com
Tue Sep 19 12:07:51 EDT 2017


I'm actually surprised that Gibbs sampling gave useful results with so
little data.
And splitting the documents results in very different data: it carries a
lot more information.
How many topics did you use?
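
Something like this (an untested sketch; split_into_subdocs is just an
illustrative helper, and it assumes paragraphs are separated by blank
lines) reproduces the kind of split you describe:

import re
import nltk

def split_into_subdocs(raw_text, paras_per_doc=5):
    # split on blank lines and drop empty paragraphs
    paras = [p for p in re.split(r'\n\s*\n', raw_text) if p.strip()]
    # join every paras_per_doc consecutive paragraphs into one sub-document
    return ['\n\n'.join(paras[i:i + paras_per_doc])
            for i in range(0, len(paras), paras_per_doc)]

data_samples = [chunk
                for f_id in nltk.corpus.gutenberg.fileids()
                for chunk in split_into_subdocs(nltk.corpus.gutenberg.raw(f_id))]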

Also: PR for docs welcome!

On 09/19/2017 04:26 AM, Markus Konrad wrote:
> This is indeed interesting. I didn't know there were such big
> differences between these approaches. I split the 18 documents into
> sub-documents of 5 paragraphs each, so that I got around 10k of these
> sub-documents. Now, scikit-learn and gensim deliver much better results,
> quite similar to those from a Gibbs sampling based implementation. So it
> was basically the same data, just split in a different way.
>
> I think the disadvantages/limits of the Variational Bayes approach
> should be mentioned in the documentation.
>
> Best,
> Markus
>
>
>
> On 09/18/2017 06:59 PM, Andreas Mueller wrote:
>> For very few documents, Gibbs sampling is likely to work better - or
>> rather, Gibbs sampling usually works better given enough runtime, and
>> for so few documents, runtime is not an issue.
>> The length of the documents doesn't matter, only the size of the
>> vocabulary.
>> Also, hyperparameter choices might need to be different for Gibbs
>> sampling vs. variational inference.
>>
>> On 09/18/2017 12:26 PM, Markus Konrad wrote:
>>> Hi Chyi-Kwei,
>>>
>>> Thanks for digging into this. I made similar observations with Gensim
>>> when using only a small number of (big) documents. Gensim also uses the
>>> Online Variational Bayes approach (Hoffman et al.). So could it be that
>>> the Hoffman et al. method is problematic in such scenarios? I found that
>>> Gibbs sampling based implementations provide much more informative
>>> topics in this case.
>>>
>>> If this is the case, then slicing the documents in some way (say,
>>> every N paragraphs become a "document") should give better results
>>> with scikit-learn and Gensim, right? I think I'll try this out tomorrow.
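>>>
>>> For reference, this is roughly how I ran the Gibbs-based lda package
>>> from my original post on the same term matrix (an untested sketch from
>>> memory of its API):
>>>
>>> import lda  # collapsed Gibbs sampling LDA implementation
>>> model = lda.LDA(n_topics=30, n_iter=1500, random_state=1)
>>> model.fit(tf)  # same sparse count matrix as passed to scikit-learn
>>> topic_word = model.topic_word_  # one row of word weights per topic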
>>>
>>> Best,
>>> Markus
>>>
>>>
>>>
>>>> Date: Sun, 17 Sep 2017 23:52:51 +0000
>>>> From: chyi-kwei yau <chyikwei.yau at gmail.com>
>>>> To: Scikit-learn mailing list <scikit-learn at python.org>
>>>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find
>>>>      topics in NLTK Gutenberg corpus?
>>>>
>>>> Hi Markus,
>>>>
>>>> I tried your code, and I think the issue is that there are only 18
>>>> docs in the Gutenberg corpus.
>>>> If you print out the transformed doc-topic distributions, you will
>>>> see that a lot of topics are not used.
>>>> And since no words are assigned to those topics, their weights will
>>>> be equal to the `topic_word_prior` parameter.
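>>>>
>>>> A quick, untested check on the topic-word side: a topic is "unused"
>>>> if its row in `components_` never moved away from the prior (0.01
>>>> here).
>>>>
>>>> import numpy as np
>>>> # indices of topics whose word weights all stayed at topic_word_prior
>>>> unused = [k for k, row in enumerate(lda.components_)
>>>>           if np.allclose(row, 0.01)]
>>>> print(len(unused), "of", len(lda.components_), "topics unused")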
>>>>
>>>> You can print out the transformed doc topic distributions like this:
>>>> -------------
>>>>>>> import numpy as np
>>>>>>> doc_distr = lda.fit_transform(tf)
>>>>>>> for d in doc_distr:
>>>> ...     print(np.where(d > 0.001)[0])
>>>> ...
>>>> [17 27]
>>>> [17 27]
>>>> [17 27 28]
>>>> [14]
>>>> [ 2  4 28]
>>>> [ 2  4 15 21 27 28]
>>>> [1]
>>>> [ 1  2 17 21 27 28]
>>>> [ 2 15 17 22 28]
>>>> [ 2 17 21 22 27 28]
>>>> [ 2 15 17 28]
>>>> [ 2 17 21 27 28]
>>>> [ 2 14 15 17 21 22 27 28]
>>>> [15 22]
>>>> [ 8 11]
>>>> [8]
>>>> [ 8 24]
>>>> [ 2 14 15 22]
>>>>
>>>> and my full test scripts are here:
>>>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
>>>>
>>>> Best,
>>>> Chyi-Kwei
>>>>
>>>>
>>>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.konrad at wzb.eu>
>>>> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I'm trying out sklearn's latent Dirichlet allocation implementation
>>>>> for topic modeling. The code from the official example [1] works just
>>>>> fine and the extracted topics look reasonable. However, when I try
>>>>> other corpora, for example the Gutenberg corpus from NLTK, most of the
>>>>> extracted topics are garbage. See this example output, when trying to
>>>>> get 30 topics:
>>>>>
>>>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
>>>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
>>>>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
>>>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>> ...
>>>>>
>>>>> Many topics tend to have the same weights, all equal to the
>>>>> `topic_word_prior` parameter.
>>>>>
>>>>> This is my script:
>>>>>
>>>>> import nltk
>>>>> from sklearn.feature_extraction.text import CountVectorizer
>>>>> from sklearn.decomposition import LatentDirichletAllocation
>>>>>
>>>>> def print_top_words(model, feature_names, n_top_words):
>>>>>     for topic_idx, topic in enumerate(model.components_):
>>>>>         message = "Topic #%d: " % topic_idx
>>>>>         message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
>>>>>                              for i in topic.argsort()[:-n_top_words - 1:-1]])
>>>>>         print(message)
>>>>>
>>>>>
>>>>> data_samples = [nltk.corpus.gutenberg.raw(f_id)
>>>>>                  for f_id in nltk.corpus.gutenberg.fileids()]
>>>>>
>>>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
>>>>>                                   stop_words='english')
>>>>> tf = tf_vectorizer.fit_transform(data_samples)
>>>>>
>>>>> lda = LatentDirichletAllocation(n_components=30,
>>>>>                                   learning_method='batch',
>>>>>                                   n_jobs=-1,  # all CPUs
>>>>>                                   verbose=1,
>>>>>                                   evaluate_every=10,
>>>>>                                   max_iter=1000,
>>>>>                                   doc_topic_prior=0.1,
>>>>>                                   topic_word_prior=0.01,
>>>>>                                   random_state=1)
>>>>>
>>>>> lda.fit(tf)
>>>>> tf_feature_names = tf_vectorizer.get_feature_names()
>>>>> print_top_words(lda, tf_feature_names, 5)
>>>>>
>>>>> Is there a problem with how I set up the LatentDirichletAllocation
>>>>> instance or pass the data? I tried out different parameter settings,
>>>>> but none of them provided good results for that corpus. I also tried
>>>>> out alternative implementations (like the lda package [2]) and those
>>>>> were able to find reasonable topics.
>>>>>
>>>>> Best,
>>>>> Markus
>>>>>
>>>>>
>>>>> [1]
>>>>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
>>>>>
>>>>> [2] http://pythonhosted.org/lda/