[scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

Markus Konrad markus.konrad at wzb.eu
Wed Sep 20 03:18:40 EDT 2017


I tried it with 12 topics (that's the number that maximized the log
likelihood) and there were also some very general topics. But unlike
sklearn's implementation, the Gibbs sampling didn't produce "empty
topics" (those with all weights equal to `topic_word_prior`). This is
what puzzled me.
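
As a side note, here is a minimal sketch of spotting such empty topics in
the fitted sklearn model (assuming `lda` is the fitted
LatentDirichletAllocation instance from the script quoted further below):

import numpy as np

# An "empty" topic is a row of components_ whose weights all stayed at
# the topic_word_prior pseudo-count, i.e. no words were assigned to it.
topic_word_prior = 0.01  # the value passed to LatentDirichletAllocation
empty_topics = [k for k, row in enumerate(lda.components_)
                if np.allclose(row, topic_word_prior)]
print("empty topics:", empty_topics)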

It isn't actually "little" data; the documents themselves are quite big.
But I think that this is where my thinking went wrong initially. I
thought that if 18 big documents cover a certain set of topics, then
splitting these documents into more, but smaller, documents should lead
to a similar set of topics being discovered. But you're right, the split
corpus contains more information. Taken to an extreme: if I had only one
document, it wouldn't be possible to find the topics in there with LDA.
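
For illustration, a minimal sketch of such a split (assuming paragraphs
in the raw texts are separated by blank lines, which may not hold for
every file; `split_into_chunks` is just an illustrative helper):

import nltk

def split_into_chunks(raw, paras_per_chunk=5):
    # naive paragraph split on blank lines, re-joined into pseudo-documents
    paragraphs = [p for p in raw.split('\n\n') if p.strip()]
    return ['\n\n'.join(paragraphs[i:i + paras_per_chunk])
            for i in range(0, len(paragraphs), paras_per_chunk)]

data_samples = [chunk
                for f_id in nltk.corpus.gutenberg.fileids()
                for chunk in split_into_chunks(nltk.corpus.gutenberg.raw(f_id))]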

Best,
Markus



On 09/19/2017 06:07 PM, Andreas Mueller wrote:
> I'm actually surprised the Gibbs sampling gave useful results with so
> little data.
> And splitting the documents results in very different data. It has a lot
> more information.
> How many topics did you use?
> 
> Also: PR for docs welcome!
> 
> On 09/19/2017 04:26 AM, Markus Konrad wrote:
>> This is indeed interesting. I didn't know that there are such big
>> differences between these approaches. I split the 18 documents into
>> sub-documents of 5 paragraphs each, so that I got around 10k of these
>> sub-documents. Now, scikit-learn and Gensim deliver much better results,
>> quite similar to those from a Gibbs sampling based implementation. So it
>> was basically the same data, just split in a different way.
>>
>> I think the disadvantages/limits of the Variational Bayes approach
>> should be mentioned in the documentation.
>>
>> Best,
>> Markus
>>
>>
>>
>> On 09/18/2017 06:59 PM, Andreas Mueller wrote:
>>> For very few documents, Gibbs sampling is likely to work better - or
>>> rather, Gibbs sampling usually works better given enough runtime, and
>>> for so few documents, runtime is not an issue.
>>> The length of the documents doesn't matter, only the size of the
>>> vocabulary does.
>>> Also, hyperparameter choices might need to be different for Gibbs
>>> sampling vs variational inference.
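>>>
>>> (For comparison, a minimal sketch of fitting the Gibbs-sampling-based
>>> lda package [2] on the same count matrix, with priors roughly mirroring
>>> the sklearn settings below; the alpha/eta values are assumptions, and
>>> tf / tf_feature_names come from the CountVectorizer in the quoted
>>> script:)
>>>
>>> import lda  # the Gibbs sampling implementation from [2]
>>>
>>> gibbs = lda.LDA(n_topics=30, n_iter=1500,
>>>                 alpha=0.1,   # doc-topic prior, mirroring doc_topic_prior
>>>                 eta=0.01,    # topic-word prior, mirroring topic_word_prior
>>>                 random_state=1)
>>> gibbs.fit(tf)  # the count matrix; try tf.toarray() if sparse input fails
>>> for k, dist in enumerate(gibbs.topic_word_):
>>>     top = dist.argsort()[:-6:-1]  # five highest-weight terms
>>>     print("Topic #%d: %s" % (k, " ".join(tf_feature_names[i] for i in top)))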
>>>
>>> On 09/18/2017 12:26 PM, Markus Konrad wrote:
>>>> Hi Chyi-Kwei,
>>>>
>>>> thanks for digging into this. I made similar observations with Gensim
>>>> when using only a small number of (big) documents. Gensim also uses the
>>>> Online Variational Bayes approach (Hoffman et al.). So could it be that
>>>> the Hoffman et al. method is problematic in such scenarios? I found
>>>> that Gibbs sampling based implementations provide much more informative
>>>> topics in this case.
>>>>
>>>> If this were the case, then slicing the documents in some way (say,
>>>> every N paragraphs become a "document") should give better results
>>>> with scikit-learn and Gensim, right? I think I'll try this out
>>>> tomorrow.
>>>>
>>>> Best,
>>>> Markus
>>>>
>>>>
>>>>
>>>>> Date: Sun, 17 Sep 2017 23:52:51 +0000
>>>>> From: chyi-kwei yau <chyikwei.yau at gmail.com>
>>>>> To: Scikit-learn mailing list <scikit-learn at python.org>
>>>>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find
>>>>>      topics in NLTK Gutenberg corpus?
>>>>> Message-ID:
>>>>>      <CAK-jh0Ygd8fSdJom+gdDOHvAYCPuJVHHX77qcd+d4_xm6vi9yA at mail.gmail.com>
>>>>>
>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>
>>>>> Hi Markus,
>>>>>
>>>>> I tried your code and found that the issue might be that there are
>>>>> only 18 docs in the Gutenberg corpus.
>>>>> If you print out the transformed doc-topic distributions, you will
>>>>> see that a lot of topics are not used.
>>>>> And since there are no words assigned to those topics, their weights
>>>>> will be equal to the `topic_word_prior` parameter.
>>>>>
>>>>> You can print out the transformed doc topic distributions like this:
>>>>> -------------
>>>>>>>> import numpy as np
>>>>>>>> doc_distr = lda.fit_transform(tf)
>>>>>>>> for d in doc_distr:
>>>>> ...     print(np.where(d > 0.001)[0])
>>>>> ...
>>>>> [17 27]
>>>>> [17 27]
>>>>> [17 27 28]
>>>>> [14]
>>>>> [ 2  4 28]
>>>>> [ 2  4 15 21 27 28]
>>>>> [1]
>>>>> [ 1  2 17 21 27 28]
>>>>> [ 2 15 17 22 28]
>>>>> [ 2 17 21 22 27 28]
>>>>> [ 2 15 17 28]
>>>>> [ 2 17 21 27 28]
>>>>> [ 2 14 15 17 21 22 27 28]
>>>>> [15 22]
>>>>> [ 8 11]
>>>>> [8]
>>>>> [ 8 24]
>>>>> [ 2 14 15 22]
>>>>>
>>>>> and my full test scripts are here:
>>>>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
>>>>>
>>>>> Best,
>>>>> Chyi-Kwei
>>>>>
>>>>>
>>>>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad <markus.konrad at wzb.eu>
>>>>> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I'm trying out sklearn's latent Dirichlet allocation implementation
>>>>>> for topic modeling. The code from the official example [1] works just
>>>>>> fine and the extracted topics look reasonable. However, when I try
>>>>>> other corpora, for example the Gutenberg corpus from NLTK, most of
>>>>>> the extracted topics are garbage. See this example output, when
>>>>>> trying to get 30 topics:
>>>>>>
>>>>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
>>>>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
>>>>>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
>>>>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
>>>>>> ...
>>>>>>
>>>>>> Many topics tend to have the same weights, all equal to the
>>>>>> `topic_word_prior` parameter.
>>>>>>
>>>>>> This is my script:
>>>>>>
>>>>>> import nltk
>>>>>> from sklearn.feature_extraction.text import CountVectorizer
>>>>>> from sklearn.decomposition import LatentDirichletAllocation
>>>>>>
>>>>>> def print_top_words(model, feature_names, n_top_words):
>>>>>>     # print each topic's n_top_words highest-weight terms with weights
>>>>>>     for topic_idx, topic in enumerate(model.components_):
>>>>>>         message = "Topic #%d: " % topic_idx
>>>>>>         message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
>>>>>>                              for i in topic.argsort()[:-n_top_words - 1:-1]])
>>>>>>         print(message)
>>>>>>
>>>>>> data_samples = [nltk.corpus.gutenberg.raw(f_id)
>>>>>>                 for f_id in nltk.corpus.gutenberg.fileids()]
>>>>>>
>>>>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
>>>>>>                                 stop_words='english')
>>>>>> tf = tf_vectorizer.fit_transform(data_samples)
>>>>>>
>>>>>> lda = LatentDirichletAllocation(n_components=30,
>>>>>>                                 learning_method='batch',
>>>>>>                                 n_jobs=-1,  # all CPUs
>>>>>>                                 verbose=1,
>>>>>>                                 evaluate_every=10,
>>>>>>                                 max_iter=1000,
>>>>>>                                 doc_topic_prior=0.1,
>>>>>>                                 topic_word_prior=0.01,
>>>>>>                                 random_state=1)
>>>>>>
>>>>>> lda.fit(tf)
>>>>>> tf_feature_names = tf_vectorizer.get_feature_names()
>>>>>> print_top_words(lda, tf_feature_names, 5)
>>>>>>
>>>>>> Is there a problem in how I set up the LatentDirichletAllocation
>>>>>> instance or pass the data? I tried out different parameter settings,
>>>>>> but none of them provided good results for that corpus. I also tried
>>>>>> out alternative implementations (like the lda package [2]) and those
>>>>>> were able to find reasonable topics.
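>>>>>>
>>>>>> (A minimal sketch of one way to compare settings, using sklearn's
>>>>>> built-in approximate perplexity; the candidate topic counts here are
>>>>>> arbitrary, and lower perplexity is better:)
>>>>>>
>>>>>> for n in (10, 12, 20, 30):
>>>>>>     m = LatentDirichletAllocation(n_components=n,
>>>>>>                                   learning_method='batch',
>>>>>>                                   max_iter=500, random_state=1)
>>>>>>     m.fit(tf)
>>>>>>     print(n, m.perplexity(tf))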
>>>>>>
>>>>>> Best,
>>>>>> Markus
>>>>>>
>>>>>>
>>>>>> [1]
>>>>>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
>>>>>>
>>>>>>
>>>>>> [2] http://pythonhosted.org/lda/