LDA (Latent Dirichlet Allocation) was first introduced by Blei et al. (2003), and has become the iconic topic model in NLP. LDA is a generative, probabilistic, unsupervised, and bag-of-words model for discrete data, such as text.
Input into the model is a term-document matrix, similar to tf-idf (Salton and McGill, 1983). LSI (latent semantic indexing) (Deerwester et al., 1990) uses a SVD of the term-document matrix to reduce the dimensionality and capture the tf-idf combinations that holds most of the variance.
Worth noting is that the LDA model can be applied to other structures than simply words. N-grams or sentences could be used. In fact it could be used for non-text data as well.
\(M\) number of documents, denoted by \(\mathcal{D} = { w_1, w_2, \dots, w_M }\).
\(N\) number of words in each document denoted by \(w = ( w_1, w_2, \dots, w_N )\), \(w_n\) is the n:th word. Document i has \(N_i\) words.
A word is a vector with the same length as the vocabulary indexed by \(1, \dots, V\). The v:th word in the vocabulary is represented by a word vector \(w\) such that the v:th element, \(w^v=1\), and \(w^u=0\) for \(u \neq v\).
\(K\) it the number of topics.
\(\alpha\) is the parameter of the Dirichlet prior on the per-document topic distribution, typically around \(0.1\).
\(\beta\) is the parameter of the Dirichlet prior on the per-topic word distribution, typically around \(0.01\).
\(\theta_i\) is the topic distribution for document \(i\).
\(\phi_k\) is the word distribution for topic \(k\).
\(z_{ij}\) is the topic for the j:th word in document \(i\).
Depending on what paper you read the notation is a bit different, but it is useful to get an understanding of all the variables, and parameters in the model.
A probability vector, \(p=(p_1, p_2, \dots , p_S)\), is chosen such that,
\(f(p) = p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \dots p_M^{\alpha_M - 1}\),
is the density. Note that if \(M=2\) we get the Beta-distribution, and if \(\alpha_i = 1 \forall i\) then we get a uniform distribution.
Further, the Dirichlet distribution is conjugate prior to the Categorical distribution.
If \((p_1, ..., p_M) \sim Dir(\alpha_1, ..., \alpha_M)\) and then \(X_1, ..., X_n \in {1, ..., M}\) are chosen independently according to \(Cat(p_1, ..., p_M)\), then,
\begin{equation} p|X \sim Dir(\alpha_1 + n_1, …, \alpha_M + n_M) \end{equation}
LDA assumes the following generative process for each of the \(M\) documents, \(w_i\), with length \(N_i\), in a corpus \(\mathcal{D}\),
Choose \(\theta_i = \left( \theta_i(1), ..., \theta_i(K) \right) \sim Dir(\alpha, ...,\alpha)\) independently. Where \(i = {1, ..., M}\), and the Dirichlet distribution has a symmetric \(\alpha\)’s, meaning \(\alpha_1 = \alpha_2 = ... = \alpha_K\).
For each topic choose \(\phi_k = \left( \phi_k(1), ..., \phi_k(V) \right) \sim Dir(\beta, ..., \beta)\) independently. Where \(k={1, ..., K}\), and \(\beta_1 = \beta_2 = ... = \beta_V\).
For each of the word positions i, j, where \(i \in {1, ..., M}\) and \(j \in {1, ..., N_i}\)
Thus, each word is chosen by first picking a topic according to the topic distribution of the document, and then picking a word according to the word distribution of that topic.
Note that \(N_i\), the lengths of the document, are treated as independent of \(w\) and \(z\).
The generative process of the LDA model is often visualized with plate notation,
<img src=”/images/lda_plate.png” width=”500”, align=”middle”>
(Note that the subscript of \(N_i\) is dropped in the plate notation above).
The generative process is a nice way to explain the LDA model, but in practice the model is not used to generate document. Instead, we observe the documents and want to sample from the posterior distribution,
\[Z, \theta, \phi | M\]We cannot compute the conditional probabilities, however we can approximate a sampling according to the distribution, which is usually done with Gibbs sampling, which is a Markov chain Monte Carlo (MCMC) algorithm that is used to approximate a sampling process from a multivariate distribution when direct sampling is not desirable.
The reason for why we cannot directly sample from the distribution is because,
\begin{equation} P(z, \theta, \phi | w) = \frac{P(z, w, \theta, \phi)}{P(w)}, \end{equation}
but the denominator, the probability of observing the words we see, is of too many dimensions and impossible to compute. Instead, we can compare the ratios,
\[\frac{P(z_1, \theta_1, \phi_1 | w)}{P(z_2, \theta_2, \phi_2 | w)} = \frac{P(z_1, \theta_1, \phi_1, w)}{P(z_2, \theta_2, \phi_2, w)}\]Gibbs sampling can be used when the joint distribution is unknown or cannot be sampled from, but the conditional marginal distribution of the variables are known and can be sampled from.
We can look at an example, say that we have two random variables, \(A\) and \(B\), and that they both can take on the values either \(0\) or \(1\). Thus, the outcome space is \((0,0), (0,1), (1,0), (1,1)\). However, if we do not know the probabilities for these four events, but we know the conditional probabilities, i.e., \(P(A|B=0)\), \(P(A|B=1)\), \(P(B|A=0)\), and \(P(B|A=1)\), then we can simulate a sampling from the joint distribution. Thus, for this little example the Gibbs sampling would look like:
Sampling \(A_0\) and \(B_0\) from some binary distribution.
Sample \(A_1 \sim P(A \vert B_0)\) and \(B_1 \sim P(B \vert A_1)\).
Continue like this until you have created a large enough sample.
Note that we, in each iteration, use the previous sampled value for \(B\) in order to create \(A\). Running this simulation for enough iterations we would have a good estimation of the joint probability distribution. In fact the algorithm asymptotically converges to the true distribution. We say that the stationary distribution of the MC is equal to the true distribution.
In practice, Collapsed Gibbs sampling is often used since it’s more memory efficient. The collapsed Gibbs sampler integrates out (collapses) one or more variables when sampling another. E.g., consider three r.v. \(A\), \(B\), and \(C\), the Gibbs sampler would estimate the conditional distribution for \(A\) with estimating \(P(A | B, C)\), whereas the collapsed version would use, e.g., \(P(A | B)\), and thus integrating out \(C\). Collapsed Gibbs sampling works well it the variable that is integrated out is the conjugate prior to the one being sampled.
Sample from \(P(z|w)\).
\[P(z|w) \propto P(z, w) = \mathrm{E}[P(z, w | \theta, \phi)]\] \[= \mathrm{E}[P(w | \theta, \phi, z)] \mathrm{E}[P(z | \theta, \phi)]\] \[= \mathrm{E}[P(w | \phi, z)] \mathrm{E}[P(z | \theta)]\](Since, e.g., \(w\) is not dependent on \(\theta\)).
At this stage one has to look at the higher moments of the Dirichlet distribution.
In practice, for a word \((i, j)\) we erase its topic, and given all other topics for words, we assign a topic to word \((i, j)\), and then update \(z_{i, j}\).
\[P(z_i = j | z_{-i}, w_i, d_i) \propto \frac{C_{w_i j}^{WT} + \beta}{\sum_{w=1}^W C_{wj}^{WT} + W \beta} \frac{C_{d_i j}^{DT}+\alpha}{\sum_{d_i t}^{DT} + T \alpha}\]where \(C^{WT}\) and \(C^{DT}\) are count matrices of the number of times word \(w\) it assigned to topic \(j\), and the number of times topic \(j\) is assigned to a token in document, \(d\). \(W\) is the number of words in the vocabulary, and \(T\) is the number of topics. Further the \(\phi\) and \(\theta\) can be estimated by,
\[\phi_i^j = \frac{C_{ij}^{WT} + \beta}{\sum_{k=1}^{W} C_{kj}^{WT} + W \beta}\] \[\theta_j^d = \frac{C_{dj}^{DT} + \alpha}{\sum_{k=1}^T C_{dk}^{DT} + T \alpha}\]a good explaination can be found in Steyvers and Griffiths (2007).
import numpy as np
import pandas as pd
def count_matrices(M, K):
'''
Word-topic count matrix, C_wt, is the number of times word, w, is assigned
to topic j.
Document-topic count matrix, C_dt, it the number of times topic j is
assigned to some word token in document d.
Args:
K: int, number of topics
M: numpy array, term-document matrix
Returs:
cwt: Word-topic count matrix
cdt: Document-topic count matrix
'''
D = np.shape(M)[0]
V = np.shape(M)[1]
shape_cwt = (V, K)
shape_cdt = (D, K)
cwt = np.zeros(shape_cwt)
cdt = np.zeros(shape_cdt)
assigned_topic = np.zeros(int(np.sum(M))) # length equals all total words
w = 0
for i in range(D): # for each document
for j in range(len(M[i,:])): # for each word
for s in range(int(M[i, j])): # for each token
z = int(np.random.randint(K, size=1)) # rand. topic to token
cwt[j, z] += 1 # add 1 to topic count for word
cdt[i, z] += 1 # add 1 to topic count for document
assigned_topic[w] = z
w += 1
return cwt, cdt, assigned_topic
def lda_c_gibbs(M, K, alpha, beta, iterations):
'''
LDA with collapsed Gibbs sampling.
'''
cwt, cdt, at = count_matrices(M, K)
D = np.shape(M)[0]
V = np.shape(M)[1]
for n in range(iterations):
print('Iteration {}'.format(n) + 'of {}'.format(iterations) + '.')
w = 0
for i in range(D):
for j in range(len(M[i, :])):
for s in range(int(M[i, j])):
z_p = int(at[w]) # previously assigned topic
cwt[j, z_p] = cwt[j, z_p] - 1 # remove the assigned topic
cdt[i, z_p] = cdt[i, z_p] - 1 # remove count of topic
n_1 = cwt[j, ] + beta
n_2 = cdt[i, ] + alpha
d_1 = np.sum(cwt, axis=0) + V * beta
d_2 = np.sum(cdt[i, ], axis=0) + K * alpha
p_z = n_1 / d_1 * n_2 / d_2
p_z = p_z / np.sum(p_z)
z = np.random.choice(range(K), p=p_z)
cwt[j, z] += 1
cdt[i, z] += 1
at[w] = z
w += 1
numerator_theta = cdt + alpha
denominator_theta = np.sum(cdt, axis=1) + K * alpha
denominator_theta = np.reshape(denominator_theta, (D, 1))
theta = numerator_theta / denominator_theta
numerator_phi = cwt + beta
denominator_phi = np.sum(cwt, axis=0) + V * beta
phi = numerator_phi / denominator_phi
return theta, phi
def extract_topics(phi, vocab_key, n_words):
V = np.shape(phi)[0]
K = np.shape(phi)[1]
col_names = []
for i in range(K):
col_names.append('topic_{}'.format(i))
row_names = []
for i in range(V):
row_names.append(vocab_key[i])
df = pd.DataFrame(phi, columns=col_names, index=row_names)
for i in range(K):
print('Topic {}:'.format(i))
print(df['topic_{}'.format(i)].sort_values(ascending=False)[0:n_words])
print('\n')
return df
When running the code it is slow since it is only using one core. Although collapsed Gibbs sampling is a sequential algorithm, there are working implementations using multiprocessing.
Using the Amazon book review corpus, we can train the model, and see what results we end up with. Starting by defining some functions for loading and processing the data.
import spacy
import numpy as np
from collections import Counter
def load_corpus(d_data, encoding, p_docs=1):
'''
Args:
d_data: string with location to the .txt file
.txt: file where each line is a document
encoding: string with encoding of .txt file
p_docs: float, percentage of documents collected
Returns:
content: list of strings, each string is a document
'''
f = open(d_data, 'r', encoding=encoding)
# Read content line by line
if f.mode == 'r':
content = f.readlines()
# Subset of documents
if p_docs != 1:
n_docs = int(len(content) * p_docs)
content = content[0:n_docs]
return content
def pre_process(corpus):
'''
Args:
corpus: list of strings, each string it one document
Returns:
processed_corpus: list of lists
'''
# Load model. Installed w/ 'python -m spacy download en_core_web_sm'.
nlp = spacy.load('en_core_web_sm')
# Simple pre processing
processed_corpus = corpus
for i in range(len(corpus)):
doc = corpus[i]
doc = nlp(doc)
doc = [token.lemma_ for token in doc if not token.is_stop and
token.text.isalpha() and not token.is_punct]
processed_corpus[i] = doc
return processed_corpus
def create_vocab(corpus):
'''
Args:
corpus: list of lists
Returns:
vocab_key: dictionary, dict[0]='ability' and dict['ability']=0
vocab: sorted list of all token
n: int, length of vocab
'''
flat_corpus = [token for doc in corpus for token in doc]
vocab = list(set(flat_corpus))
vocab.sort()
n = len(vocab)
vocab_key = {}
for i in range(len(vocab)):
vocab_key[vocab[i]] = i
vocab_key[i] = vocab[i]
return vocab_key, vocab, n
def term_doc_matrix(corpus, vocab_key, n):
'''
Args:
corpus: list of lists
vocab_key: vocabulary dictionary
n: length of vocabulary
Returns:
M: numpy array, term-document matrix. DxV (D rows, V columns)
'''
m = len(corpus)
shape = (m, n) # each row is a document
M = np.zeros(shape)
for i in range(m):
counts = Counter(corpus[i])
for j in range(n):
M[i, j] = counts[vocab_key[j]]
return M
Running the main file, using a sub set of 280 documents,
import common
import lda
import numpy as np
import random
import time
# Set seed
random.seed(42)
# Load data
d_data = '/home/magnus/Projects/ml_for_nlp/data/books.txt'
encoding = 'latin1'
p_docs = 0.001
#p_docs = 0.1
corpus = common.load_corpus(d_data, encoding, p_docs)
# Pre process
corpus = common.pre_process(corpus)
vocab_key, vocab, n = common.create_vocab(corpus)
M = common.term_doc_matrix(corpus, vocab_key, n)
print(len(corpus))
# LDA parameters
K = 5
alpha = 0.1
beta = 0.01
iterations = 30
# Train LDA
start_time = time.time()
cwt, cdt, at = lda.count_matrices(M, K)
theta, phi = lda.lda_c_gibbs(M, K, alpha, beta, iterations)
end_time = time.time()
work_time = end_time - start_time
# Results
df = lda.extract_topics(phi, vocab_key, 10)
print('The job took ' + str(work_time) + 'seconds to complete.')
Topic 0:
book 0.055679
read 0.017042
like 0.013674
story 0.012287
good 0.010900
thing 0.010107
people 0.009513
recommend 0.009116
find 0.009116
reader 0.008126
Name: topic_0, dtype: float64
Topic 1:
book 0.052121
great 0.023055
read 0.017292
know 0.008772
love 0.008271
work 0.008271
chapter 0.007770
long 0.007520
give 0.007269
mr 0.007018
Name: topic_1, dtype: float64
Topic 2:
novel 0.017466
read 0.013342
write 0.013100
world 0.013100
time 0.012857
book 0.012615
think 0.010432
story 0.010189
page 0.009947
good 0.009947
Name: topic_2, dtype: float64
Topic 3:
school 0.012282
book 0.011384
child 0.011384
year 0.010785
new 0.010186
john 0.010186
high 0.008089
help 0.007790
know 0.007790
man 0.007490
Name: topic_3, dtype: float64
Topic 4:
book 0.019200
way 0.018000
life 0.014400
work 0.013501
woman 0.011401
experience 0.010501
love 0.009301
character 0.007502
horse 0.007502
human 0.007202
Name: topic_4, dtype: float64
The job took 111.38684487342834seconds to complete.
Regarding the resluts,
A better pre processing seem to have been benefitial, in which one, e.g., drops words that are common in all documents.
A larger part of the coprus would yield better results.
A serious attempt at uncovering the underlying topics would involve hyperparameter optimization.