Latent Dirichlet Allocation (LDA) is a classical algorithm for probabilistic graphical models. It is a hierarchical Bayes model with discrete latent variables on a sparse doc/word graph. This example shows how LDA can be implemented in DGL, with the corpus represented as a bipartite multi-graph G. There is no back-propagation, because gradient descent is typically considered inefficient on the probability simplex. On the provided small-scale example on the 20 Newsgroups dataset, the DGL-LDA model runs 50% faster on GPU than the scikit-learn model without joblib parallelism.
- The corpus is generated by a hierarchical Bayes process: document (d) -> latent topic (z) -> word (w)
- All positions in the same document share a topic distribution θ_d ~ Dir(α)
- All positions with the same topic share a word distribution β_z ~ Dir(η)
- Words in the same document / topic are therefore correlated (a minimal sampling sketch follows this list)
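For concreteness, here is a minimal NumPy sketch of that generative process; the corpus sizes and hyperparameter values are made-up toy numbers, not part of the example code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics, doc_len = 5, 100, 3, 20  # toy sizes (assumed)
alpha, eta = 0.1, 0.01                              # Dirichlet hyperparameters (assumed)

# topic -> word distributions, beta_z ~ Dir(eta)
beta = rng.dirichlet(np.full(n_words, eta), size=n_topics)     # (n_topics, n_words)

corpus = []
for d in range(n_docs):
    theta_d = rng.dirichlet(np.full(n_topics, alpha))          # doc -> topic distribution
    z = rng.choice(n_topics, size=doc_len, p=theta_d)          # latent topic per position
    w = np.array([rng.choice(n_words, p=beta[k]) for k in z])  # word per position
    corpus.append(w)
```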
MAP
A simplified MAP model is just a non-conjugate model with an inner summation to integrate out the latent topic variable:

max_{θ,β} Σ_{(d,w)∈G} log Σ_z θ_dz * β_zw
The main complications are that θ_d / β_z are shared within the same document / topic and that the variables reside in a probability simplex. One way to work around this is expectation maximization (EM), sketched after the list below:
- The latent-topic posterior has an explicit form: ϕ_dwz ∝ θ_dz * β_zw
- E-step: compute summary statistics from the fractional topic memberships ϕ
- M-step: set θ_d, β_z proportional to the aggregated summary statistics
- With an explicit posterior, the bound is tight.
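Here is a minimal dense NumPy sketch of that EM loop, under the assumptions that `doc_word` is a document-word count matrix and that a dense (n_docs, n_topics, n_words) tensor fits in memory; it is an illustration, not the DGL implementation.

```python
import numpy as np

def lda_map_em(doc_word, n_topics=3, n_iter=50, seed=0):
    """EM for the simplified MAP model with explicit posterior phi_dwz ∝ theta_dz * beta_zw."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = doc_word.shape
    theta = rng.dirichlet(np.ones(n_topics), size=n_docs)    # (n_docs, n_topics)
    beta = rng.dirichlet(np.ones(n_words), size=n_topics)    # (n_topics, n_words)
    for _ in range(n_iter):
        # E-step: fractional topic membership for every (doc, word) pair
        phi = theta[:, :, None] * beta[None, :, :]           # (n_docs, n_topics, n_words)
        phi /= phi.sum(axis=1, keepdims=True) + 1e-12
        counts = phi * doc_word[:, None, :]                  # weight by observed word counts
        # M-step: set theta, beta proportional to the summary statistics
        theta = counts.sum(axis=2)
        theta /= theta.sum(axis=1, keepdims=True)
        beta = counts.sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True)
    return theta, beta
```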
Variational Bayes
A Bayesian model adds Dirichlet priors to θ_d and β_z. This makes the posterior implicit and the bound loose. We still use a mean-field independence assumption and cycle through the variational parameters similarly to coordinate ascent.
- The evidence lower bound is
  log p(w | α, η) ≥ L = E_q[log p(w, z, θ, β | α, η)] − E_q[log q(z, θ, β)]
- The ELBO objective function factors as
  L = Σ_{(d,w)∈G} ( E_q[log p(w | z_dw, β)] + E_q[log p(z_dw | θ_d)] − E_q[log q(z_dw)] )
      + Σ_d ( E_q[log p(θ_d | α)] − E_q[log q(θ_d)] )
      + Σ_z ( E_q[log p(β_z | η)] − E_q[log q(β_z)] )
  with mean-field factors q(z_dw) = Mult(ϕ_dw), q(θ_d) = Dir(γ_d), q(β_z) = Dir(λ_z)
- Similarly, optimization alternates between ϕ, γ, λ. Since θ, β are random, we use the closed-form expression for E[log X] under a Dirichlet distribution via the digamma function (see the sketch below).
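Concretely, E[log θ_dz] under q(θ_d) = Dir(γ_d) equals ψ(γ_dz) − ψ(Σ_z' γ_dz'). Below is a small PyTorch sketch of that identity and of the resulting ϕ / γ updates for a single toy document; the tensor shapes and the placeholder E[log β] values are assumptions for illustration, not the example's actual code.

```python
import torch

def dirichlet_expectation(gamma):
    """E[log X] for X ~ Dir(gamma), computed with the digamma function."""
    return torch.digamma(gamma) - torch.digamma(gamma.sum(-1, keepdim=True))

# toy setup: one document with 4 word occurrences and 3 topics
alpha = 0.1
gamma_d = torch.ones(3)                        # variational Dirichlet parameters for theta_d
Elog_beta = torch.randn(4, 3).log_softmax(-1)  # placeholder for E[log beta_zw] per occurrence

for _ in range(20):                            # coordinate ascent on phi and gamma
    log_phi = dirichlet_expectation(gamma_d) + Elog_beta  # (4, 3)
    phi = torch.softmax(log_phi, dim=-1)                  # per-occurrence topic responsibilities
    gamma_d = alpha + phi.sum(0)                          # gamma update
```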
The corpus is represented as a bipartite multi-graph G. We use DGL to propagate information along the edges and aggregate the resulting distributions at the doc/word nodes. For scalability, the ϕ variables are transient: they are computed and consumed during message passing. The γ / λ variables are updated after the nodes receive all edge messages. Following the conventions in [1], the γ update is called the E-step and the λ update the M-step, because the β variable has smaller variance. The λ variable is further recorded by the trainer, and we may approximate its MAP estimate by using a large step size for the word nodes. A separate function produces the perplexity, which is based on the ELBO objective function divided by the total number of word/doc occurrences.
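A rough sketch of that message-passing pattern in DGL is shown below; the node/edge-type names, field names, and placeholder values are assumptions for illustration, not the actual code of this example.

```python
import dgl
import dgl.function as fn
import torch

# one edge per word occurrence: doc d contains word w (toy data)
doc_ids = torch.tensor([0, 0, 0, 1, 1])
word_ids = torch.tensor([2, 3, 3, 0, 2])
g = dgl.heterograph({
    ('doc', 'dw', 'word'): (doc_ids, word_ids),
    ('word', 'wd', 'doc'): (word_ids, doc_ids),
}, num_nodes_dict={'doc': 2, 'word': 4})

n_topics, alpha = 3, 0.1
# expected log-probabilities stored on the nodes (placeholders here)
g.nodes['doc'].data['Elog_theta'] = torch.randn(2, n_topics).log_softmax(-1)
g.nodes['word'].data['Elog_beta'] = torch.randn(4, n_topics).log_softmax(-1)

def phi_message(edges):
    # transient phi on each edge: phi_dwz ∝ exp(E[log theta_dz] + E[log beta_zw])
    log_phi = edges.dst['Elog_theta'] + edges.src['Elog_beta']
    return {'phi': torch.softmax(log_phi, dim=-1)}

# E-step: each doc node sums phi over its incident edges, then gamma = alpha + sum(phi)
g.update_all(phi_message, fn.sum('phi', 'phi_sum'), etype='wd')
gamma = alpha + g.nodes['doc'].data['phi_sum']
```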
Example run: `%run example_20newsgroups.py`
- Approximately matches scikit-learn training perplexity after 10 rounds of training.
- Exactly matches scikit-learn training perplexity if `word_z` is set to `lda.components_.T`.
- To compute testing perplexity, we need to fix the word β variables via a MAP estimate. scikit-learn does not take this step, and its β part seems to contain another bug: it divides the training loss by the testing word counts. Nonetheless, I recommend setting `step_size["word"]` to a larger value to approximate the corresponding MAP estimate. (The ELBO-to-perplexity conversion used for these comparisons is sketched after this list.)
- The DGL-LDA model runs 50% faster on GPU devices than scikit-learn without joblib parallelism.
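For reference, the perplexity numbers being compared are derived from the ELBO as sketched below; `elbo` and `n_tokens` are assumed to be the ELBO summed over the evaluation corpus and its total number of word occurrences.

```python
import math

def elbo_to_perplexity(elbo, n_tokens):
    # perplexity bound implied by the ELBO, normalized per word occurrence
    return math.exp(-elbo / n_tokens)
```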
- Set `step_size["word"]` to a large value to obtain a MAP estimate for β (illustrated below).
- Set `0 < word_rho < 1` for online learning.
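A hypothetical configuration sketch tying these two knobs together; the dictionary layout and values are assumptions for illustration, so check `example_20newsgroups.py` for the actual interface.

```python
# hypothetical settings; see example_20newsgroups.py for the real interface
step_size = {"doc": 1.0, "word": 1e4}  # a large word step size approximates a MAP estimate for beta
word_rho = 0.5                         # 0 < word_rho < 1 switches to online (stochastic) updates
```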
[1] Matthew Hoffman, Francis Bach, David Blei. Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems 23 (NIPS 2010).
[2] Reactive LDA Library blog post by Yingjie Miao, for a similar Gibbs-sampling model.