Latent Dirichlet Allocation

LDA is a classical probabilistic graphical model: a hierarchical Bayes model with discrete latent variables on a sparse doc/word graph. This example shows how to implement it in DGL, where the corpus is represented as a bipartite multi-graph G. There is no back-propagation, because gradient descent is typically considered inefficient on the probability simplex. On the provided small-scale example on the 20 Newsgroups dataset, the DGL-LDA model runs about 50% faster on GPU than the scikit-learn model without joblib parallelism.

Key equations

  • The corpus is generated by a hierarchical Bayes process: document (d) -> latent topic (z) -> word (w); a generative sketch follows this list.
  • All positions in the same document share a topic distribution θ_d ~ Dir(α).
  • All positions assigned to the same topic share a word distribution β_z ~ Dir(η).
  • As a result, words in the same document / topic are correlated.
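A minimal NumPy sketch of this generative process (the corpus sizes and hyper-parameter values below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n_docs, doc_len, K, V = 3, 20, 5, 100   # corpus sizes (illustrative)
alpha, eta = 0.1, 0.01                  # Dirichlet hyper-parameters

beta = rng.dirichlet(np.full(V, eta), size=K)           # topic-word distributions, shape (K, V)

docs = []
for d in range(n_docs):
    theta_d = rng.dirichlet(np.full(K, alpha))          # doc-topic distribution, shared within doc d
    z = rng.choice(K, size=doc_len, p=theta_d)          # latent topic for each position
    w = np.array([rng.choice(V, p=beta[k]) for k in z]) # observed word for each position
    docs.append(w)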

MAP

A simplified MAP model is just a non-conjugate model with an inner summation to integrate out the latent topic variable:

    log p(W | θ, β) = Σ_d Σ_w n_dw log Σ_z θ_dz β_zw,

where n_dw is the number of times word w appears in document d.

The main complications are that θ_d / β_z are shared within the same document / topic and that the variables reside on a probability simplex. One way to work around this is expectation maximization (EM):

  • The explicit posterior is ϕ_dwz ∝ θ_dz * β_zw.
  • E-step: compute summary statistics from the fractional memberships ϕ.
  • M-step: set θ_d, β_z proportional to the summary statistics.
  • With an explicit posterior, the bound is tight. A sketch of these updates follows this list.
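A minimal NumPy sketch of one EM round, assuming a dense doc-word count matrix n of shape (D, V); this is an illustration of the updates above, not the code used in the example:

import numpy as np

def em_step(n, theta, beta):
    # n: (D, V) word counts, theta: (D, K) doc-topic, beta: (K, V) topic-word.
    # E-step: explicit posterior phi_dwz ∝ theta_dz * beta_zw.
    phi = theta[:, :, None] * beta[None, :, :]           # shape (D, K, V)
    phi /= phi.sum(axis=1, keepdims=True)                # normalize over topics
    stats = n[:, None, :] * phi                          # fractional memberships weighted by counts
    # M-step: set theta_d and beta_z proportional to the summary statistics.
    theta = stats.sum(axis=2)
    theta /= theta.sum(axis=1, keepdims=True)
    beta = stats.sum(axis=0)
    beta /= beta.sum(axis=1, keepdims=True)
    return theta, beta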

Variational Bayes

A Bayesian model adds Dirichlet priors to θ_d and β_z. This makes the posterior implicit and the bound loose. We still use an independence (mean-field) assumption and cycle through the variational parameters, similarly to coordinate ascent.

  • The evidence lower bound (ELBO) is

    log p(W | α, η) ≥ E_q[log p(W, Z, θ, β | α, η)] − E_q[log q(Z, θ, β)]

  • The ELBO objective function factors as

    L = Σ_d Σ_w n_dw Σ_z ϕ_dwz (E_q[log θ_dz] + E_q[log β_zw] − log ϕ_dwz)
        + Σ_d (E_q[log p(θ_d | α)] − E_q[log q(θ_d | γ_d)])
        + Σ_z (E_q[log p(β_z | η)] − E_q[log q(β_z | λ_z)])

  • Optimization similarly alternates between ϕ, γ, λ. Since θ and β are random, we use the explicit expression for E[log X] under a Dirichlet distribution via the digamma function, e.g. E_q[log θ_dz] = ψ(γ_dz) − ψ(Σ_k γ_dk). A coordinate-ascent sketch follows this list.
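The sketch below follows the closed-form updates in [1], but uses hypothetical dense arrays rather than the example's graph-based implementation; it shows one pass over ϕ, γ, λ with SciPy's digamma:

import numpy as np
from scipy.special import digamma

def vb_step(n, gamma, lam, alpha=0.1, eta=0.01):
    # n: (D, V) counts, gamma: (D, K) doc Dirichlet params, lam: (K, V) topic-word Dirichlet params.
    Elog_theta = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))  # E[log theta]
    Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))       # E[log beta]
    # phi update: phi_dwz ∝ exp(E[log theta_dz] + E[log beta_zw])
    phi = np.exp(Elog_theta[:, :, None] + Elog_beta[None, :, :])             # shape (D, K, V)
    phi /= phi.sum(axis=1, keepdims=True)
    # gamma update ("E-step"): gamma_dz = alpha + Σ_w n_dw ϕ_dwz
    gamma = alpha + (n[:, None, :] * phi).sum(axis=2)
    # lambda update ("M-step"): lambda_zw = eta + Σ_d n_dw ϕ_dwz
    lam = eta + (n[:, None, :] * phi).sum(axis=0)
    return phi, gamma, lam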

DGL usage

The corpus is represented as a bipartite multi-graph G. We use DGL to propagate information along the edges and aggregate the distributions at the doc/word nodes. For scalability, the phi variables are transient and updated during message passing; the gamma / lambda variables are updated after the nodes receive all edge messages. Following the conventions in [1], the gamma update is called the E-step and the lambda update is called the M-step, because the beta variable has smaller variance. The lambda variable is recorded by the trainer, and we may approximate its MAP estimate by using a large step size for word nodes. A separate function produces perplexity, which is based on the ELBO objective function divided by the total number of word/doc occurrences.
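An illustrative sketch of this layout (not the example's actual code; the edge-type name 'occurs' is made up): one edge per word occurrence, a transient per-edge phi, and an update_all call that aggregates the fractional topic memberships at the word nodes.

import torch
import dgl
import dgl.function as fn

# One edge per (document, word) occurrence; repeated pairs make it a multi-graph.
doc_ids = torch.tensor([0, 0, 0, 1, 1])
word_ids = torch.tensor([2, 2, 5, 0, 5])
g = dgl.heterograph({('doc', 'occurs', 'word'): (doc_ids, word_ids)})

K = 3                                               # number of topics
phi = torch.rand(g.num_edges('occurs'), K)          # transient per-edge phi
g.edges['occurs'].data['phi'] = phi / phi.sum(1, keepdim=True)

# Aggregate fractional topic memberships at the word nodes: the sufficient
# statistic behind the lambda ("M-step") update.
g.update_all(fn.copy_e('phi', 'm'), fn.sum('m', 'stat'), etype='occurs')
word_stats = g.nodes['word'].data['stat']           # shape: (num word nodes, K)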

Example

%run example_20newsgroups.py

  • Approximately matches the scikit-learn training perplexity after 10 rounds of training.
  • Exactly matches the scikit-learn training perplexity if word_z is set to lda.components_.T (see the baseline sketch after this list).
  • To compute testing perplexity, we need to fix the word beta variables via a MAP estimate. This step is not taken by sklearn, and its beta part seems to contain another bug, dividing the training loss by the testing word counts. Nonetheless, I recommend setting step_size["word"] to a larger value to approximate the corresponding MAP estimate.
  • The DGL-LDA model runs about 50% faster on GPU devices than sklearn without joblib parallelism.
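For reference, a scikit-learn baseline along these lines can be set up as follows; the preprocessing choices (vocabulary size, stop words) are assumptions and may differ from example_20newsgroups.py:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Bag-of-words counts for the 20 Newsgroups training split.
data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes')).data
X = CountVectorizer(max_features=5000, stop_words='english').fit_transform(data)

# Single-process sklearn baseline (no joblib parallelism), 10 training rounds.
lda = LatentDirichletAllocation(n_components=20, max_iter=10, n_jobs=1).fit(X)
print(lda.perplexity(X))           # training perplexity to compare against
word_z = lda.components_.T         # (vocab, topics); plug in to match exactly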

Advanced configurations

  • Set step_size["word"] to a large value to obtain a MAP estimate for beta.
  • Set 0 < word_rho < 1 for online learning.

References

  1. Matthew Hoffman, Francis Bach, David Blei. Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems 23 (NIPS 2010).
  2. Yingjie Miao. Reactive LDA Library (blog post), describing a similar Gibbs-sampling model.