Repo for our EMNLP 2020 paper, [Improving Neural Topic Models Using Knowledge Distillation](https://www.aclweb.org/anthology/2020.emnlp-main.137). We will clean up the implementation to improve ease of use, but provide the code from our original submission in the meantime.
If you use this code, please use the following citation:
```bibtex
@inproceedings{hoyle-etal-2020-improving,
    title = "Improving Neural Topic Models Using Knowledge Distillation",
    author = "Hoyle, Alexander Miserlis and
      Goel, Pranav and
      Resnik, Philip",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.137",
    pages = "1752--1771",
}
```
- As of now, you'll need two conda environments to run both the BERT teacher and the topic modeling student (which is a modification of Scholar). The environment files are defined in `teacher/teacher.yml` and `scholar/scholar.yml` for the teacher and topic model, respectively. For example:

```
conda env create -f teacher/teacher.yml
```

Edit the first line of the `.yml` file if you want to change the name of the resulting environment; the default is `transformers28`. Create the `scholar` environment the same way from `scholar/scholar.yml`.
- We use the data processing pipeline from Scholar. We'll use the IMDb data as a running example; preprocessing scripts for the Wikitext and 20ng data are also included for replication purposes, but they aren't general-purpose. A sketch of the expected `.jsonlist` layout follows the commands.

```
conda activate scholar

python data/imdb/download_imdb.py

# main preprocessing script
python preprocess_data.py data/imdb/train.jsonlist data/imdb/processed --vocab_size 5000 --test data/imdb/test.jsonlist

# create a dev split from the train data -- change filenames if using different data
python create_dev_split.py
```
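If you want to run the pipeline on your own corpus, `preprocess_data.py` expects a `.jsonlist` file with one JSON object per line. Below is a minimal sketch of that layout; the field names (`id`, `text`, `sentiment`) are assumptions based on the IMDb example, so check `preprocess_data.py` for the keys it actually reads.

```python
import json

# Hypothetical documents; field names are illustrative, not guaranteed by the repo.
docs = [
    {"id": "train_0001", "text": "A surprisingly tender film about ...", "sentiment": "pos"},
    {"id": "train_0002", "text": "Two hours I will never get back.", "sentiment": "neg"},
]

# One JSON object per line, analogous to the train.jsonlist / test.jsonlist inputs above.
with open("train.jsonlist", "w") as outfile:
    for doc in docs:
        outfile.write(json.dumps(doc) + "\n")
```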
- Run the teacher model; below is an example using IMDb. (A note on the `--document-split-pooling` option follows the command.)
```
conda activate transformers28

python teacher/bert_reconstruction.py \
    --input-dir ./data/imdb/processed-dev \
    --output-dir ./data/imdb/processed-dev/logits \
    --do-train \
    --evaluate-during-training \
    --truncate-dev-set-for-eval 120 \
    --logging-steps 200 \
    --save-steps 1000 \
    --num-train-epochs 6 \
    --seed 42 \
    --num-workers 4 \
    --batch-size 20 \
    --gradient-accumulation-steps 8 \
    --document-split-pooling mean-over-logits
```
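Documents longer than the encoder's maximum sequence length have to be split into windows, and the name `--document-split-pooling mean-over-logits` suggests the per-window predictions are averaged back into a single document-level vector. A rough sketch of that idea (not the script's actual code):

```python
import numpy as np

# Hypothetical teacher outputs for one document split into three windows;
# shape is (n_windows, vocab_size), with a toy vocabulary of four words.
window_logits = np.array([
    [2.1, -0.3, 0.7, 1.5],
    [1.8,  0.1, 0.9, 1.2],
    [2.4, -0.5, 0.4, 1.7],
])

# "mean-over-logits": average across windows to get one logit vector per document,
# which is what later gets saved and handed to the topic model as a soft target.
doc_logits = window_logits.mean(axis=0)
print(doc_logits)
```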
- Collect the logits from the teacher model. The `--checkpoint-folder-pattern` argument accepts glob pattern matching in case you want to create logits for different stages of training; be sure to enclose the pattern in double quotes `"`. (An illustration of the pattern matching follows the command.)
```
conda activate transformers28

python teacher/bert_reconstruction.py \
    --output-dir ./data/imdb/processed-dev/logits \
    --seed 42 \
    --num-workers 6 \
    --get-reps \
    --checkpoint-folder-pattern "checkpoint-9000" \
    --save-doc-logits \
    --no-dev
```
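To pull logits from several checkpoints in one pass, a wildcard pattern such as `"checkpoint-*"` should match every saved checkpoint folder (the double quotes keep your shell from expanding the `*` itself). For reference, this is the kind of matching a glob pattern performs, illustrated here with Python's standard library rather than the script:

```python
from fnmatch import fnmatch

folders = ["checkpoint-1000", "checkpoint-5000", "checkpoint-9000", "logs"]

# An exact name matches a single folder; a wildcard matches every checkpoint.
print([f for f in folders if fnmatch(f, "checkpoint-9000")])  # ['checkpoint-9000']
print([f for f in folders if fnmatch(f, "checkpoint-*")])     # all three checkpoints
```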
- Run the topic model (there are a number of extraneous experimental arguments in `run_scholar.py`, which we intend to strip out in a future version). A sketch of how the `--doc-reconstruction-*` flags relate to the distillation objective follows the command.
```
conda activate scholar

python scholar/run_scholar.py \
    ./data/imdb/processed-dev \
    --dev-metric npmi \
    -k 50 \
    --epochs 500 \
    --patience 500 \
    --batch-size 200 \
    --background-embeddings \
    --device 0 \
    --dev-prefix dev \
    -lr 0.002 \
    --alpha 0.5 \
    --eta-bn-anneal-step-const 0.25 \
    --doc-reps-dir ./data/imdb/processed-dev/logits/checkpoint-9000/doc_logits \
    --use-doc-layer \
    --no-bow-reconstruction-loss \
    --doc-reconstruction-weight 0.5 \
    --doc-reconstruction-temp 1.0 \
    --doc-reconstruction-logit-clipping 10.0 \
    -o ./outputs/imdb
```
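The `--doc-reconstruction-*` flags control how the teacher's saved document logits enter the student's objective. As a conceptual sketch only (based on the flag names and the paper, not the actual code in `run_scholar.py`): the logits are clipped, softened with a temperature, and used as a soft target for the topic model's reconstruction, weighted by `--doc-reconstruction-weight` and optionally combined with the usual bag-of-words loss.

```python
import torch
import torch.nn.functional as F

def distilled_reconstruction_loss(student_log_probs, teacher_logits, bow_counts,
                                  weight=0.5, temp=1.0, clip=10.0,
                                  use_bow_loss=False):
    """Conceptual sketch of distilling teacher logits into the reconstruction loss.

    student_log_probs: (batch, vocab) log word probabilities from the topic model
    teacher_logits:    (batch, vocab) document-level logits saved by the teacher
    bow_counts:        (batch, vocab) observed word counts
    """
    # Clip extreme logits, then soften with a temperature to form soft word targets
    # (cf. --doc-reconstruction-logit-clipping and --doc-reconstruction-temp).
    soft_targets = F.softmax(teacher_logits.clamp(-clip, clip) / temp, dim=-1)

    # Cross-entropy of the student's reconstruction against the soft targets,
    # scaled by --doc-reconstruction-weight.
    distill_term = -(soft_targets * student_log_probs).sum(dim=-1)

    # Standard bag-of-words reconstruction term; the example run disables it
    # via --no-bow-reconstruction-loss.
    bow_term = -(bow_counts * student_log_probs).sum(dim=-1) if use_bow_loss else 0.0

    return (weight * distill_term + bow_term).mean()
```

Here `weight`, `temp`, and `clip` stand in for `--doc-reconstruction-weight 0.5`, `--doc-reconstruction-temp 1.0`, and `--doc-reconstruction-logit-clipping 10.0` from the command above.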