Package and data to run experiments for our NeurIPS paper, Is Automated Topic Model Evaluation Broken? and our Findings of EMNLP paper, Are Neural Topic Models Broken?
Preprocessing & coherence calculations are provided as part of an easy-to-use, well-documented package called soup-nuts
(see installation and usage instructions below). Links to the processed Wikipedia data used in the paper are also listed below. We hope that this tool encourages standardized & reproducible topic model evaluation.
Data is linked for download below.
Please cite us if you find this package useful, and do not hesitate to create an issue or email us if you have problems!
If you use the human annotations or preprocessing:
@inproceedings{hoyle-etal-2021-automated,
title = "Is Automated Topic Evaluation Broken? The Incoherence of Coherence",
author = "Hoyle, Alexander Miserlis and
Goel, Pranav and
Hian-Cheong, Andrew and
Peskov, Denis and
Boyd-Graber, Jordan and
Resnik, Philip",
booktitle = "Advances in Neural Information Processing Systems",
year = "2021",
url = "https://arxiv.org/abs/2107.02173",
}
If you use the ground-truth evaluations or stability analyses:
@inproceedings{hoyle-etal-2022-neural,
title = "Are Neural Topic Models Broken?",
author = "Hoyle, Alexander Miserlis and
Goel, Pranav and
Sarkar, Rupak and
Resnik, Philip",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
year = "2022",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.390",
doi = "10.18653/v1/2022.findings-emnlp.390",
pages = "5321--5344",
}
To install the preprocessing and metric evaluation package (called soup-nuts), you first need to get poetry. poetry can create virtual environments automatically, but will also detect any activated virtual environment and use that instead (e.g., if you are using conda, run conda create -n soup-nuts python=3.9 && conda activate soup-nuts).
Then from the repository root, run
$ poetry install
Check the installation with
$ soup-nuts --help
If you do not use poetry, or you have issues with installation, you can run with python -m soup_nuts.main <command name>
Models need their own environments. Requirements are in the .yml
files in each of the soup_nuts/models/<model_name>
directories, and can be installed with
$ conda env create -f <environment_name>.yml
(names for each are set in the first line of each yml
file, and can be changed as needed)
Preprocessing relies on spaCy to efficiently tokenize text, optionally merging together detected entities and other provided phrases (e.g., New York -> New_York). This addition helps with topic readability.
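As a rough illustration of the entity-merging idea (a simplified sketch only, not the package's internal code; it assumes a spaCy pipeline with NER, such as en_core_web_sm, is installed):

```python
# Simplified sketch: detected entity spans are re-joined with underscores so
# they survive whitespace tokenization downstream.
import spacy

nlp = spacy.load("en_core_web_sm")  # any installed spaCy pipeline with NER

def merge_entities(text: str) -> str:
    merged = text
    for ent in nlp(text).ents:
        merged = merged.replace(ent.text, ent.text.replace(" ", "_"))
    return merged

print(merge_entities("The mayor of New York spoke."))
# -> "The mayor of New_York spoke." (assuming the NER model tags "New York")
```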
Thorough instructions for usage can be accessed with
$ soup-nuts preprocess --help
Below is a partial list of arguments (again, see --help for more):
- Preprocessing:
    - --lowercase, --ngram-range, --min-doc-freq, --max-doc-freq, --max-vocab-size
        - Standard preprocessing arguments with semantics borrowed from CountVectorizer in sklearn
    - --limit-vocab-by-df
        - If setting max-vocab-size, sort the terms by their document frequency rather than the overall term frequency
    - --detect-entities, --detect-noun-chunks
        - Detect entities ('New York' -> 'New_York') and noun chunks ('8.1 million American adults' -> '8.1_million_American_adults') with spaCy. The latter is a bit time-intensive and can lead to vocabulary size explosions.
    - --double-count-phrases
        - Collocations are included alongside constituent unigrams, 'New York' -> 'New York New_York'. Per Philip Resnik, this helps with topic readability, although we have not tested it empirically.
    - --lemmatize
        - Topic modeling experts are fairly unified against stemming (and to a lesser extent lemmatization), and there is empirical work to back it up, but we include it as an option anyway. Lemmatization is also fallible: spaCy turns taxes into taxis.
    - --vocabulary
        - An external vocabulary list will override other preprocessing and restrict it to the provided terms.
    - --phrases
        - An external underscore-connected phrase list supplements phrases found with spaCy (e.g., nintendo_gameboy_advance). Use soup-nuts detect-phrases to uncover statistical collocations.
- Data formatting
    - --text-key, --id-key
        - The keys corresponding to the text and id in a csv or jsonlines file. IDs are created automatically based on the line number in the file if not provided.
    - --metadata-keys
        - Keys in csv or jsonlines input data that you would like to pass through to the processed result (in a <split>.metadata.jsonl file). Separate with commas, e.g., key1,key2,key3. Can be helpful (if storage-intensive) to include the original raw text.
    - --output-text
        - Output the processed text in the same order as the input (e.g., "The lamb lies down on Broadway" -> "lamb lies down broadway"). Needed for accurate internal coherence calculations (evaluating on a train/val/test set).
Scripts to process the data as in the NeurIPS paper:
To process a new dataset in the same way, use the following setup:
soup-nuts preprocess \
<your_input_file> \
processed/${OUTPUT_DIR} \
--input-format text \
--lines-are-documents \
--output-text \
--lowercase \
--min-doc-size 5 \
--max-vocab-size 30000 \
--limit-vocab-by-df \
--max-doc-freq 0.9 \
--detect-entities \
--token-regex wordlike \
--no-double-count-phrases \
--min-chars 2
To use the exact vocabulary from our Wikipedia settings, pass --vocabulary
and include this file.
We share the data for our papers here:
- NeurIPS paper data (not labeled)
    - Wikitext. train is the 28-thousand article WikiText-103, full is a 4.6-million article Wikipedia dump from the same period.
        - jsonlines files of fully pre-processed, sequential text to calculate coherence scores
            - Format is {"tokenized_text": "This is a document.", "id": 0 }
            - train.metadata.jsonl
            - full.metadata.jsonl
        - document-term matrices (read with scipy.sparse.load_npz; see the loading sketch after this list)
        - Vocabulary file (this is not necessary)
    - NYTimes data is licensed by the LDC, but please contact us and we can arrange access to processed data
- Findings of EMNLP paper (with hierarchical labels for each document). Data has "tokenized_text", which corresponds to the tokenized text used for the models. The topline results in the paper are from train; test is an additional setting that includes unseen labels at the lower level of the hierarchy (labels are maintained at the top level).
    - Wikitext. Raw text is in "text", labels are in "supercategory", "category", "subcategory" (only the last two are used in the paper)
    - Bills. Raw text is in "summary", labels are "topic", "subtopic"
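As a rough sketch of how to load the shared files once downloaded (the directory and vocabulary file name below are placeholders; the jsonl key and the document-term matrix format follow the descriptions above):

```python
import json
from pathlib import Path

from scipy import sparse

data_dir = Path("data/wikitext/processed")  # placeholder: wherever you saved the downloads

# document-term matrix: rows are documents, columns are vocabulary terms
dtm = sparse.load_npz(data_dir / "train.dtm.npz")

# vocabulary (file name is a placeholder; a json list of terms, or a dict keyed by term)
vocab = json.loads((data_dir / "train-vocab.json").read_text())

# fully pre-processed, sequential text used for coherence calculations
with open(data_dir / "train.metadata.jsonl") as f:
    docs = [json.loads(line)["tokenized_text"] for line in f]

print(dtm.shape, len(vocab), docs[0][:80])
```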
We use gensim to standardize metric calculations. You can download the processed reference Wikipedia corpora used in the paper at the following links:
To obtain metrics on topics, run soup-nuts coherence
with the following arguments:
- --topics-fpath
    - A text file, where each line contains the words for a given topic, ordered by descending word probability (not necessary to have the full vocabulary). A minimal sketch of writing this format follows the argument list.
- --reference-fpath
    - The file used to calculate co-occurrence counts. It is either a jsonl or text file, where each line has space-delimited, processed text in the same order (sequence) as the original data, e.g., "The lamb lies down on Broadway" -> "lamb lies down broadway".
        - This is what is produced with the --output-text flag in soup-nuts preprocess (if a jsonl file is provided, it assumes the key is "tokenized_text")
- --vocabulary-fpath
    - The training set vocabulary, that is, the vocabulary that would have been used by the model to generate the topics file being evaluated. Can either be a json list/dict (if keys are terms), or a plain text file.
- --coherence-measure, one of 'u_mass', 'c_v', 'c_uci', 'c_npmi'; see gensim for details
- --top-n, the number of words from each topic used to calculate coherence
- --window-size, the size of the sliding window over the corpus to obtain co-occurrence counts. Leave blank to use the gensim default.
- --output-fpath
    - Where you would like to save the file (e.g., model_directory/coherences.json)
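For reference, here is a minimal sketch of writing the --topics-fpath format; the topic words are made up, and each line is one topic with its highest-probability words first:

```python
# Hypothetical topics, e.g., the top words from whatever model you trained.
topics = [
    ["tax", "budget", "spending", "deficit", "revenue"],
    ["school", "teacher", "student", "education", "college"],
]

# One space-separated line per topic, words in descending probability order.
with open("topics.txt", "w") as f:
    for words in topics:
        f.write(" ".join(words) + "\n")
```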
As an example:
soup-nuts coherence \
<path-to-topics.txt> \
--output-fpath ./coherences.json \
--reference-fpath data/wikitext/processed/train.metadata.jsonl \
--coherence-measure c_npmi \
--vocabulary-fpath <path-to-train-vocab.json> \
--top-n 15
Use --update
to add to an existing file.
All models currently require independent conda environments. To get model code, you need to clone with the --recurse-submodules
flag.
Although some effort has been made to unify the arguments of the models, for now they should be treated separately.
Running knowledge distillation also requires a separate environment, as it involves the use of the transformers
library.
Some models and settings are not yet fully integrated in the pipeline, and require additional steps or specific flags, as described below (NB: they also introduce some redundancy in the data.)
- scholar model
    - For soup-nuts preprocess, use these flags: --retain-text
    - scholar requires a specific input format. Run the python script data/convert_processed_data_to_scholar_format.py (dependencies are in soup_nuts/models/scholar/scholar.yml).
- Covariates/labels (currently only supported in scholar)
    - For soup-nuts preprocess, specify labels/covariates with these flags: --input-format jsonl --jsonl-text-key <your_text_key> --output-format jsonl --jsonl-metadata-keys <key1,key2,key3,...> (in addition to steps for scholar)
- Knowledge distillation (currently only supported in scholar)
    - For soup-nuts preprocess, retain the text as "metadata" with these flags: --input-format jsonl --jsonl-text-key <your_text_key> --output-format jsonl --jsonl-metadata-keys <your_text_key> (in addition to steps for scholar)
Below, we outline how to run a single mallet model on some example data.
After installing poetry
and miniconda (see above), clone the repo, create the environment and install the packages.
$ git clone -b dev https://github.com/ahoho/topics.git --recurse-submodules
$ conda create -n soup-nuts python=3.9
$ conda activate soup-nuts
$ poetry install
$ pip install pandas
With soup-nuts installed, you can now process data. To process our example data with some sensible settings:
soup-nuts preprocess \
data/examples/speeches_2020.01.04-2020.05.04.jsonl \
data/examples/processed-speeches \
--input-format jsonl \
--jsonl-text-key text \
--jsonl-id-key id \
--lowercase \
--token-regex wordlike \
--min-chars 2 \
--min-doc-freq 5 \
--max-doc-freq 0.95 \
--detect-entities
Now that the data is processed, we can run a topic model. Let's set up mallet. You will need to download it here, then extract it:
$ curl -LO http://mallet.cs.umass.edu/dist/mallet-2.0.8.tar.gz
$ tar -xzvf mallet-2.0.8.tar.gz
Then we need to create a new python environment to run it:
$ cd soup_nuts/models/gensim
$ conda create -n mallet python=3.8
$ conda activate mallet
$ pip install -r requirements.txt
Finally, from the top-level directory, we run the model with
python soup_nuts/models/gensim/lda.py \
--input_dir data/examples/processed-speeches \
--output_dir results/mallet-speeches \
--eval_path train.dtm.npz \
--num_topics 50 \
--mallet_path soup_nuts/models/gensim/mallet-2.0.8/bin/mallet \
--optimize_interval 10
Installation Notes: Make sure the above mallet_path points to where you installed mallet! Otherwise you will get "returned non-zero exit status 127". The current .yml file needs updating: you will need a gensim version earlier than 4.0 in your mallet environment (e.g., pip install gensim==3.8.3). Pandas, numpy, and scipy may be needed as well.
View the top words in each topic with
$ cut -f 1-10 -d " " results/mallet-speeches/topics.txt