This is an implementation of the FEVER shared task. The goal is to build a system that, given a claim and a large evidence corpus, decides whether the claim is supported, refuted, or lacks enough information for factual verification. We use the pre-processed Wikipedia pages (June 2017 dump) provided by the FEVER task as the evidence corpus, together with the large training dataset of 185,445 claims generated by altering sentences extracted from Wikipedia. Each claim is labeled as Supported, Refuted, or NotEnoughInfo, along with the evidence needed for the judgment. We apply TF-IDF and PMI reweightings to the term-document count matrix to retrieve the most relevant documents and sentences. Simple linear classification with word-overlap and word cross-product feature functions reaches 37% and 42% accuracy, respectively, which is comparable to the baseline results in the original FEVER paper.
This work can be seen as a simplified implementation and (small) expansion of the pipeline baseline described in the paper: FEVER: A large-scale dataset for Fact Extraction and VERification.
This is a final project for CS224U Natural Language Understanding, Spring 2018 at Stanford University.
Clone the repository
git clone https://github.com/jongminyoon/fever.git
cd fever
Install requirements (run export LANG=C.UTF-8 if the installation of DrQA fails)
pip install -r requirements.txt
Download the FEVER dataset from the website into the data directory
mkdir data
mkdir data/fever-data
# We use the data used in the baseline paper
wget -O data/fever-data/train.jsonl https://s3-eu-west-1.amazonaws.com/fever.public/train.jsonl
wget -O data/fever-data/dev.jsonl https://s3-eu-west-1.amazonaws.com/fever.public/paper_dev.jsonl
wget -O data/fever-data/test.jsonl https://s3-eu-west-1.amazonaws.com/fever.public/paper_test.jsonl
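Each line in these files is a JSON object for one claim. A minimal sketch for peeking at the data, assuming the standard FEVER fields id, label, and claim:

```python
import json

# Print the first few training claims (assumes the standard FEVER
# fields "id", "label", and "claim" in each JSON line).
with open('data/fever-data/train.jsonl') as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        print(example['id'], example['label'], example['claim'])
        if i >= 4:
            break
```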
The data preparation consists of three steps: downloading the articles from Wikipedia, indexing them for evidence retrieval, and performing negative sampling for training.
Download the pre-processed Wikipedia articles from the website and unzip them into the data folder.
wget https://s3-eu-west-1.amazonaws.com/fever.public/wiki-pages.zip
unzip wiki-pages.zip -d data
Construct an SQLite database. A typical personal laptop seems unable to handle the entire database as a single file, so we also split the Wikipedia database into a few smaller files.
python build_db.py data/wiki-pages data/single --num-files 1
python build_db.py data/wiki-pages data/fever --num-files 5
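The resulting database can then be queried by page title. A rough sketch, assuming a DrQA-style documents table with id and text columns; the database path below is illustrative, and the actual file name produced by build_db.py may differ:

```python
import sqlite3

# Look up the text of a Wikipedia page by its title.
# The path and the "documents(id, text)" schema are assumptions here.
conn = sqlite3.connect('data/single/fever.db')
cursor = conn.cursor()
cursor.execute("SELECT text FROM documents WHERE id = ?", ('Barack_Obama',))
row = cursor.fetchone()
print(row[0][:200] if row else 'page not found')
conn.close()
```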
Create a term-document count matrix for each split, and then merge the count matrices.
python build_count_matrix.py data/fever data/index
python merge_count_matrix.py data/index data/index
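Conceptually, each column of this matrix is a hashed bag-of-words count vector for one Wikipedia page. A simplified sketch of the idea; the actual scripts use DrQA-style tokenization and hashing, so the tokenizer and hash function below are stand-ins:

```python
import numpy as np
from scipy.sparse import csr_matrix

HASH_SIZE = 2 ** 24  # matches the 16777216 buckets in the file name


def hashed_counts(docs):
    """Build a (hash_size x num_docs) sparse term-document count matrix."""
    rows, cols, vals = [], [], []
    for j, doc in enumerate(docs):
        for token in doc.lower().split():          # simplified tokenizer
            rows.append(hash(token) % HASH_SIZE)   # hashing trick (Python's
            cols.append(j)                         # hash() is a stand-in for
            vals.append(1)                         # a stable hash function)
    # Duplicate (row, col) pairs are summed, giving per-document counts.
    return csr_matrix((vals, (rows, cols)), shape=(HASH_SIZE, len(docs)))


mat = hashed_counts(["the cat sat", "the dog barked"])
print(mat.shape, mat.nnz)
```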
Two reweighting schemes are tried: TF-IDF and PMI.
python reweight_count_matrix.py data/index/count-ngram\=1-hash\=16777216.npz data/index --model tfidf
python reweight_count_matrix.py data/index/count-ngram\=1-hash\=16777216.npz data/index --model pmi
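Both schemes rescale the raw counts: TF-IDF multiplies term frequency by inverse document frequency, while PMI compares the joint term-document probability to the product of its marginals. A toy sketch on a small dense matrix; the scripts operate on the sparse hashed matrix, and their exact smoothing may differ:

```python
import numpy as np

counts = np.array([[3., 0., 1.],
                   [1., 2., 0.]])            # terms x documents
N = counts.shape[1]

# TF-IDF: term frequency times inverse document frequency.
df = (counts > 0).sum(axis=1, keepdims=True)
tfidf = counts * np.log(N / df)

# PMI: log of the joint probability over the product of the marginals.
total = counts.sum()
p_td = counts / total
p_t = counts.sum(axis=1, keepdims=True) / total
p_d = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide='ignore'):
    pmi = np.log(p_td / (p_t * p_d))
pmi[np.isneginf(pmi)] = 0.0                  # zero out unseen term-document pairs

print(tfidf)
print(pmi)
```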
The remaining tasks for the FEVER challenge, i.e. document retrieval, sentence selection, sampling for NotEnoughInfo, and RTE training, are done in the IPython notebook fever.ipynb, with the implementation in fever.py. The Oracle class reads either the TF-IDF or the PMI matrix and has methods for finding relevant documents, sentences, etc. for a given input claim.
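At its core, document retrieval scores every page by the dot product between the claim's hashed bag-of-words vector and the reweighted term-document matrix. An illustrative sketch of that idea; the function and variable names below are not the actual Oracle API:

```python
import numpy as np
from scipy.sparse import csr_matrix

HASH_SIZE = 2 ** 24


def top_k_documents(claim, weight_matrix, doc_ids, k=5):
    """Score documents by the dot product between the claim's hashed
    bag-of-words vector and the reweighted term-document matrix.
    Illustrative only; the real Oracle class may work differently."""
    idx = [hash(tok) % HASH_SIZE for tok in claim.lower().split()]
    claim_vec = csr_matrix((np.ones(len(idx)),
                            (np.zeros(len(idx), dtype=int), idx)),
                           shape=(1, HASH_SIZE))
    scores = (claim_vec @ weight_matrix).toarray().ravel()  # one score per doc
    best = np.argsort(-scores)[:k]
    return [(doc_ids[i], scores[i]) for i in best]
```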
The oracle accuracies for document retrieval, for varying numbers of retrieved documents, are:

| Num Docs | TF-IDF Accuracy (%) | PMI Accuracy (%) |
|---|---|---|
| 1 | 23.2 | 23.2 |
| 3 | 45.5 | 45.5 |
| 5 | 56.9 | 56.9 |
| 10 | 69.0 | 69.0 |
| Num Docs | Accuracy (%) |
|---|---|
| 1 | 51.2 |
| 3 | 67.0 |
| 5 | 72.7 |
| 10 | 81.8 |
We used a logistic classifier with grid cross-validation to select the best hyperparameters. The details can be found in fever_paper.pdf in the reports folder.
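A rough sketch of that setup with scikit-learn; the real feature functions and parameter grid live in fever.py and fever.ipynb, and the features and grid below are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# X: feature vectors (e.g. word-overlap or cross-product features) for each
# claim/evidence pair; y: labels. Random data stands in for the real features.
X = np.random.rand(300, 20)
y = np.random.choice(['SUPPORTS', 'REFUTES', 'NOT ENOUGH INFO'], size=300)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1.0, 10.0]},   # illustrative grid
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```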
Word-overlap feature function:

| | Precision | Recall | F1 score |
|---|---|---|---|
| Supported | 0.337 | 0.798 | 0.455 |
| Refuted | 0.426 | 0.012 | 0.023 |
| NEI | 0.362 | 0.326 | 0.343 |
| avg / total | 0.374 | 0.346 | 0.274 |
Word cross-product feature function:

| | Precision | Recall | F1 score |
|---|---|---|---|
| Supported | 0.378 | 0.410 | 0.394 |
| Refuted | 0.535 | 0.219 | 0.311 |
| NEI | 0.339 | 0.527 | 0.420 |
| avg / total | 0.421 | 0.385 | 0.375 |