LIAR-Detect is a machine learning project focused on classifying political statements into truthfulness categories such as `true`, `false`, and `half-true`. Leveraging the LIAR dataset, this project explores text-based fake news detection and provides insights into the credibility of statements made by public figures. The trained model is deployed as a web service, ready for integration into fact-checking workflows or other applications.
The proliferation of misinformation and fake news, especially in the political domain, has made it challenging to discern credible statements. This project aims to develop a classification model capable of analyzing short political statements and categorizing them into truthfulness levels.
The dataset, LIAR, contains approximately 10,000 labeled statements with metadata such as the speaker, subject, and historical truthfulness records. By combining text analysis with metadata, the project aims to identify misinformation.
The dataset used for this project is the LIAR Benchmark Dataset, introduced in William Yang Wang's paper, "Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection."
Dataset Features:
- Text Data: Short political statements.
- Metadata: Speaker, party affiliation, context, and historical truthfulness records.
- Labels: `true`, `mostly-true`, `half-true`, `false`, `barely-true`, and `pants-on-fire`.
Source: ACL 2017 Paper
The dataset is automatically downloaded at the start of the Jupyter Notebook.
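For quick reference, the six truthfulness labels can be mapped to an ordinal scale, which is convenient during EDA; a minimal sketch (the ordering is our analysis choice, not prescribed by the dataset):

```python
# The six LIAR labels, ordered here from most to least truthful
# (an analysis convenience; the dataset itself treats them as categories).
LABELS = [
    "true", "mostly-true", "half-true",
    "barely-true", "false", "pants-on-fire",
]

# Integer ranks, e.g. for ordinal-style error analysis during EDA
label_rank = {label: rank for rank, label in enumerate(LABELS)}
print(label_rank["half-true"])  # → 2
```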
- **Data Preparation and Exploratory Data Analysis (EDA):**
  - Handle missing values.
  - Analyze label and feature distributions, as well as statement lengths, topics, and important words.
  - Preprocess text data using TF-IDF vectorization and one-hot encoding.
  - Engineer features from metadata for enhanced model performance.
- **Model Training and Selection:**
  - Train and evaluate multiple models:
    - Logistic Regression
    - Random Forest Classifier
    - Linear SVM
    - Multinomial Naive Bayes
    - XGBoost
  - Perform hyperparameter tuning using `GridSearchCV`.
  - Select the best model based on cross-validation accuracy.
- **Model Evaluation:**
  - Assess performance on the test dataset.
  - Generate classification reports and visualize confusion matrices.
- **Containerization:**
  - Package the Flask web service into a Docker container for portability.
  - Push the container image to Amazon Elastic Container Registry (ECR) for centralized storage.
- **Cloud Deployment:**
  - Deploy the containerized service to AWS ECS Fargate for scalable, serverless hosting.
  - Leverage spot instances for cost efficiency.
  - Detailed deployment instructions can be found in the `aws_ecs_deployment` subproject's README.
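The preprocessing and model-selection steps above can be sketched end to end with scikit-learn. This is a minimal illustration on toy data, not the project's actual pipeline — the column names, toy statements, classifier, and parameter grid are ours:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame mimicking the LIAR layout (column names are illustrative)
df = pd.DataFrame({
    "statement": ["taxes went up", "crime rates fell", "taxes went down",
                  "jobs were created", "jobs were lost", "crime rates rose"],
    "party": ["republican", "democrat", "republican",
              "democrat", "republican", "democrat"],
    "label": ["false", "true", "half-true", "true", "false", "half-true"],
})

# TF-IDF for the statement text, one-hot encoding for the metadata column
preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "statement"),
    ("meta", OneHotEncoder(handle_unknown="ignore"), ["party"]),
])

pipeline = Pipeline([
    ("features", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validated grid search over a tiny illustrative parameter grid
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0]}, cv=2, scoring="accuracy")
grid.fit(df, df["label"])
print(grid.best_params_)
```

The same pattern scales to the real dataset: swap in the LIAR columns, a richer grid, and the classifiers listed above.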
- Python 3.12 or higher
- `pipenv` for dependency management
- Docker (optional, for containerization)
1. Clone the repository:

   ```bash
   git clone git@github.com:tomtuamnuq/LIAR-Detect-Fake-News-Statement-Classification.git
   cd LIAR-Detect-Fake-News-Statement-Classification
   ```

2. Install dependencies using `pipenv`:

   ```bash
   pipenv install
   ```

3. Activate the Pipenv shell in the project root:

   ```bash
   pipenv shell
   ```

4. Launch the Jupyter Notebook (if needed) with the correct working directory:

   ```bash
   jupyter lab --notebook-dir=notebooks
   ```

5. Execute at least the first cell of the Jupyter Notebook to load the dataset into the `data` directory.
To train the model, use the `train.py` script located in the `src` directory. This will preprocess the data, train the model, and save the required pickle files (`feature_engineer.pkl` and `xgboost_model.pkl`) in the `models` directory. Ensure that the `train.tsv` and `test.tsv` files exist in the `data` directory before training!

Run the training script:

```bash
python src/train.py
```

Once complete, the `models/` directory will contain the following:

- `feature_engineer.pkl`: the pickled feature-engineering pipeline.
- `xgboost_model.pkl`: the trained XGBoost model.
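Both artifacts are ordinary pickle files, so reloading them for inference follows the standard pattern. A minimal sketch with a stand-in object (the real files come from `src/train.py`, and `predict.py` loads them at service start-up):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted pipeline; in the project this would be the
# feature engineer or the XGBoost model produced by src/train.py.
artifact = {"name": "feature_engineer", "fitted": True}

model_dir = Path(tempfile.mkdtemp())
path = model_dir / "feature_engineer.pkl"

# Save (what train.py does) ...
with path.open("wb") as f:
    pickle.dump(artifact, f)

# ... and load (what a prediction service does at start-up)
with path.open("rb") as f:
    loaded = pickle.load(f)

print(loaded["name"])  # → feature_engineer
```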
To test the trained model inference implementation, use `pytest` to run the tests in the `tests` directory. These tests verify model inference, schema validation, and the prediction pipeline.

Run the tests:

```bash
pytest tests
```

Make sure that the trained model files (`feature_engineer.pkl` and `xgboost_model.pkl`) exist in the `models` directory before running the tests.
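As an illustration of what a schema-validation test can look like, here is a minimal pytest-style sketch. The field names and the `validate_payload` helper are hypothetical, not the project's actual test code:

```python
# Hypothetical required fields for a /predict payload (an illustrative
# subset of the LIAR metadata; the project's real schema may differ).
REQUIRED_FIELDS = ("statement", "speaker", "party")

def validate_payload(payload: dict) -> bool:
    """Return True if every required field is a non-empty string."""
    return all(
        isinstance(payload.get(field), str) and payload[field]
        for field in REQUIRED_FIELDS
    )

def test_valid_payload():
    payload = {"statement": "Taxes went up.", "speaker": "jane-doe", "party": "none"}
    assert validate_payload(payload)

def test_rejects_missing_field():
    assert not validate_payload({"statement": "Taxes went up."})
```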
Follow these steps to build, run, and test the Flask application using Docker:

Ensure you are in the project root directory and run the following command to build the Docker image:

```bash
docker build -t liar-detect-app .
```

Start the container and expose it on port `5042`:

```bash
docker run -p 5042:5042 liar-detect-app
```

The application will now be accessible at `http://127.0.0.1:5042`.

You can test the `/predict` endpoint using a test JSON file. For example, to test with `test_single_false.json`, run:

```bash
curl -X POST -H "Content-Type: application/json" -d @tests/test_single_false.json http://127.0.0.1:5042/predict
```
The API should respond with a JSON object containing the predicted label. For example:

```json
{
  "predicted_label": "pants-fire",
  "true_label": "pants-fire",
  "correct": true
}
```
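In the example above, `correct` reflects whether the predicted label matches the provided true label, which makes responses easy to check programmatically; a small sketch parsing such a response:

```python
import json

# Example /predict response body (taken from the README example above)
body = '{"predicted_label": "pants-fire", "true_label": "pants-fire", "correct": true}'

resp = json.loads(body)
# Sanity check: "correct" should agree with a direct label comparison
assert resp["correct"] == (resp["predicted_label"] == resp["true_label"])
print(resp["predicted_label"])  # → pants-fire
```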
This subsection explains how to set up the AWS CLI on Arch Linux, push the Docker image to Amazon ECR Public, and retrieve the image locally.

On Arch Linux, you can install the AWS CLI using the `aws-cli` package:

```bash
yay -S aws-cli
```

Set up the AWS CLI with your credentials and default region:

```bash
aws configure
```

You will be prompted to enter:

- AWS Access Key ID
- AWS Secret Access Key

Ensure that your AWS credentials are valid and that your user account has the required permissions for ECR Public operations.
Use the provided `upload_to_ecr.sh` script to push the Docker image to Amazon ECR Public:

```bash
./upload_to_ecr.sh
```

Ensure that the `upload_to_ecr.sh` script is executable:

```bash
chmod +x upload_to_ecr.sh
```

The image is publicly available and does not require authentication for pulling.

To pull the Docker image from Amazon ECR Public to your local Docker installation, use the following command:

```bash
docker pull public.ecr.aws/t8q6o3x2/tuamnuq-liar-detect-app:latest
```

The image will now be available locally and can be verified using:

```bash
docker images
```
```text
LIAR-Detect/
│
├── data/                      # Dataset directory
│   ├── train.tsv              # Training data
│   ├── test.tsv               # Test data
│
├── models/                    # Model and pipeline directory
│   ├── feature_engineer.pkl
│   ├── xgboost_model.pkl
│
├── notebooks/
│   └── notebook.ipynb         # Main notebook for data preparation, EDA, and model selection
│
├── src/                       # Source code directory
│   ├── common_feature.py      # Shared utilities and paths
│   ├── train.py               # Script to train the final model
│   ├── predict.py             # Script to serve predictions via a web service
│
├── tests/                     # Test suite directory
│   ├── test_inference.py      # Pytest for inference pipeline
│   ├── test_single_false.json # Test input JSON with false label
│   ├── test_single_mtrue.json # Test input JSON with mostly-true label
│
├── aws_ecs_deployment/        # CDK subproject for ECS Fargate deployment
│
├── Dockerfile                 # Docker configuration
├── Pipfile                    # Dependency management using Pipenv
├── Pipfile.lock               # Lockfile for reproducibility
├── pyproject.toml             # Pytest configuration
├── README.md                  # Project documentation
└── .gitignore                 # Ignored files and directories
```
- **Best Model:** XGBoost
- **Accuracy on Test Data:** 0.4301 (default parameters)
- Detailed performance metrics can be found in the project notebook (`notebook.ipynb`).
This project is licensed under the MIT License. See the `LICENSE` file for details.