LIAR-Detect is a machine learning project focused on classifying political statements into truthfulness categories such as `true`, `false`, and `half-true`. Leveraging the LIAR dataset, this project explores text-based fake news detection and provides insights into the credibility of statements made by public figures. The trained model is deployed as a web service, ready for integration into fact-checking workflows or other applications.
The proliferation of misinformation and fake news, especially in the political domain, has made it challenging to discern credible statements. This project aims to develop a classification model capable of analyzing short political statements and categorizing them into truthfulness levels.
The dataset, LIAR, contains approximately 10,000 labeled statements with metadata such as the speaker, subject, and historical truthfulness records. By combining text analysis with metadata, the project aims to identify misinformation.
The dataset used for this project is the LIAR Benchmark Dataset, introduced in William Yang Wang's paper, "Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection."
Dataset Features:
- Text Data: Short political statements.
- Metadata: Speaker, party affiliation, context, and historical truthfulness records.
- Labels: `true`, `mostly-true`, `half-true`, `false`, `barely-true`, and `pants-on-fire`.
Source: ACL 2017 Paper
The dataset is automatically downloaded at the start of the Jupyter Notebook.
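For quick reference, the six truthfulness labels can be mapped to an ordinal scale, which is convenient during EDA; a minimal sketch (the ordering is our analysis choice, not prescribed by the dataset):

```python
# The six LIAR labels, ordered here from most to least truthful
# (an analysis convenience; the dataset itself treats them as categories).
LABELS = [
    "true", "mostly-true", "half-true",
    "barely-true", "false", "pants-on-fire",
]

# Integer ranks, e.g. for ordinal-style error analysis during EDA
label_rank = {label: rank for rank, label in enumerate(LABELS)}
print(label_rank["half-true"])  # → 2
```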
- **Data Preparation and Exploratory Data Analysis (EDA):**
  - Handle missing values.
  - Analyze label and feature distributions, as well as statement lengths, topics, and important words.
  - Preprocess text data using TF-IDF vectorization and one-hot encoding.
  - Engineer features from metadata for enhanced model performance.
- **Model Training and Selection:**
  - Train and evaluate multiple models:
    - Logistic Regression
    - Random Forest Classifier
    - Linear SVM
    - Multinomial Naive Bayes
    - XGBoost
  - Perform hyperparameter tuning using `GridSearchCV`.
  - Select the best model based on cross-validation accuracy.
- **Model Evaluation:**
  - Assess performance on the test dataset.
  - Generate classification reports and visualize confusion matrices.
- **Containerization:**
  - Package the Flask web service into a Docker container for portability.
  - Push the container image to Amazon Elastic Container Registry (ECR) for centralized storage.
- **Cloud Deployment:**
  - Deploy the containerized service to AWS ECS Fargate for scalable, serverless hosting.
  - Leverage spot instances for cost efficiency.
  - Detailed deployment instructions can be found in the `aws_ecs_deployment` subproject's README.
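The preprocessing and model-selection steps above can be sketched end to end with scikit-learn. This is a minimal illustration on toy data, not the project's actual pipeline — the column names, toy statements, classifier, and parameter grid are ours:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame mimicking the LIAR layout (column names are illustrative)
df = pd.DataFrame({
    "statement": ["taxes went up", "crime rates fell", "taxes went down",
                  "jobs were created", "jobs were lost", "crime rates rose"],
    "party": ["republican", "democrat", "republican",
              "democrat", "republican", "democrat"],
    "label": ["false", "true", "half-true", "true", "false", "half-true"],
})

# TF-IDF for the statement text, one-hot encoding for the metadata column
preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "statement"),
    ("meta", OneHotEncoder(handle_unknown="ignore"), ["party"]),
])

pipeline = Pipeline([
    ("features", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validated grid search over a tiny illustrative parameter grid
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0]}, cv=2, scoring="accuracy")
grid.fit(df, df["label"])
print(grid.best_params_)
```

The same pattern scales to the real dataset: swap in the LIAR columns, a richer grid, and the classifiers listed above.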
- Python 3.12 or higher
- `pipenv` for dependency management
- Docker (optional, for containerization)
1. Clone the repository:

   ```bash
   git clone git@github.com:tomtuamnuq/LIAR-Detect-Fake-News-Statement-Classification.git
   cd LIAR-Detect-Fake-News-Statement-Classification
   ```

2. Install dependencies using `pipenv`:

   ```bash
   pipenv install
   ```

3. Activate the Pipenv shell in the project root:

   ```bash
   pipenv shell
   ```

4. Launch the Jupyter Notebook (if needed) with the correct working directory:

   ```bash
   jupyter lab --notebook-dir=notebooks
   ```

5. Execute at least the first cell of the Jupyter Notebook to load the dataset into the `data` directory.
To train the model, use the `train.py` script located in the `src` directory. This will preprocess the data, train the model, and save the required pickle files (`feature_engineer.pkl` and `xgboost_model.pkl`) in the `models` directory. Ensure that the `train.tsv` and `test.tsv` files exist in the `data` directory before training!

Run the training script:

```bash
python src/train.py
```

Once complete, the `models/` directory will contain the following:

- `feature_engineer.pkl`: the pickled feature-engineering pipeline.
- `xgboost_model.pkl`: the trained XGBoost model.
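Both artifacts are ordinary pickle files, so reloading them for inference follows the standard pattern. A minimal sketch with a stand-in object (the real files come from `src/train.py`, and `predict.py` loads them at service start-up):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted pipeline; in the project this would be the
# feature engineer or the XGBoost model produced by src/train.py.
artifact = {"name": "feature_engineer", "fitted": True}

model_dir = Path(tempfile.mkdtemp())
path = model_dir / "feature_engineer.pkl"

# Save (what train.py does) ...
with path.open("wb") as f:
    pickle.dump(artifact, f)

# ... and load (what a prediction service does at start-up)
with path.open("rb") as f:
    loaded = pickle.load(f)

print(loaded["name"])  # → feature_engineer
```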
To test the trained model inference implementation, use `pytest` to run the tests in the `tests` directory. These tests verify model inference, schema validation, and the prediction pipeline.

Run the tests:

```bash
pytest tests
```

Make sure that the trained model files (`feature_engineer.pkl` and `xgboost_model.pkl`) exist in the `models` directory before running the tests.
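As an illustration of what a schema-validation test can look like, here is a minimal pytest-style sketch. The field names and the `validate_payload` helper are hypothetical, not the project's actual test code:

```python
# Hypothetical required fields for a /predict payload (an illustrative
# subset of the LIAR metadata; the project's real schema may differ).
REQUIRED_FIELDS = ("statement", "speaker", "party")

def validate_payload(payload: dict) -> bool:
    """Return True if every required field is a non-empty string."""
    return all(
        isinstance(payload.get(field), str) and payload[field]
        for field in REQUIRED_FIELDS
    )

def test_valid_payload():
    payload = {"statement": "Taxes went up.", "speaker": "jane-doe", "party": "none"}
    assert validate_payload(payload)

def test_rejects_missing_field():
    assert not validate_payload({"statement": "Taxes went up."})
```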
Follow these steps to build, run, and test the Flask application using Docker:

Ensure you are in the project root directory and run the following command to build the Docker image:

```bash
docker build -t liar-detect-app .
```

Start the container and expose it on port `5042`:

```bash
docker run -p 5042:5042 liar-detect-app
```

The application will now be accessible at `http://127.0.0.1:5042`.

You can test the `/predict` endpoint using a test JSON file. For example, to test with `test_single_false.json`, run:

```bash
curl -X POST -H "Content-Type: application/json" -d @tests/test_single_false.json http://127.0.0.1:5042/predict
```
The API should respond with a JSON object containing the predicted label. For example:

```json
{
  "predicted_label": "pants-fire",
  "true_label": "pants-fire",
  "correct": true
}
```
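In the example above, `correct` reflects whether the predicted label matches the provided true label, which makes responses easy to check programmatically; a small sketch parsing such a response:

```python
import json

# Example /predict response body (taken from the README example above)
body = '{"predicted_label": "pants-fire", "true_label": "pants-fire", "correct": true}'

resp = json.loads(body)
# Sanity check: "correct" should agree with a direct label comparison
assert resp["correct"] == (resp["predicted_label"] == resp["true_label"])
print(resp["predicted_label"])  # → pants-fire
```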
This subsection explains how to set up the AWS CLI on Arch Linux, push the Docker image to Amazon ECR Public, and retrieve the image locally.

On Arch Linux, you can install the AWS CLI using the `aws-cli` package:

```bash
yay -S aws-cli
```

Set up the AWS CLI with your credentials and default region:

```bash
aws configure
```

You will be prompted to enter:

- AWS Access Key ID
- AWS Secret Access Key

Ensure that your AWS credentials are valid and that your user account has the required permissions for ECR Public operations.
Use the provided `upload_to_ecr.sh` script to push the Docker image to Amazon ECR Public:

```bash
./upload_to_ecr.sh
```

Ensure that the `upload_to_ecr.sh` script is executable:

```bash
chmod +x upload_to_ecr.sh
```

The image is publicly available and does not require authentication for pulling.

To pull the Docker image from Amazon ECR Public to your local Docker installation, use the following command:

```bash
docker pull public.ecr.aws/t8q6o3x2/tuamnuq-liar-detect-app:latest
```

The image will now be available locally and can be verified using:

```bash
docker images
```
```text
LIAR-Detect/
│
├── data/                      # Dataset directory
│   ├── train.tsv              # Training data
│   ├── test.tsv               # Test data
│
├── models/                    # Model and pipeline directory
│   ├── feature_engineer.pkl
│   ├── xgboost_model.pkl
│
├── notebooks/
│   └── notebook.ipynb         # Main notebook for data preparation, EDA, and model selection
│
├── src/                       # Source code directory
│   ├── common_feature.py      # Shared utilities and paths
│   ├── train.py               # Script to train the final model
│   ├── predict.py             # Script to serve predictions via a web service
│
├── tests/                     # Test suite directory
│   ├── test_inference.py      # Pytest for inference pipeline
│   ├── test_single_false.json # Test input JSON with false label
│   ├── test_single_mtrue.json # Test input JSON with mostly-true label
│
├── aws_ecs_deployment/        # CDK subproject for ECS Fargate deployment
│
├── Dockerfile                 # Docker configuration
├── Pipfile                    # Dependency management using Pipenv
├── Pipfile.lock               # Lockfile for reproducibility
├── pyproject.toml             # Pytest configuration
├── README.md                  # Project documentation
└── .gitignore                 # Ignored files and directories
```
- **Best Model:** XGBoost
- **Accuracy on Test Data:** 0.4301 (default parameters)
- Detailed performance metrics can be found in the project notebook (`notebook.ipynb`).
This project is licensed under the MIT License. See the `LICENSE` file for details.