- A model to determine the “bias label” of each paragraph in a news article
- This project focuses only on political bias
- Output Labels of the Model: Biased, Unbiased
- Project does not consider other types of bias:
- Overview of Different types of Bias (https://github.com/amazon-research/bold)
- Raise awareness of how trustworthy a given text is
- Help readers make informed decisions and form accurate opinions on current issues
- Help reduce the spread of misinformation
- Dataset used: NELA-GT-2019 (hosted on Harvard Dataverse)
- 1.12M news articles from 260 sources
- Collected between January 1, 2019 and December 31, 2019
- Label from this dataset that matters for this project:
- Aggregate label: reliable, mixed, or unreliable, assigned per article source
- Use a web scraping library to extract cleaned article content and replace the “content” column in the SQLite database
- Create a dataframe from the SQLite database
- Columns:
- Content (article body)
- Label
- 1 for biased (corresponds to 2, unreliable, in the dataset's aggregate label column)
- 0 for unbiased (corresponds to 0, reliable, in the dataset's aggregate label column)
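
The dataframe step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the table name `newsdata`, the `source` column, and the `source_labels` dictionary (source name to aggregate label) are assumptions made for the example, as is the choice to drop articles from "mixed" sources.

```python
import sqlite3

import pandas as pd

# Assumed mapping from the dataset's aggregate label to the project's label:
# 2 (unreliable) -> 1 (biased), 0 (reliable) -> 0 (unbiased).
# Sources labeled 1 (mixed) have no entry and are dropped (an assumption).
LABEL_MAP = {2: 1, 0: 0}


def build_dataframe(conn, source_labels):
    """Return a DataFrame with 'content' and 'label' columns.

    conn: open sqlite3 connection containing a hypothetical `newsdata` table
          with at least `source` and `content` columns.
    source_labels: dict mapping source name -> aggregate label (0, 1, or 2).
    """
    df = pd.read_sql_query("SELECT source, content FROM newsdata", conn)

    # Labels are assigned per source, so each article inherits its source's label.
    binary = {
        src: LABEL_MAP[lab] for src, lab in source_labels.items() if lab in LABEL_MAP
    }
    df["label"] = df["source"].map(binary)

    # Drop articles from mixed or unlabeled sources and articles without a body.
    df = df.dropna(subset=["label", "content"])
    df = df[df["content"].str.strip() != ""].copy()
    df["label"] = df["label"].astype(int)
    return df[["content", "label"]].reset_index(drop=True)
```

Keeping the label mapping in one dictionary makes it easy to change later, for example if "mixed" sources should be treated as biased instead of being dropped.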
- Download the SentenceBERT pre-trained model (need to determine the specific model)
- Fine-tune the SentenceBERT model using our dataset
- Encode all article sentences into vectors using the SentenceBERT model
- Feed encoded sentences into a CNN
- Train CNN model based on labels with the training data
- Test CNN model with the testing data
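
As a rough sketch of the encoding step in the list above: each paragraph's sentences are turned into vectors and stacked into a fixed-size array that a CNN can consume. The model name `all-MiniLM-L6-v2`, the helper names, and the `max_sentences` padding scheme are all assumptions for illustration; the project has not yet fixed a specific model.

```python
import numpy as np


def load_sentence_encoder(model_name="all-MiniLM-L6-v2"):
    """Return a function mapping a list of sentences to an (n, dim) array.

    The model name is an assumption; any SentenceBERT checkpoint could be used.
    """
    # Imported lazily so the padding logic below runs without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda sentences: np.asarray(model.encode(sentences))


def encode_paragraphs(paragraphs, encode, max_sentences=8):
    """Encode paragraphs into a (num_paragraphs, max_sentences, dim) array.

    paragraphs: non-empty list of paragraphs, each a non-empty list of sentences.
    encode: callable mapping a list of sentences to an (n, dim) array.
    Paragraphs longer than max_sentences are truncated; shorter ones are
    zero-padded so every paragraph has the same shape.
    """
    if not paragraphs:
        raise ValueError("paragraphs must not be empty")
    encoded = [
        np.asarray(encode(p[:max_sentences]), dtype=np.float32) for p in paragraphs
    ]
    dim = encoded[0].shape[1]
    batch = np.zeros((len(paragraphs), max_sentences, dim), dtype=np.float32)
    for i, emb in enumerate(encoded):
        batch[i, : emb.shape[0]] = emb
    return batch
```

With the real model this would be called as `encode_paragraphs(paragraphs, load_sentence_encoder())`; passing the encoder as an argument also lets the padding logic be checked with a stand-in encoder.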
- Sentence embedding: a vector for each sentence in the article that captures its meaning in context
- SentenceBERT
- BERT is a pre-trained language model that captures nuances of the English language
- Fine-tuning: we train BERT further on our data so it learns bias-specific features of the language
- Label Detection and Classification using CNN
- Extract the “bias” features from a paragraph
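
One possible shape for the CNN classifier, sketched in PyTorch. This is an illustrative architecture rather than the project's confirmed design: the class name `ParagraphCNN`, the filter counts, the kernel sizes, and the embedding size of 384 are assumptions. The input is a batch of paragraphs, each represented as a sequence of sentence embeddings.

```python
import torch
import torch.nn as nn


class ParagraphCNN(nn.Module):
    """1-D CNN over the sentence embeddings of a paragraph (hypothetical design)."""

    def __init__(self, emb_dim=384, num_filters=64, kernel_sizes=(1, 2, 3), num_classes=2):
        super().__init__()
        # padding=k - 1 keeps the output non-empty even for very short paragraphs.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k, padding=k - 1) for k in kernel_sizes]
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):
        # x: (batch, num_sentences, emb_dim); Conv1d expects channels first.
        x = x.transpose(1, 2)
        # Convolve along the sentence axis, then max-pool each feature map.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        # Logits of shape (batch, num_classes): index 0 = unbiased, 1 = biased.
        return self.classifier(torch.cat(pooled, dim=1))


def train_step(model, optimizer, embeddings, labels):
    """Run one gradient update and return the batch loss as a float."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(embeddings), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Max-pooling over the sentence axis means a single strongly "biased" sentence can drive the paragraph-level prediction, which is one reasonable reading of extracting bias features from a paragraph.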
- Below is an example of the workflow:
- SentenceBERT
- Python script to build the dataframe from the SQLite database
- Web scraping library for scraping news articles
- Reason: the article content scraped in the NELA-GT-2019 dataset is of poor quality
- https://github.com/arnavn101/WebXplore
- SentenceBERT
- CNN