Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they are observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.
This project builds a learning model that classifies Tweets as disaster or non-disaster.
- Data Collection and Cleaning
- Exploratory Data Analysis
- Pre Processing
- Modeling
- Inferential Visualizations
- Conclusions and Recommendation
This dataset was created by the company Figure Eight and originally shared on their ‘Data For Everyone’ website.
Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480
This step involves the following (a sketch follows the list):
- Import and Read Data - reading the csv file
- Data Visualization - creating a histogram and a word cloud
- Baseline Accuracy - calculating the accuracy of always predicting the majority class
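A minimal sketch of these steps, assuming the training file is named `train.csv` and the tweet text and label live in `text` and `target` columns (the standard layout for this Kaggle dataset, but worth checking against the notebook):

```python
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Read the csv file
df = pd.read_csv('train.csv')

# Histogram of tweet lengths, split by class
df['tweet_length'] = df['text'].str.len()
df.hist(column='tweet_length', by='target', bins=30)
plt.show()

# Word cloud of all tweet text
cloud = WordCloud(width=800, height=400).generate(' '.join(df['text']))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

# Baseline accuracy: proportion of the majority class
baseline = df['target'].value_counts(normalize=True).max()
print(f'Baseline accuracy: {baseline:.3f}')
```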
This step involves the following methods (a sketch follows the list):
- Tokenizing - splitting the text into distinct chunks (tokens)
- Removing Stopwords - removing commonly used words, as they take up space and processing time
- Lemmatizing - returning the base/dictionary form of each word
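A sketch of this pre-processing using NLTK; the exact tokenizer and lemmatizer used in the notebook may differ:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenizing: split the tweet into distinct word chunks
    tokens = word_tokenize(text.lower())
    # Removing stopwords and non-alphabetic tokens
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # Lemmatizing: reduce each word to its base/dictionary form
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess('Forest fires are spreading near the evacuation zone'))
# e.g. ['forest', 'fire', 'spreading', 'near', 'evacuation', 'zone']
```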
This step builds and compares three models (a modeling sketch follows the list):
- Logistic Regression Model
- Naive Bayes Model
- Decision Tree Model
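A sketch of the model comparison, continuing from the loading sketch above; the TF-IDF vectorizer, train/test split, and hyperparameters are assumptions rather than the notebook's exact settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['target'], stratify=df['target'], random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Naive Bayes': MultinomialNB(),
    'Decision Tree': DecisionTreeClassifier(max_depth=10),
}

# Fit each model on vectorized text and report train/test accuracy
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(stop_words='english'), clf)
    pipe.fit(X_train, y_train)
    print(name,
          'train:', round(pipe.score(X_train, y_train), 4),
          'test:', round(pipe.score(X_test, y_test), 4))
```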
Train and Test Scores:
Model | Train Score | Test Score
---|---|---
Logistic Regression Model | 0.8895 | 0.7978
Naive Bayes Model | 0.7837 | 0.7731
Confusion Matrix Results (a sketch of the computation follows the table):
Model | False Positives | False Negatives |
---|---|---|
Logistic Regression Model | 0 | 3 |
Naive Bayes Model | 8 | 0 |
Decision Tree Model | 20 | 15 |
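The false-positive and false-negative counts above can be read off a scikit-learn confusion matrix; a sketch for one fitted pipeline from the modeling step:

```python
from sklearn.metrics import confusion_matrix

# Predict on the held-out test set and unpack the 2x2 confusion matrix
y_pred = pipe.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('False positives:', fp, 'False negatives:', fn)
```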
- Creating a decision tree plot with labels
- Creating word clouds for disaster and non-disaster Tweets (a sketch follows the list)
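A sketch of these visualizations, reusing the fitted decision-tree pipeline and DataFrame from the earlier sketches; the step names and class labels shown are assumptions:

```python
from sklearn.tree import plot_tree
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Decision tree with feature and class labels (top levels only, for readability)
vec = pipe.named_steps['tfidfvectorizer']
tree = pipe.named_steps['decisiontreeclassifier']
plot_tree(tree, max_depth=2, feature_names=vec.get_feature_names_out(),
          class_names=['no disaster', 'disaster'], filled=True)
plt.show()

# Word clouds for each class
for label, title in [(1, 'Disaster tweets'), (0, 'Non-disaster tweets')]:
    text = ' '.join(df.loc[df['target'] == label, 'text'])
    plt.imshow(WordCloud(width=800, height=400).generate(text))
    plt.title(title)
    plt.axis('off')
    plt.show()
```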
A successful model was built and its score was submitted to Kaggle.