This is a statistical analysis of the Crime in Vancouver dataset, available from Kaggle. Several machine learning approaches are also being applied to the dataset for predictive modelling. (Under development)
**Link to dataset:** https://www.kaggle.com/wosaku/crime-in-vancouver/data
Files to know
data_analysis.py
This file provides all the functions needed to generate initial results from the dataset.
Running the analysis function inside data_analysis.py lists the columns of the dataset and the different types of criminal activities reported.
It also shows the number of neighbourhoods and their names.
This shows the per-type counts for the various criminal activities reported.
It also provides an interactive prompt for a year and month, and reports the criminal activities recorded for that month.
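
The per-type and per-month counts described above can be sketched with the standard library alone. This is a minimal illustration, not the code in data_analysis.py: the sample rows are made up, and the field names `TYPE`, `YEAR` and `MONTH` as well as the helper `counts_for_month` are assumptions.

```python
from collections import Counter

# Hypothetical sample rows; the field names TYPE, YEAR and MONTH are assumed.
rows = [
    {"TYPE": "Mischief", "YEAR": 2016, "MONTH": 5},
    {"TYPE": "Theft of Bicycle", "YEAR": 2016, "MONTH": 5},
    {"TYPE": "Mischief", "YEAR": 2016, "MONTH": 5},
    {"TYPE": "Mischief", "YEAR": 2017, "MONTH": 1},
]

def counts_for_month(rows, year, month):
    """Count reported incidents per crime type for one year and month."""
    return Counter(
        row["TYPE"]
        for row in rows
        if row["YEAR"] == year and row["MONTH"] == month
    )

monthly = counts_for_month(rows, 2016, 5)
for crime_type, count in monthly.most_common():
    print(crime_type, count)
```

In an interactive session, the year and month would come from `input()` instead of being hard-coded.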
The distribution of crimes per day is given here. It is approximately normal, with a mean of around 95 and an outlier above 600.
The crime_time_series analysis shows how the number of criminal activities varied over time within two standard deviations (2 STDs) of the mean.
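
As a rough sketch of the two-standard-deviation band, the snippet below computes the bounds for a list of daily counts and flags the days that fall outside them. The counts are invented for illustration and `two_sigma_band` is a hypothetical helper, not a function from this repository.

```python
from statistics import mean, stdev

# Hypothetical daily incident counts (not taken from the dataset).
daily_counts = [92, 101, 88, 97, 95, 640, 90, 99, 93, 96]

def two_sigma_band(counts):
    """Return (lower, upper) bounds at mean +/- 2 sample standard deviations."""
    mu = mean(counts)
    sigma = stdev(counts)
    return mu - 2 * sigma, mu + 2 * sigma

lower, upper = two_sigma_band(daily_counts)

# Days whose count falls outside the band are candidate outliers.
outliers = [c for c in daily_counts if c < lower or c > upper]
print(outliers)
```

Note that a single extreme day inflates both the mean and the standard deviation, so the band computed this way is wider than the one the typical days alone would give.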
Here, we introduce a new column named "Classification", which is set to +1 if the criminal activity occurred via Vehicle Collision and to -1 otherwise. We ran the ML models over 50,000 data points due to memory constraints on our computing systems.
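
The labelling step could look roughly like the sketch below. Several details are assumptions rather than facts about this project: the field name `TYPE`, matching vehicle collisions by the prefix "Vehicle Collision", and drawing the 50,000 rows as a seeded random subsample (the text does not say how the rows were chosen).

```python
import random

MAX_ROWS = 50_000  # row cap mentioned in the text

def classify(crime_type):
    """Return +1 for vehicle-collision records and -1 for everything else."""
    # Assumption: collision records are identified by this prefix.
    return 1 if crime_type.startswith("Vehicle Collision") else -1

def add_classification(rows, max_rows=MAX_ROWS, seed=0):
    """Subsample rows if needed and attach a +1/-1 'Classification' field."""
    if len(rows) > max_rows:
        # Assumption: a seeded random subsample; the original may differ.
        rows = random.Random(seed).sample(rows, max_rows)
    return [dict(row, Classification=classify(row["TYPE"])) for row in rows]

# Hypothetical sample rows for illustration.
sample_rows = [
    {"TYPE": "Vehicle Collision or Pedestrian Struck (with Injury)"},
    {"TYPE": "Mischief"},
    {"TYPE": "Theft of Vehicle"},
]
labelled = add_classification(sample_rows)
print([row["Classification"] for row in labelled])
```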
Logistic Regression
Our main reason for applying logistic regression rather than linear regression is that the outcome is categorical (+1, -1).
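
To make the idea concrete, here is a small standard-library sketch of logistic regression with labels in {+1, -1}, fitted by batch gradient descent. It is not the implementation used in this project (which may rely on a library), and the single standardized feature, the learning rate and the epoch count are all hypothetical choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=200):
    """Minimise the mean logistic loss log(1 + exp(-y * (w.x + b)))
    by batch gradient descent, with labels y in {+1, -1}."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            coeff = -yi * sigmoid(-yi * z)  # derivative of the loss w.r.t. z
            for j, xj in enumerate(xi):
                grad_w[j] += coeff * xj
            grad_b += coeff
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, x):
    """Return +1 if the modelled probability of class +1 is at least 0.5."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else -1

# Hypothetical standardized feature values and their +1/-1 labels.
X = [[-1.0], [-0.6], [-0.2], [0.2], [0.6], [1.0]]
y = [-1, -1, -1, 1, 1, 1]

w, b = fit_logistic(X, y)
print([predict(w, b, x) for x in X])
```

Unlike a linear regression fit to the same labels, the model's output passes through the sigmoid, so it can be read as the probability of the +1 class.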
Results
ROC Curve:
Decision Tree
We also run a decision tree, which splits the dataset on the features selected by calculating entropy and information gain.
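
The entropy and information gain calculations behind such splits can be sketched as follows. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions; the project's own tree may be built with a library instead.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )

def information_gain(labels, groups):
    """Entropy reduction from splitting `labels` into the given `groups`."""
    total = len(labels)
    weighted = sum(len(group) / total * entropy(group) for group in groups)
    return entropy(labels) - weighted

# Toy +1/-1 labels and two candidate splits.
labels = [1, 1, -1, -1]
perfect_split = [[1, 1], [-1, -1]]  # each group is pure
useless_split = [[1, -1], [1, -1]]  # each group is as mixed as the parent

print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

At each node, the tree would choose the candidate split with the highest information gain, so the perfect split above would be preferred over the useless one.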