07_Crime_Hotspot_Map.Rmd

---
title: "Making a Hotspot Map"
author:
- affiliation: University of Pennsylvania
  email: gridge@upenn.edu
  name: Greg Ridgeway
- affiliation: University of Pennsylvania
  email: moyruth@upenn.edu
  name: Ruth Moyer
- affiliation: University of Pennsylvania
  email: gohl@upenn.edu
  name: Li Sian Goh
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    css: htmlstyle.css
---

<!-- HTML YAML header Ctrl-Shift-C to comment/uncomment -->


<!-- PDF YAML header Ctrl-Shift-C to comment/uncomment -->
<!-- --- -->
<!-- title: "Making a Hotspot Map" -->
<!-- author: -->
<!-- - Greg Ridgeway (gridge@upenn.edu) -->
<!-- - Ruth Moyer (moyruth@upenn.edu) -->
<!-- date: "`r format(Sys.time(), '%B %d, %Y')`" -->
<!-- output: -->
<!--   pdf_document: -->
<!--     latex_engine: pdflatex -->
<!--   html_document: default -->
<!-- fontsize: 11pt -->
<!-- fontfamily: mathpazo -->
<!-- --- -->

<!-- Make RMarkdown cache the results -->
```{r echo=FALSE}
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, cache.lazy=FALSE, out.width='100%')
```

<!-- A function for automating the numbering and wording of the exercise questions -->
```{r echo=FALSE}
.counterExercise <- 0
.exerciseQuestions <- NULL
.exNum <- function(.questionText="") 
{
   .counterExercise <<- .counterExercise+1
   .questionText <- gsub("@@", "`", .questionText)
   .exerciseQuestions <<- c(.exerciseQuestions, .questionText)
   return(paste0(.counterExercise,". ",.questionText))
}
.exQ <- function(i)
{
   return( paste0(i,". ",.exerciseQuestions[i]) )
}
```


<!-- This document can't be created after running Lesson #8 SQL part 2. To recreate this document, first rerun Lesson #6 SQL Part 1. -->

# Introduction

Now that we've learned how to work with SQL, we can make a hotspot map with a large dataset. The concept of determining the geographic locations where crime is most intense - crime hotspots - is very important to criminological theory as well as to the very practical question of where police officers should be deployed. 

Here, we will make hotspot maps with our Chicago crime data (which requires use of SQL). Of course, if you had a small dataset, you could use these same methods to make a hotspot map using only R. 


# Setting up the data
Let's first load up the `sqldf` library and reconnect to our Chicago crime database.

```{r comment="", results='hold', cache=FALSE} 
library(sqldf)
con <- dbConnect(SQLite(), dbname="chicagocrime.db")
```

Let's run a SQL query to extract the two columns, `Latitude` and `Longitude` from our crime database, creating a datafame with one row for each crime incident in the dataset.
```{r comment="", results='hold'} 
res <- dbSendQuery(con, "
                   SELECT Latitude, Longitude
                   FROM crime
                   WHERE Latitude  IS NOT NULL AND
                         Longitude IS NOT NULL")
dataAllcrime <- fetch(res, n = -1)
dbClearResult(res)
```

How would your query differ if you wanted to do a hotspot map of only assaults? 
```{r comment="", results='hold'} 
res <- dbSendQuery(con, "
                   SELECT Latitude,Longitude
                   FROM crime
                   WHERE [Primary.Type]='ASSAULT' AND
                         Latitude  IS NOT NULL AND
                         Longitude IS NOT NULL")
dataAssaults <- fetch(res, n = -1)
dbClearResult(res)
```

# Creating the maps
We'll be using two packages to make our maps, `ggmap` and `hexbin`, so be sure you have those installed.
```{r comment="", results='hold', cache=FALSE}
library(ggmap)
library(hexbin)
```

Next, we'll need to get a map of Chicago. Google Maps used to be readily available through `ggmap`, but as of June 2018 the account requirements became more complicated for Google Maps. We will use other map services instead here. Stamen is a design organization that provides maps of several styles. Here we define the bounding box for the City of Chicago.

```{r comment="", results='hold', message=FALSE}
chicago.map <- ggmap(get_map(c(-88.0, 41.6, -87.5, 42.1), source="stamen"))
chicago.map
```

There are other styles of Stamen maps including `toner`
```{r comment="", results='hold', message=FALSE}
print(ggmap(get_map(c(-88.0, 41.6, -87.5, 42.1), source="stamen", maptype="toner")))
```

and `watercolor`.
```{r comment="", results='hold', message=FALSE}
print(ggmap(get_map(c(-88.0, 41.6, -87.5, 42.1), source="stamen", maptype="watercolor")))
```

Let's plot as points on the Chicago map, a random sample of 100,000 rows from our dataset. 
```{r comment="", results='hold'}
i <- sample(1:nrow(dataAllcrime), 100000)
chicago.map +
  geom_point(aes(x=Longitude,y=Latitude), 
             data  = dataAllcrime[i,],
             alpha = 0.5, 
             color = "darkred", 
             size  = 1)
```
It doesn't look terribly useful because it is just a giant red splotch. These are all the points where a crime occurred. There's no density to show us the high crime or low crime areas. But, we're making process with getting points onto a map!

Let's instead try plotting hexagonal bins, which the `hexbin` package let us do. Why hexagons? It turns out that squares tend to have problems in the corners. The precision is low there and our eye tends to get drawn to the parallel lines the square grid makes. Ideally, we want to use a shape that has a small perimeter-to-area ratio. Circles have the smallest perimeter-to-area ratio, but we can't use circles to tile over the map. Hexagons fall in between squares and circles.

Note: when you run the following lines of code, you will get the following message: "Coordinate system already present. Adding new coordinate system, which will replace the existing one." Don't panic, this message is supposed to appear. 
```{r comment="", results='hold'}
chicago.map +
  coord_cartesian() +
  stat_binhex(aes(x=Longitude, y=Latitude),
              bins = 60, 
              data = dataAllcrime)
```

Or try the following. Setting `alpha=2/4` makes the bins a little more transparent. Try comparing with `alpha=1/4` or `alpha=3/4`.

```{r comment="", results='hold'}
chicago.map +
  coord_cartesian() +
  stat_binhex(aes(x=Longitude, y=Latitude),
              bins  = 60, 
              alpha = 0.5, 
              data  = dataAllcrime)
```

The previous map was a default monochromatic color (blue). You can change the color gradient. To see what colors are available, type `colors()`.

```{r comment="", results='hold'}
chicago.map +
  coord_cartesian() +
  stat_binhex(aes(x=Longitude, y=Latitude),
              bins  = 60, 
              alpha = 0.5, 
              data = dataAllcrime) +
  scale_fill_gradient('Crime Density',
                      low="springgreen",
                      high="hotpink")
```

You can also create a map that shows the high crime areas in a way that is akin to elevation on a topographic map. The more concentrated the concentric areas, the higher the crime (or, in the case of a topographic map, the higher the terrain). Note that a plot using `stat_density2d()` requires a lot of memory. That's why it's a good idea to use a sample of cases.

```{r comment="", results='hold'}
i <- sample(1:nrow(dataAllcrime),100000)
chicago.map +
  stat_density2d(aes(x=Longitude,y=Latitude),
                 bins = 5, 
                 data = dataAllcrime[i,], 
                 geom = 'density2d',
                 col  = 'white')
```

Let's make a hotspot map with a gradient, red for high crime and green for low crime. 

```{r comment="", results='hold'}
chicago.map +
  stat_density2d(aes(x=Longitude, y=Latitude,
                     fill=..level..,
                     alpha=..level..),
                 data = dataAllcrime[i,], 
                 geom = 'polygon') +
  scale_fill_gradient('Crime Density',
                      low  = "green",
                      high = "red") +
  scale_alpha(range = c(.4, .75), 
              guide = FALSE) +
  guides(fill = guide_colorbar(barwidth = 1.5, barheight = 10))
```

We can use `bins` to control the number of levels of colors plotted.
```{r comment="", results='hold'}
chicago.map +
  stat_density2d(aes(x=Longitude, y=Latitude,
                     fill=..level..,
                     alpha=..level..),
                 bins = 20,
                 data = dataAllcrime[i,],
                 geom = 'polygon') +
  scale_fill_gradient('Crime Density',
                      low  = "green",
                      high = "red") +
  scale_alpha(range = c(.4, .75),
              guide = FALSE) +
  guides(fill = guide_colorbar(barwidth = 1.5, barheight = 10))
```

Let's create hotspot maps with just motor vehicle theft, and let's make a map for each year between 2005 and 2014. Note the line with the SQL function `SUBSTR()`. The `SUBSTR()` command helps us grab a section of a string. Here's an example. Take a look at the `Date` column in our initial dataset. 
```{r comment="", results='hold'}
res <- dbSendQuery(con, "SELECT Date 
                         FROM crime
                         LIMIT 5")
fetch(res, n = -1)
dbClearResult(res)
```
We want only the year part of these dates. The year starts with the 7th character and has a length of 4 characters. So to extract just the year we select `SUBSTR(Date, 7, 4)`.

```{r comment="", results='hold'}
res <- dbSendQuery(con, "
                   SELECT Latitude, 
                          Longitude,
                          SUBSTR(Date,7,4) AS year
                   FROM crime
                   WHERE [Primary.Type]='MOTOR VEHICLE THEFT' AND
                         Latitude  IS NOT NULL AND
                         Longitude IS NOT NULL")
dataCarTheft <- fetch(res, n = -1)
dbClearResult(res)

for(year0 in 2005:2017)
{
  print(
    chicago.map +
       stat_density2d(aes(x=Longitude, y=Latitude,
                         fill=..level..,
                         alpha=..level..),
                     size = 2,
                     bins = 4,
                     geom = 'polygon',
                     data = subset(dataCarTheft, year==year0)) +
       scale_fill_gradient(year0) +
       scale_alpha(range = c(.4, .75), guide = FALSE) +
       guides(fill = guide_colorbar(barwidth = 1.5, barheight = 10))
  )
  # Sys.sleep(5) # uncomment this if you want a 5 second pause between plots
}

```

Many cities other than Chicago have accessible incident-level data so that you can try to make a hotspot map. Cities include Jersey City, Philadelphia, Los Angeles, Seattle, San Francisco, Baltimore, and Washington, DC.