-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathcrime_deep_dive.Rmd
246 lines (177 loc) · 7.14 KB
/
crime_deep_dive.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
---
title: "Crime Data Dive"
author: "Nate Day"
date: "2/4/2018"
output:
ioslides_presentation:
incremental: yes
smaller: yes
slidy_presentation: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(magrittr)
library(tidyverse)
library(broom)
library(viridis)
library(sf)
library(tidycensus)
```
## Why drug crime? {.build}
- Cooler than parking tickets
- Access to ~31,000 observations, over 5 years!
- Have you seen The Wire?

## Goals of this talk {.build}
- Show distribution of crime in our city
- Provide an open-template for analyzing more data
- Get people excited about using R

## Data Process {.build}
1. Open Data Portal - Crime Reports
2. Geocode via Google Map API
3. Use `library(sf) %>% tidyverse` for spatial data exploration
4. Model and test patterns with `library(spdep)` and `glm()`
```{r libs, echo = TRUE, message = FALSE, warning = FALSE}
library(tidyverse) # duh
library(magrittr) # %<>% life
library(viridis) # it's pretty
library(sf) # the new kid
library(spdep) # the grandaddy
```
## Where is crime happening? {.build}
```{r crime_map, echo=FALSE}
my_github <- "https://github.com/NathanCDay/cville_crime/raw/master/data/"
crime_counts <- readRDS(gzcon(url(paste0(my_github,"crime_counts_sf.RDS"))))
census <- readRDS(gzcon(url(paste0(my_github,"census_sf.RDS"))))
ggplot(crime_counts) +
geom_sf(data = census) +
geom_sf(aes(size = n, color = n, alpha = n)) +
scale_color_viridis() +
facet_wrap(~drug_flag)
crime_counts %<>% st_set_geometry(NULL)
```
## Most frequent addresses {.smaller .build}
Top 3 address for drug crime and not drug crime
```{r most_frequent, echo = T}
arrange(crime_counts, -n) %>% group_by(drug_flag) %>% slice(1:3)
```
The police station's address is 606 E Market Street....
## What is going on at the police station? {.smaller .build}
"The answer is quite simple - when individuals walk in to the police department to file a report the physical address of the department (606 E Market Street) is often used in that initial report if no other known address is available at the time. This is especially true for incidents of found or lost property near the downtown mall where there is no true known incident location. The same is true for any warrant services that result in a police report occurring at the police department." - CPD
<div

## Test if proportions are different {.smaller .build}
```{r police_station_proportions, echo = TRUE}
station_props <- arrange(crime_counts, -n) %>%
group_by(drug_flag) %>%
add_count(wt = n) %>%
slice(1)
with(station_props, prop.test(n, nn)) %>% tidy
```
## Aggregate into areas {.smaller .build}
Census blocks make a lot of sense becuase:
+ Tons of data in Census and American Community Surveys
+ Reputable source with code books and APIs
+ Easy to access in R via ODP and `library(tidycensus)`

## Start to do it {.smaller .build}
```{r census_pop, echo = TRUE, warning=FALSE, message=FALSE, fig.height=4, cache = TRUE}
long_url <- "https://opendata.arcgis.com/datasets/e60c072dbb734454a849d21d3814cc5a_14.geojson"
census <- geojsonio::geojson_read(long_url, what = "sp") %>%
st_as_sf()
ggplot(census, aes(fill = HU_Vacant / Housing_Units)) +
geom_sf() + scale_fill_viridis()
```
## Keep doing it {.smaller .build}
+ Start with geocoded version
```{r raw_geo, message=FALSE, echo = TRUE}
crime <- read_csv("https://github.com/NathanCDay/cville_crime/raw/master/crime_geocode.csv")
crime %<>% filter(complete.cases(.))
crime %<>% filter(address != "600 E MARKET ST Charlottesville VA")
```
+ Convert to `sf`, with same CRS (critical)
```{r crime_sf, echo = TRUE}
crime %<>% st_as_sf(coords = c("lon", "lat"), crs = st_crs(census))
```
+ Use `sf::st_within()` and friends
```{r st_within, echo = TRUE, message = FALSE}
crime %<>% mutate(within = st_within(crime, census) %>% as.numeric) %>%
filter(!is.na(within))
```
There are bunch of other great `st_x(sf_a, sf_b)` functions too. If you want to do it, there's a tool for it.
## Done with it {.smaller .build}
+ Flag by interest
```{r drug_flag, echo = TRUE}
crime %<>% mutate(drug_flag = ifelse(grepl("drug", Offense, ignore.case = TRUE),
"drugs", "not_drugs"))
```
+ Summarise with `tidyverse`
```{r tidy_sum, echo = TRUE}
crime_block <- st_set_geometry(crime, NULL) %>% # remove geometry for spread() to work
group_by(within, drug_flag) %>%
count() %>%
spread(drug_flag, n) %>%
mutate(frac_drugs = drugs / sum(drugs + not_drugs)) %>%
ungroup() # geom_sf doesn't care for grouped dfs/tbls
```
+ Join in
```{r, echo = TRUE}
census %<>% inner_join(crime_block, by = c("OBJECTID" = "within"))
```
## Hot blocks {.smaller}
```{r, echo = TRUE}
ggplot(census, aes(fill = frac_drugs)) +
geom_sf() + scale_fill_viridis()
```
## Is it random? {.smaller}
Test with Moran's I statistic
```{r, fig.height = 3}
census_sp <- as(census, "Spatial")
block_wnb <- poly2nb(census_sp) %>% nb2listw
moran.mc(census_sp$frac_drugs, block_wnb, nsim = 999) %>% tidy
```
## Get more data {.smaller}
Get median `income` and `age` data from the American Community Survey via `tidycensus` to supplement housing and demographics from the Census.
```{r model_vars, echo=F, fig.height = 4, warning =F, message = F}
cvl <- get_acs(geography = "block group", county = "Charlottesville", state = "VA",
variables = c("B19013_001", "B01002_001") )
decode <- c("income", "age") %>% set_names(c("B19013_001", "B01002_001"))
cvl$variable %<>% decode[.]
cvl %<>% select(GEOID, variable, estimate) %>%
spread(variable, estimate)
# ggplot(cvl, aes(age, income)) +
# geom_point()
# missing values
# let's impute them by neighbors
cvl %<>% rename(BlockGroup = GEOID)
census %<>% full_join(cvl)
# sequester the missing values value
miss <- census %>% filter(is.na(income))
# calculate the mean its neightbors
miss$income <- st_touches(miss, census) %>% # return the row_ids for adjacent polygons
map_dbl(~ census[., ] %>% with(mean(income))) # calculate the means per missing block
# builder decoder
dc <- miss$income %>% set_names(miss$OBJECTID)
# back together again
census %<>% mutate(income = ifelse(is.na(income), dc[as.character(OBJECTID)], income))
census %<>% mutate(frac_black = Black / Population,
frac_vacant = HU_Vacant / Housing_Units)
# pred column positions for ggpairs()
pred_cols <- match(c("frac_drugs", "frac_vacant", "age", "income", "frac_black"), names(census))
library(GGally)
ggpairs(census, columns = pred_cols)
```
## What are the strong predictors? {.smaller .build}
Since the response is the proportion of drug crime to total, i.e. bounded by 0 and 1, we want to use the quasibinomial distribution instead of the Gaussian.
```{r mod_sum}
mod <- glm(frac_drugs ~ frac_black + frac_vacant + income + age,
data = census, family = quasibinomial())
summary(mod) %>% coefficients()
```
Test `resid(mod)` again for spatial correlation, a good model should have randomly dispearsed residuals.
```{r}
moran.mc(resid(mod), block_wnb, nsim = 999) %>% tidy
```
## Questions? | Thanks for listening
