-
Notifications
You must be signed in to change notification settings - Fork 3
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
workshop material in one little repo
- Loading branch information
1 parent
95be528
commit 920b399
Showing
9 changed files
with
1,062 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,246 @@ | ||
--- | ||
title: "Crime Data Dive" | ||
author: "Nate Day" | ||
date: "2/4/2018" | ||
output: | ||
ioslides_presentation: | ||
incremental: yes | ||
smaller: yes | ||
slidy_presentation: default | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = FALSE) | ||
library(magrittr) | ||
library(tidyverse) | ||
library(broom) | ||
library(viridis) | ||
library(sf) | ||
library(tidycensus) | ||
``` | ||
|
||
## Why drug crime? {.build} | ||
|
||
- Cooler than parking tickets | ||
- Access to ~31,000 observations, over 5 years! | ||
- Have you seen The Wire? | ||
|
||
 | ||
|
||
## Goals of this talk {.build} | ||
|
||
- Show distribution of crime in our city | ||
- Provide an open-template for analyzing more data | ||
- Get people excited about using R | ||
|
||
 | ||
|
||
## Data Process {.build} | ||
|
||
1. Open Data Portal - Crime Reports | ||
2. Geocode via Google Map API | ||
3. Use `library(sf) %>% tidyverse` for spatial data exploration | ||
4. Model and test patterns with `library(spdep)` and `glm()` | ||
|
||
```{r libs, echo = TRUE, message = FALSE, warning = FALSE} | ||
library(tidyverse) # duh | ||
library(magrittr) # %<>% life | ||
library(viridis) # it's pretty | ||
library(sf) # the new kid | ||
library(spdep) # the grandaddy | ||
``` | ||
|
||
## Where is crime happening? {.build} | ||
|
||
```{r crime_map, echo=FALSE} | ||
my_github <- "https://github.com/NathanCDay/cville_crime/raw/master/data/" | ||
crime_counts <- readRDS(gzcon(url(paste0(my_github,"crime_counts_sf.RDS")))) | ||
census <- readRDS(gzcon(url(paste0(my_github,"census_sf.RDS")))) | ||
ggplot(crime_counts) + | ||
geom_sf(data = census) + | ||
geom_sf(aes(size = n, color = n, alpha = n)) + | ||
scale_color_viridis() + | ||
facet_wrap(~drug_flag) | ||
crime_counts %<>% st_set_geometry(NULL) | ||
``` | ||
|
||
|
||
## Most frequent addresses {.smaller .build} | ||
Top 3 address for drug crime and not drug crime | ||
```{r most_frequent, echo = T} | ||
arrange(crime_counts, -n) %>% group_by(drug_flag) %>% slice(1:3) | ||
``` | ||
|
||
The police station's address is 606 E Market Street.... | ||
|
||
## What is going on at the police station? {.smaller .build} | ||
|
||
"The answer is quite simple - when individuals walk in to the police department to file a report the physical address of the department (606 E Market Street) is often used in that initial report if no other known address is available at the time. This is especially true for incidents of found or lost property near the downtown mall where there is no true known incident location. The same is true for any warrant services that result in a police report occurring at the police department." - CPD | ||
|
||
<div | ||
 | ||
|
||
## Test if proportions are different {.smaller .build} | ||
|
||
```{r police_station_proportions, echo = TRUE} | ||
station_props <- arrange(crime_counts, -n) %>% | ||
group_by(drug_flag) %>% | ||
add_count(wt = n) %>% | ||
slice(1) | ||
with(station_props, prop.test(n, nn)) %>% tidy | ||
``` | ||
|
||
## Aggregate into areas {.smaller .build} | ||
|
||
Census blocks make a lot of sense becuase: | ||
|
||
+ Tons of data in Census and American Community Surveys | ||
+ Reputable source with code books and APIs | ||
+ Easy to access in R via ODP and `library(tidycensus)` | ||
|
||
 | ||
|
||
## Start to do it {.smaller .build} | ||
```{r census_pop, echo = TRUE, warning=FALSE, message=FALSE, fig.height=4, cache = TRUE} | ||
long_url <- "https://opendata.arcgis.com/datasets/e60c072dbb734454a849d21d3814cc5a_14.geojson" | ||
census <- geojsonio::geojson_read(long_url, what = "sp") %>% | ||
st_as_sf() | ||
ggplot(census, aes(fill = HU_Vacant / Housing_Units)) + | ||
geom_sf() + scale_fill_viridis() | ||
``` | ||
|
||
## Keep doing it {.smaller .build} | ||
|
||
+ Start with geocoded version | ||
```{r raw_geo, message=FALSE, echo = TRUE} | ||
crime <- read_csv("https://github.com/NathanCDay/cville_crime/raw/master/crime_geocode.csv") | ||
crime %<>% filter(complete.cases(.)) | ||
crime %<>% filter(address != "600 E MARKET ST Charlottesville VA") | ||
``` | ||
|
||
+ Convert to `sf`, with same CRS (critical) | ||
```{r crime_sf, echo = TRUE} | ||
crime %<>% st_as_sf(coords = c("lon", "lat"), crs = st_crs(census)) | ||
``` | ||
|
||
+ Use `sf::st_within()` and friends | ||
```{r st_within, echo = TRUE, message = FALSE} | ||
crime %<>% mutate(within = st_within(crime, census) %>% as.numeric) %>% | ||
filter(!is.na(within)) | ||
``` | ||
|
||
There are bunch of other great `st_x(sf_a, sf_b)` functions too. If you want to do it, there's a tool for it. | ||
|
||
## Done with it {.smaller .build} | ||
|
||
+ Flag by interest | ||
```{r drug_flag, echo = TRUE} | ||
crime %<>% mutate(drug_flag = ifelse(grepl("drug", Offense, ignore.case = TRUE), | ||
"drugs", "not_drugs")) | ||
``` | ||
|
||
+ Summarise with `tidyverse` | ||
|
||
```{r tidy_sum, echo = TRUE} | ||
crime_block <- st_set_geometry(crime, NULL) %>% # remove geometry for spread() to work | ||
group_by(within, drug_flag) %>% | ||
count() %>% | ||
spread(drug_flag, n) %>% | ||
mutate(frac_drugs = drugs / sum(drugs + not_drugs)) %>% | ||
ungroup() # geom_sf doesn't care for grouped dfs/tbls | ||
``` | ||
|
||
+ Join in | ||
```{r, echo = TRUE} | ||
census %<>% inner_join(crime_block, by = c("OBJECTID" = "within")) | ||
``` | ||
|
||
## Hot blocks {.smaller} | ||
|
||
```{r, echo = TRUE} | ||
ggplot(census, aes(fill = frac_drugs)) + | ||
geom_sf() + scale_fill_viridis() | ||
``` | ||
|
||
## Is it random? {.smaller} | ||
|
||
Test with Moran's I statistic | ||
|
||
```{r, fig.height = 3} | ||
census_sp <- as(census, "Spatial") | ||
block_wnb <- poly2nb(census_sp) %>% nb2listw | ||
moran.mc(census_sp$frac_drugs, block_wnb, nsim = 999) %>% tidy | ||
``` | ||
|
||
## Get more data {.smaller} | ||
Get median `income` and `age` data from the American Community Survey via `tidycensus` to supplement housing and demographics from the Census. | ||
|
||
```{r model_vars, echo=F, fig.height = 4, warning =F, message = F} | ||
cvl <- get_acs(geography = "block group", county = "Charlottesville", state = "VA", | ||
variables = c("B19013_001", "B01002_001") ) | ||
decode <- c("income", "age") %>% set_names(c("B19013_001", "B01002_001")) | ||
cvl$variable %<>% decode[.] | ||
cvl %<>% select(GEOID, variable, estimate) %>% | ||
spread(variable, estimate) | ||
# ggplot(cvl, aes(age, income)) + | ||
# geom_point() | ||
# missing values | ||
# let's impute them by neighbors | ||
cvl %<>% rename(BlockGroup = GEOID) | ||
census %<>% full_join(cvl) | ||
# sequester the missing values value | ||
miss <- census %>% filter(is.na(income)) | ||
# calculate the mean its neightbors | ||
miss$income <- st_touches(miss, census) %>% # return the row_ids for adjacent polygons | ||
map_dbl(~ census[., ] %>% with(mean(income))) # calculate the means per missing block | ||
# builder decoder | ||
dc <- miss$income %>% set_names(miss$OBJECTID) | ||
# back together again | ||
census %<>% mutate(income = ifelse(is.na(income), dc[as.character(OBJECTID)], income)) | ||
census %<>% mutate(frac_black = Black / Population, | ||
frac_vacant = HU_Vacant / Housing_Units) | ||
# pred column positions for ggpairs() | ||
pred_cols <- match(c("frac_drugs", "frac_vacant", "age", "income", "frac_black"), names(census)) | ||
library(GGally) | ||
ggpairs(census, columns = pred_cols) | ||
``` | ||
|
||
## What are the strong predictors? {.smaller .build} | ||
|
||
Since the response is the proportion of drug crime to total, i.e. bounded by 0 and 1, we want to use the quasibinomial distribution instead of the Gaussian. | ||
|
||
```{r mod_sum} | ||
mod <- glm(frac_drugs ~ frac_black + frac_vacant + income + age, | ||
data = census, family = quasibinomial()) | ||
summary(mod) %>% coefficients() | ||
``` | ||
|
||
Test `resid(mod)` again for spatial correlation, a good model should have randomly dispearsed residuals. | ||
|
||
```{r} | ||
moran.mc(resid(mod), block_wnb, nsim = 999) %>% tidy | ||
``` | ||
|
||
|
||
|
||
## Questions? | Thanks for listening | ||
|
||
 |
Large diffs are not rendered by default.
Oops, something went wrong.
Oops, something went wrong.