---
title: "Extracting data from text and geocoding"
author:
- affiliation: University of Pennsylvania
email: [email protected]
name: Greg Ridgeway
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
css: htmlstyle.css
editor_options:
chunk_output_type: console
---
<!-- HTML YAML header Ctrl-Shift-C to comment/uncomment -->
<!-- --- -->
<!-- title: "Extracting data from text and geocoding" -->
<!-- author: -->
<!-- - Greg Ridgeway ([email protected]) -->
<!-- date: "`r format(Sys.time(), '%B %d, %Y')`" -->
<!-- output: -->
<!-- pdf_document: -->
<!-- latex_engine: pdflatex -->
<!-- html_document: default -->
<!-- fontsize: 11pt -->
<!-- fontfamily: mathpazo -->
<!-- --- -->
<!-- PDF YAML header Ctrl-Shift-C to comment/uncomment -->
<!-- Make RMarkdown cache the results -->
```{r echo=FALSE, cache=FALSE}
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, cache.lazy=FALSE, out.width='100%')
library(leaflet)
```
<!-- A function for automating the numbering and wording of the exercise questions -->
```{r echo=FALSE}
.counterExercise <- 0
.exerciseQuestions <- NULL
.exNum <- function(.questionText="")
{
.counterExercise <<- .counterExercise+1
.questionText <- gsub("@@", "`", .questionText)
.exerciseQuestions <<- c(.exerciseQuestions, .questionText)
return(paste0(.counterExercise,". ",.questionText))
}
.exQ <- function(i)
{
return( paste0(i,". ",.exerciseQuestions[i]) )
}
```
# Introduction
In this section, we are going to explore officer-involved shootings (OIS) in Philadelphia. The Philadelphia Police Department posts a lot of information about officer-involved shootings online going back to 2007. Have a look at their [OIS webpage](http://www.phillypolice.com/ois/). While a lot of information has been posted to the webpage, more information is buried in pdf files associated with each of the incidents. In order for us to explore these data, we are going to scrape the basic information from the webpage, have R dig into the pdf files for any incidents missing incident dates, clean up addresses using regular expressions, geocode the addresses to latitude/longitude using the OpenStreetMap geocoder and the ArcGIS geocoder (using JSON), and then make maps describing the shootings.
# Scraping the OIS data
Let's start by grabbing the raw HTML from the PPD OIS webpage
```{r comment="", results='hold'}
a <- scan("http://www.phillypolice.com/ois/", what="", sep="\n")
```
`scan()` is a very simple function that just pulls in text from a file or URL. It does not attempt to do any formatting. `what=""` tells `scan()` to treat what it is reading in as text and `sep="\n"` tells `scan()` to break the text apart whenever it encounters a line feed character.
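Base R's `readLines()` is a close alternative that also returns one element per line of the page; a sketch:
```{r eval=FALSE}
# same idea with readLines(); each element of a is one line of HTML
a <- readLines("http://www.phillypolice.com/ois/")
```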
The first several elements of `a` are just HTML code setting up the page.
```{r comment="", results='hold'}
a[1:4]
```
But further on down you will find HTML code containing the OIS information that we seek. Let's look at one of the 2018 OISs.
```{r comment="", results='hold'}
i <- grep("id=\"2018-2954", a)
a[i + 0:9]
```
For the data from 2013-2020, each table row related to an OIS starts with something like `<tr id="2018-2954">`. The very next line contains the URL of the pdf with more detailed data. The third line contains the date and the fourth line contains the address. There are additional cells indicating injuries and how the shooting was adjudicated, but we will not work with these in this exercise.
Note that if we want the date, it is always two elements after the `<tr>` tag. We can use `gsub()` to strip away the unwanted HTML tags and whitespace.
```{r comment="", results='hold'}
a[i + 2]
gsub("[[:space:]]|<[^>]*>","",a[i + 2])
```
We'll use this same strategy for all of the shootings to extract each OIS's id, date, location, and URL. Start by using `grep()` to find all of the lines of HTML code that start off a row for an OIS from 2013-2020 (the data for OISs before 2013 look a little different).
```{r comment="", results='hold'}
i <- grep("id=\"20(13|14|15|16|17|18|19|20)",a)
ois <- data.frame(
id =gsub("<[^>]*>", "", a[i+1]),
date =gsub("<[^>]*>", "", a[i+2]),
location=gsub("[[:space:]]*<[^>]*>", "", a[i+3]),
url =gsub("[[:space:]]*<td><a href=\"(.*)\" class=.*", "\\1", a[i+1]))
```
For shootings between 2007-2012, the table provides no incident date and the incident location is 2 elements after the `<tr>` tag rather than 3.
```{r comment="", results='hold'}
i <- grep("id=\"20(07|08|09|10|11|12)",a)
temp <- data.frame(
id =gsub("<[^>]*>","",a[i+1]),
date =NA,
location=gsub("<[^>]*>","",a[i+2]),
url =gsub("<td><a href=\"(.*)\" class=.*","\\1",a[i+1]))
```
Now we can stack all the data from 2007-2020 together and clear out some extra spaces and tabs in the `id` and `date` columns.
```{r comment="", results='hold'}
ois <- rbind(ois, temp)
ois$id<- gsub("[[:space:]]*","",ois$id)
ois$date<- gsub("[[:space:]]*","",ois$date)
ois[1:5,]
```
Everything from the PPD OIS page is now neatly stored in an R data frame.
A couple of the entries have problematic `id` values, some of which are easily fixed.
```{r comment="", results='hold'}
grep("^[^-]*$", ois$id, value=TRUE)
ois$id[ois$id=="1630"] <- "16-30"
ois$id[ois$id=="1822"] <- "18-22"
ois$id[ois$id=="1911"] <- "19-11"
ois$id[ois$id=="1914"] <- "19-14"
```
While we are here, we also need to fix the URL for these OISs since they have a space instead of a hyphen in them.
```{r comment="", results='hold'}
ois$url[ois$id=="16-30"] <- "https://ois.phillypolice.com/16-30"
ois$url[ois$id=="18-22"] <- "https://ois.phillypolice.com/18-22"
ois$url[ois$id=="19-11"] <- "https://ois.phillypolice.com/19-11"
ois$url[ois$id=="19-14"] <- "https://ois.phillypolice.com/19-14"
```
A few other OISs with missing `id` are a little harder to figure out.
```{r comment="", results='hold'}
subset(ois, id=="")
```
The Willows Avenue OIS has a missing `id`, but the one before it has `id` 14-12 and the one after it has `id` 14-16.
```{r comment="", results='hold'}
i <- grep("5100 block of Willows Avenue", ois$location)
ois[(i-1):(i+1),]
```
Let's see if there is a pdf for 14-13, 14-14, or 14-15.
```{r comment="", results='hold'}
a <- try(scan("http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/14-13.pdf", what=""))
a <- try(scan("http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/14-14.pdf", what=""))
a <- try(scan("http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/14-15.pdf", what=""))
```
You should receive errors for 14-13 and 14-14, so the OIS's `id` must be 14-15. We'll fix the URL to the associated pdf file too.
```{r comment="", results='hold'}
ois$id[grepl("5100 block of Willows Avenue",ois$location) &
ois$date=="04/22/2014"] <- "14-15"
ois$url[ois$id=="14-15"] <- "http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/14-15.pdf"
```
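As an aside, instead of reading the errors off the console, you could check the candidate ids programmatically. Here is a sketch that assumes the same URL pattern as above:
```{r eval=FALSE}
for(candidate in c("14-13", "14-14", "14-15"))
{
  u <- paste0("http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/",
              candidate, ".pdf")
  res <- try(scan(u, what="", quiet=TRUE), silent=TRUE)
  cat(candidate, if(inherits(res, "try-error")) "no pdf found\n" else "pdf found\n")
}
```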
The problem with the remaining two addresses is the smart quotes around "A" and "B" in the street names.
```{r comment="", results='hold'}
i <- which(ois$id=="")
ois[c((i[1]-1):(i[1]+1),(i[2]-1):(i[2]+1)), 1:3]
```
Before 2013, the names of the pdf files combined the OIS id and the location. But smart quotes can cause problems in URLs and file names. This is what likely caused a problem here. I made some guesses about what the pdf file name must be and found pdfs for ["10-65 B and Ontario St.pdf"](http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/2010/10-65%20B%20and%20Ontario%20St.pdf) and ["10-76 A and Louden St.pdf"](http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/2010/10-76%20A%20and%20Louden%20St.pdf). So, the `id` right after OIS 10-60 is 10-65 and the `id` right after OIS 10-74 is 10-76. We can clean up the locations and the URLs too.
```{r comment="", results='hold'}
ois$id[1 + which(ois$id=="10-60")] <- "10-65"
ois$id[1 + which(ois$id=="10-74")] <- "10-76"
ois$location[ois$id=="10-65"] <- "B St and Ontario St"
ois$location[ois$id=="10-76"] <- "A St and Louden St"
ois$url[ois$id=="10-65"] <- "http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/2010/10-65 B and Ontario St.pdf"
ois$url[ois$id=="10-76"] <- "http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/2010/10-76 A and Louden St.pdf"
```
There are several pdfs that don't appear on the OIS site. With a little more work I found other links to them.
```{r comment="", results='hold'}
# make a list of the OISs with missing pdfs
listPDFmissing <- c(with(ois, id[id>="14-15" & id<="14-32"]),
with(ois, id[id>="15-02" & id<="15-37"]))
i <- match(listPDFmissing, ois$id)
ois$url[i] <- paste0("http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/",listPDFmissing,".pdf")
# add the other two with URLs we already fixed
listPDFmissing <- c("10-76", "10-65", listPDFmissing)
# shooting 15-35 has a particularly odd URL
ois$url[ois$id=="15-35"] <- "https://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/PS15-35-summary-dj.doc.pdf"
```
Many of the locations have odd looking characters.
```{r comment="", results='hold'}
grep("&amp;", ois$location, value=TRUE)
```
HTML represents certain special characters with special codes. For example, "&amp;" is the ampersand, "&quot;" is a quote, and "&nbsp;" is a non-breaking space, a space that always keeps the word before and after the space on the same line. HTML special characters always start with a & and end with ;. Let's clean up these HTML codes and also remove any extra leading or trailing spaces from the location.
```{r comment="", results='hold'}
ois$location <- gsub("&quot;", "", ois$location)
ois$location <- gsub("&amp;", "and", ois$location)
ois$location <- gsub("^ *| *$", "", ois$location)
```
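If you run into other HTML entities, a more general option is to let an HTML parser decode them for you. Here is a quick sketch, assuming the `xml2` package is installed:
```{r eval=FALSE}
library(xml2)
# read_html() parses the fragment and xml_text() returns the decoded text
xml_text(read_html("<p>B St &amp; Ontario St</p>"))
```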
Lastly, let's reformat the dates using the `lubridate` package.
```{r comment="", results='hold'}
library(lubridate)
ois$date <- mdy(ois$date)
```
Now our data frame `ois` should have all its IDs, with properly formatted dates, clean locations, and correct URLs indicating the location of the pdf files with details about the shootings.
```{r comment="", results='hold'}
head(ois)
```
# Extracting data from pdf files
All of the OISs in 2012 and earlier are missing the incident dates.
```{r comment=""}
# pick the first two OISs in each year
a <- ois$id[ois$id<"13-00"]
a <- sapply(split(a, substr(a, 1, 2)), function(x) x[1:2])
subset(ois, id %in% sort(as.character(a)))[,1:3]
```
However, the pdf document describing each incident contains the date of the incident. For example, you can read in the pdf at [https://ois.sites.phillypolice.com/12-01/](https://ois.sites.phillypolice.com/12-01/) that the shooting was on January 1, 2012. Rather than reading all the pdf files and transcribing the dates, we are going to have R do all the work.
The package `pdftools` includes functions for exploring pdf files. Let's load the library and have a look at one of the pdf files describing incident 07-01.
```{r comment="", results='hold'}
library(pdftools)
a <- scan("https://ois.phillypolice.com/07-01", what="", sep="\n")
i <- grep("\\.pdf",a)[1]
pdfFilename0701 <- gsub('.*src="([^"]*)".*',"\\1", a[i])
pdfText0701 <- pdf_text(pdfFilename0701)
pdfText0701
```
`pdf_text()` extracts all the raw text from the pdf file. Right at the beginning of the file we can see a date, 1/1/07. That's what we want. Note that there are scattered `\r` or `\n` or both throughout. These are carriage return (`\r`) and line feed (`\n`) characters that signal the end of a line. The old printers would look for these characters to move the printer head back to the beginning of a line (carriage return) and advance the page to the next line (line feed). Nowadays, those same characters are still used to denote the end of a line. However, PCs use `\r\n`, Unix systems use `\n`, older Macs used `\r`, but Mac OS X adopted the Unix standard `\n`. Expect any of these combinations in data files.
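Because any of these line-ending conventions can show up, it is safer to split on any of the three. A small sketch of that idea:
```{r comment=""}
# split on \r\n, a lone \n, or a lone \r so any line-ending convention works
strsplit("line one\r\nline two\nline three\rline four", split="\r\n|\n|\r")[[1]]
```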
```{r comment=""}
a <- strsplit(pdfText0701, split = "\n")[[1]]
a
```
Now let's apply the `mdy()` function to `a`. Most of the text looks nothing like dates, so for those lines `mdy()` will just give us an `NA`. But for the properly formatted dates, we will get a date object back.
```{r comment=""}
a <- mdy(a)
a
```
In this case, only one of the lines had a proper date, but it is possible that some document might have more than one. So we will record just the first one using `sort(a)[1]`. By default, `sort()` tosses all the `NA`s.
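A tiny illustration of that behavior:
```{r comment="", warning=FALSE}
# mdy() returns NA for lines that are not dates; sort() drops the NAs,
# so the first element of the sorted result is the earliest valid date
sort(mdy(c("Police Officer #1", "3/15/2007", "1/1/2007")))[1]
```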
Let's put this all together. We will loop through each OIS, read its pdf file, and save the text in a new column in `ois`. OISs from 2015 and earlier have their text descriptions buried in a pdf file. From 2016 onward, the OIS description is in plain text on the web page. So first the hard part... extracting the text from the pdf files.
```{r message=TRUE}
ois$text <- NA
iOISwithPDFs <- grep("(07|08|09|10|11|12|13|14|15)-", ois$id)
for(i in iOISwithPDFs)
{
# first check if the OIS is on the list of missing PDFs
# earlier we had found an alternative URL for these
if(ois$id[i] %in% listPDFmissing)
{
pdfFilename <- ois$url[i]
} else
{
# scan the page to find the download pdf link
a <- scan(ois$url[i], what="", sep="\n")
# pdfs at sitescdn before 2014, switch to ois.sites in 2014
j <- grep("\\.pdf", a)[1]
pdfFilename <- gsub('.*src="([^"]*)".*',"\\1", a[j])
}
a <- pdf_text(pdfFilename)
# if more than one page, collapse all pages together
a <- paste(a, collapse="\n")
ois$text[i] <- a
}
```
The rest of the OISs have their data in plain text on the web pages, so we do not need to use `pdf_text()` to grab that information. Here is a for-loop that goes through all shootings from 2016 to 2020.
```{r comment=""}
iOISwithText <- grep("(16|17|18|19|20)-", ois$id)
for(i in iOISwithText)
{
a <- scan(ois$url[i], what="", sep="\n")
jStart <- grep("clearfix", a) + 1
jEnd <- grep("\\.entry-content", a) - 1
ois$text[i] <- paste(a[jStart:jEnd], collapse="\n")
}
```
Let's check a handful of the text fields to make sure we have what we think we should have there. We will use `substring()` to avoid printing out some of the very long text descriptions.
```{r comment=""}
i <- seq(1, nrow(ois), by=10)
substring(ois$text[i], 1, 40)
```
Now our `ois` data frame has columns: `r paste(names(ois), collapse=", ")`.
Right now the row for OIS 07-01 has a missing date.
```{r comment=""}
subset(ois, id=="07-01")[,1:3]
```
For any OIS that is missing the date, extract the date from the pdf text.
```{r comment="", warning=FALSE}
for(i in which(is.na(ois$date)))
{
# try to find dates, take the first valid date
a <- strsplit(ois$text[i], split = "\n")[[1]]
a <- sort(mdy(a))[1]
ois$date[i] <- as.character(a)
}
```
And let's check what happened to the date for OIS 07-01.
```{r comment=""}
subset(ois, id=="07-01")[,1:3]
```
And let's check that we have all valid dates.
```{r comment=""}
sum(is.na(ymd(ois$date)))
```
As another check, let's make sure that all the OISs with `id` starting with 07 have dates in 2007, OISs with `id` starting with 08 have dates in 2008, and so on.
```{r comment=""}
aggregate(year(date) ~ substr(id,1,2),
data=ois,
FUN=unique)
```
Yes, every incident's `id` matches with the year derived from the date. This does not guarantee that all the dates were extracted correctly, but it is a good check to catch any obvious errors.
Without having to transcribe dates from pdf files, we have now filled in all the dates!
# Geocoding the OIS locations
Our OIS data frame has the address for every incident, but to be more useful we really need the geographic coordinates. If we had the coordinates, then we could put them on a map, tabulate how many incidents occur within an area, calculate distances, and answer geographical questions about these data.
Geocoding is the process of converting a text description of a location (typically an address or intersection) into geographic coordinates (often longitude/latitude, but other coordinate systems are also possible). Google Maps currently reigns supreme in this area. Google Maps understands very general descriptions of locations. You can ask for the coordinates of something like "bobbys burger palace near UPenn" and it will understand that "UPenn" means the University of Pennsylvania and that "bobbys burger palace" is celebrity chef Bobby Flay's fast food burger joint. Unfortunately, as of June 2018 Google Maps requires a credit card in order to access its geocoding service. Previously, anyone could geocode up to 2,500 locations per day without needing to register. Also unfortunately, Bobby's Burger Palace [closed in 2019](https://www.thedp.com/article/2019/11/bobbys-burger-palace-closed-upenn-university-city).
These technologies are still rapidly evolving, so it is most important to learn how to learn to use these tools. We will use the OpenStreetMap geocoder and the ArcGIS geocoder to give you the sense of how to work with them.
Many web data sources use a standardized language for providing data. JSON (JavaScript Object Notation) is quite common and both OpenStreetMap and ArcGIS use JSON.
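Before sending requests to any web service, here is a minimal illustration of how a snippet of JSON maps onto an R object (the JSON below is hand-written and the values are made up):
```{r comment=""}
library(jsonlite)
# fromJSON() turns the JSON tags into named elements of an R object
fromJSON('{"lat": "39.9529", "lon": "-75.1932", "display_name": "McNeil Building (example)"}')
```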
The URL for OpenStreetMap has the form
`http://nominatim.openstreetmap.org/search/3718%20Locust%20Walk,%20Philadelphia,%20PA?format=json&addressdetails=1&limit=1`
You can see the address for Penn's McNeil Building embedded in this URL. Spaces need to be replaced with `%20` (20 is the hexadecimal ASCII code for the space character). Let's see what data we get back from this URL.
```{r comment="", results='hold'}
scan("http://nominatim.openstreetmap.org/search/3718%20Locust%20Walk,%20Philadelphia,%20PA?format=json&addressdetails=1&limit=1",
what="", sep="\n")
```
It is messy, but readable. You can see embedded in this text the `lat` and `lon` for this address. You can also see that it should not be too hard for a machine to extract these coordinates, and the rest of the information here, from this block of text. This is the point of JSON, producing data in a format that a human could understand in a small batch, but a machine could process quickly and easily.
Fortunately, the `jsonlite` R package facilitates the conversion of JSON text like this into convenient R objects.
```{r comment="", results='hold'}
library(jsonlite)
fromJSON("http://nominatim.openstreetmap.org/search/3718%20Locust%20Walk,%20Philadelphia,%20PA?format=json&addressdetails=1&limit=1")
```
`fromJSON()` converts the results from the OpenStreetMap geocoder to a row in a data frame. The JSON tags turn into column names and the values are placed as data in a row.
The major drawback of the OpenStreetMap geocoder is that it does not handle intersections. If we try to geocode "38th St and Walnut St", the results come back empty.
```{r comment="", results='hold'}
fromJSON("http://nominatim.openstreetmap.org/search/38th%20St%20and%20Walnut%20St,%20Philadelphia,%20PA?format=json&addressdetails=1&limit=1")
```
This is where the ArcGIS geocoder comes in handy. The ArcGIS geocoder also works with JSON but the URL construction is a little different. Here we get several results, but clearly the first one is the one that we want. It is also the one that has the highest `score`.
```{r comment="", results='hold'}
fromJSON("https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?f=json&singleLine=38th%20St%20and%20Walnut%20St,%20Philadelphia,%20PA&outFields=Match_addr,Addr_type")
```
To make geocoding using these two services a little more convenient, we can create two functions that automate the process of taking an address, filling in `%20` for spaces in the appropriate URL, and retrieving the JSON results from the geocoding service.
```{r comment="", results='hold'}
geocodeOSM <- function(address)
{
a <- gsub(" +", "\\%20", address)
a <- paste0("http://nominatim.openstreetmap.org/search/",
a,
"?format=json&addressdetails=0&limit=1")
return( fromJSON(a) )
}
geocodeARCGIS <- function(address)
{
a <- gsub(" +", "\\%20", address)
a <- paste0("https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?f=json&singleLine=",
a,
"&outFields=Match_addr,Addr_type")
return( fromJSON(a) )
}
```
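Replacing spaces with a regular expression works here because our addresses are simple. A more general option is base R's `URLencode()`, which percent-encodes other special characters too. Here is a sketch of that variation (the helper name `geocodeOSM2` is just for illustration):
```{r eval=FALSE}
geocodeOSM2 <- function(address)
{
  # URLencode() with reserved=TRUE percent-encodes spaces, commas, and other special characters
  a <- URLencode(address, reserved=TRUE)
  fromJSON(paste0("http://nominatim.openstreetmap.org/search/", a,
                  "?format=json&addressdetails=0&limit=1"))
}
```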
Let's test out `geocodeOSM()` by pulling up a map of the geocoded coordinates. Once we have the latitude and longitude for the McNeil Building, where we typically hold our crime data science courses at Penn, we can use `leaflet()` to show us a map of the area.
```{r comment="", message=FALSE, fig.cap="OSM geocoding result for 3718 Locust Walk"}
library(leaflet)
gcPenn <- geocodeOSM("3718 Locust Walk, Philadelphia, PA")
leaflet() |>
addTiles() |>
setView(lng=gcPenn$lon, lat=gcPenn$lat, zoom=18) |>
addCircleMarkers(lng=as.numeric(gcPenn$lon),
lat=as.numeric(gcPenn$lat))
```
`leaflet()` prepares the mapping process. `addTiles()` pulls in the relevant map image (buildings and streets). `setView()` takes the longitude and latitude from our `gcPenn` object, sets that as the center of the map, and zooms in to level "18," which is a fairly close zoom of about one block. `addCircleMarkers()` creates a circle at the selected point.
The map shows the western part of the Penn campus with 3718 Locust Walk (the square building near the center of the map). The geocoded point is close... but off by one building. Maybe the ArcGIS geocoder will do better.
```{r comment="", message=FALSE, fig.cap="ArcGIS geocoding result for 3718 Locust Walk"}
gcPenn <- geocodeARCGIS("3718 Locust Walk, Philadelphia, PA")
gcPenn <- gcPenn$candidates[1,]
gcPenn$lon <- gcPenn$location$x
gcPenn$lat <- gcPenn$location$y
leaflet() |>
addTiles() |>
setView(lng=gcPenn$lon, lat=gcPenn$lat, zoom=18) |>
addCircleMarkers(lng=gcPenn$lon,
lat=gcPenn$lat)
```
We are almost ready to throw all of our addresses at these geocoders, but let's first make sure the addresses look okay. Several OISs are missing locations.
```{r comment="", results='hold'}
i <- which(ois$location %in% c("","withheld","Withheld"))
ois$text[i]
```
Several of these text descriptions of the incidents contain the location information. Let's fill those in.
```{r comment="", results='hold'}
ois$location[ois$id=="16-18"] <- "3200 block of Wellington Street"
ois$location[ois$id=="10-06"] <- "Howard and Grange Street"
ois$location[ois$id=="08-06"] <- "200 block of Clapier Street"
ois$location[ois$id=="08-18"] <- "900 block of E. Slocum Street"
ois$location[ois$id=="08-30"] <- "700 block of W. Rockland Street"
ois$location[ois$id=="08-40"] <- "5400 Jefferson Street"
ois$location[ois$id=="08-60"] <- "3000 Memphis Street"
ois$location[ois$id=="08-70"] <- "1300 block of S. 29th Street"
ois$location[ois$id=="08-74"] <- "5600 block of N. Mascher Street"
```
While browsing the locations you might come across some that aren't quite right. This one, for example, needs fixing.
```{r comment="", results='hold'}
# This one was just "51st Arch"
ois$location[ois$id=="07-19"] <- "51st St and Arch"
```
The text of this incident indicates that Philadelphia PD officers were not involved in the shooting.
```{r comment="", results='hold'}
ois$text[ois$id=="17-08"]
ois <- subset(ois, id != "17-08")
```
Several of the addresses are of the form "5400 block of Erdick St". I want these to get geocoded to the middle of the block. So I'm going to change addresses like these to be like "5450 Erdick St".
```{r comment="", results='hold'}
# put "blocks" at the midpoint
ois$location <- gsub("00 block( of)?", "50", ois$location, ignore.case=TRUE)
ois$location <- gsub("unit bl(oc)?k( of)?", "50", ois$location, ignore.case=TRUE)
# and one additional cleanup "Rear Alley of 300 block of N. 55th Street"
ois$location <- gsub("Rear Alley of |near ", "", ois$location, ignore.case = TRUE)
```
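A quick spot check of this rewrite on one example address:
```{r comment=""}
# "5400 block of Erdick St" should become "5450 Erdick St"
gsub("00 block( of)?", "50", "5400 block of Erdick St", ignore.case=TRUE)
```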
Now let's run all of the addresses through the OpenStreetMap geocoder. We could have geocoded all these addresses with the simpler code `lapply(a, geocodeOSM)`. However, if the JSON connection to the OpenStreetMap website fails for even one of the addresses (likely if you have a poor internet connection), then the whole `lapply()` call fails. With the for-loop implementation, if the connection fails, then `gcOIS` still keeps all of the prior geocoding results and you can restart the for-loop at the point where it failed.
```{r comment="", results='hold'}
# add city and state to each address to improve geocoding accuracy
a <- paste0(ois$location,", Philadelphia, PA")
# create a list to store the geocoding results
gcOIS <- vector("list", nrow(ois))
for(i in 1:nrow(ois))
{
gcOIS[[i]] <- geocodeOSM(a[i])
if(length(gcOIS[[i]]) == 0)
{
cat("Could not geocode address #",i,":",a[i],"\n")
}
}
```
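If your internet connection is unreliable, a variation on this loop wraps each request in `try()` so a dropped connection is reported but does not halt the loop. Here is a sketch of that idea:
```{r eval=FALSE}
for(i in 1:nrow(ois))
{
  # try() catches a failed connection instead of stopping the loop
  res <- try(geocodeOSM(a[i]), silent=TRUE)
  if(inherits(res, "try-error"))
  {
    cat("Connection failed for address #", i, ":", a[i], "\n")
  } else
  {
    gcOIS[[i]] <- res
  }
}
```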
Geocoding with OSM failed on numerous addresses. Some of them predictably failed as they are formatted poorly (e.g. Harbison Ave -Phila, Philadelphia, PA). Many others failed because they are intersections and OpenStreetMap does not handle intersections (yet). So let's send the addresses that OpenStreetMap could not geocode to the ArcGIS geocoder.
```{r comment="", results='hold'}
for(i in which(lengths(gcOIS)==0))
{
gcOIS[[i]] <- geocodeARCGIS(a[i])
}
```
Now `gcOIS` has some of the geocoding results from OpenStreetMap and some from ArcGIS. The results are in different formats. The ArcGIS results have a component named `candidates`. So we'll use the presence/absence of a `candidates` component to figure out which geocoding service delivered the results. Then we can extract the longitude, latitude, and some additional features. The ArcGIS geocoder gives numerous results, but we will just take the top scoring one, which is the most likely match.
```{r comment="", results='hold'}
b <- lapply(gcOIS, function(x)
{
if(is.null(x$candidates)) # OSM
{
a <- data.frame(lon=as.numeric(x$lon),
lat=as.numeric(x$lat),
score=as.numeric(x$importance),
loctype=paste(x$class, x$type, sep=":"),
method="osm",
addressGeo=x$display_name)
} else # ArcGIS
{
a <- data.frame(lon=x$candidates$location[1,"x"],
lat=x$candidates$location[1,"y"],
score=x$candidates$score[1],
loctype=x$candidates$attributes$Addr_type[1],
method="arcgis",
addressGeo=x$candidates$attributes$Match_addr[1])
}
return(a)
})
gcOIS <- do.call(rbind, b)
# add a column containing the original address
gcOIS <- cbind(gcOIS, addressOrig = a)
# add an ID column
gcOIS <- cbind(id=ois$id, gcOIS)
head(gcOIS)
```
Now it appears that we have longitude and latitude for every incident. We should check that they all look sensible.
```{r comment="", results='hold'}
stem(gcOIS$lat)
stem(gcOIS$lon)
```
All the points have latitude between 39 and 40 and longitude around -75.
There's no perfect way to check the geocoding results aside from checking each individual point and verifying with a map that it lands on the right spot. We can do things like check for strange outliers, and we can spot check a few locations on a map.
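Before the spot checks, a quick sanity check is to flag any point that falls outside a rough bounding box around Philadelphia (the box coordinates below are approximate):
```{r comment=""}
# flag any geocoded point outside a rough bounding box around Philadelphia
iOut <- which(gcOIS$lon < -75.29 | gcOIS$lon > -74.95 |
              gcOIS$lat <  39.85 | gcOIS$lat >  40.15)
gcOIS[iOut, c("id","addressOrig","lon","lat")]
```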
```{r comment="", message=FALSE, fig.cap=gcOIS$addressOrig[i]}
i <- which(ois$id=="18-01")
leaflet() |>
addTiles() |>
setView(lng=gcOIS$lon[i], lat=gcOIS$lat[i], zoom=18) |>
addCircleMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i]) |>
addLabelOnlyMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i],
label = gcOIS$addressOrig[i],
labelOptions = labelOptions(noHide = TRUE,
direction = 'top',
textOnly = TRUE))
```
```{r comment="", message=FALSE, fig.cap=gcOIS$addressOrig[i]}
i <- which(ois$id=="13-14")
leaflet() |>
addTiles() |>
setView(lng=gcOIS$lon[i], lat=gcOIS$lat[i], zoom=18) |>
addCircleMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i]) |>
addLabelOnlyMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i],
label = gcOIS$addressOrig[i],
labelOptions = labelOptions(noHide = TRUE,
direction = 'top',
textOnly = TRUE))
```
```{r comment="", message=FALSE, fig.cap=gcOIS$addressOrig[i]}
i <- which(ois$id=="07-13")
leaflet() |>
addTiles() |>
setView(lng=gcOIS$lon[i], lat=gcOIS$lat[i], zoom=18) |>
addCircleMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i]) |>
addLabelOnlyMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i],
label = gcOIS$addressOrig[i],
labelOptions = labelOptions(noHide = TRUE,
direction = 'top',
textOnly = TRUE))
```
The first of these was geocoded correctly. I looked up 2850 Kensington in Google Maps to verify its location. In the second map you can see Arrott St and Frankford Ave, but the dot is not at their intersection. The address shown in the map title is missing an "and" between "Arrott Street" and "Frankford Avenue". So where did the geocoder place these points?
```{r comment="", results='hold'}
subset(gcOIS, id %in% c("18-01","13-14","07-13"))
```
Note that the `loctype` for the first address is "place:house" and `loctype` for the other two are "railway:station" and "landuse:industrial". "place:house" signals that the geocoding was accurate to a very specific location. But the others are problematic. OpenStreetMap did not understand that we were looking for intersections and instead geocoded to a transit center and an industrial location.
Ideally we want geocoding to get us to a house, intersection, or address. If `loctype` is residential, locality, city, or streetname, then this indicates that the geocoding did not get us to a very specific location. The 13th and Pike Streets geocoding seemed to fail completely.
```{r comment="", results='hold'}
sort(table(gcOIS$loctype))
```
Several of these are very specific locations (office, shop, building, station, house, PointAddress, StreetAddress, StreetInt). Many others are not specific at all (highway, neighborhood, city, Locality, StreetName). Each of these needs to be revisited. Let's examine one group here: those geocoded only down to `loctype="StreetName"`.
```{r comment="", results='hold'}
subset(gcOIS, loctype=="StreetName")
```
We can see that the problem for many of these is simply bad addresses. Some are intersections that are missing the "and" between the two streets. Let's dig into the incident at Harbison Ave to see if we can learn more.
```{r comment="", results='hold'}
ois$text[ois$location=="Harbison Ave -Phila"]
```
The text describes the path of the suspect moving south on Torresdale and turning west on Bridge. Let's pull up a map of this area. I've overlaid the path of the suspect on top of the map. The officers describe the suspect moving "south" on Torresdale and "west" on Bridge because I-95 and the Delaware River are just south of this area and run more east-west here. So, thinking that I-95 runs north-south and that the Delaware River is the east boundary of Philadelphia, officers may describe someone running "west" on Bridge St as running away from I-95 and the Delaware River.
```{r comment="", results='hold', message=FALSE, fig.cap="Path of individual involved in OIS 12-20"}
leaflet() |>
addTiles() |>
setView(lng=-75.0691987, lat=40.0138755, zoom=18) |>
addPolylines(lng=c(-75.068621,-75.070423,-75.070594),
lat=c( 40.013570, 40.013184, 40.013710),
color = "red")
```
It took some investigation, but we can fix this address too. Here's how we can fix the addresses that we've identified fixes for so far (the rest are left as an exercise). `textConnection()` is a nice trick for making a small data frame right inside a script. I'm going to make two columns, one with the original address and one with the corrected or improved address.
```{r comment="", message=FALSE}
a <- textConnection(
"addressOrig,addressFix
Arrott Street Frankford Avenue,Arrott Street and Frankford Avenue
13th and Pike Streets,13th and Pike St
Harbison Ave -Phila,Bridge St and Ditman St")
a <- read.csv(a)
# make sure the original addresses match, no NAs!
i <- match(a$addressOrig, ois$location)
i
b <- lapply(paste0(a$addressFix,", Philadelphia, PA"), geocodeARCGIS)
b <- lapply(b, function(x)
{
data.frame(lon=x$candidates$location[1,"x"],
lat=x$candidates$location[1,"y"],
score=x$candidates$score[1],
loctype=x$candidates$attributes$Addr_type[1],
method="arcgis",
addressGeo=x$candidates$attributes$Match_addr[1],
stringsAsFactors = FALSE)
})
b <- do.call(rbind, b)
gcOIS[i,names(b)] <- b
gcOIS[i,]
```
These appear to be fixed. `addressGeo` appears correct and the `loctype` is now `StreetInt`. All good signs.
Let's create new longitude/latitude columns in our `ois` data frame so that one object contains all of our essential data.
```{r comment="", results='hold', message=FALSE}
ois$lon <- gcOIS$lon
ois$lat <- gcOIS$lat
```
We'll proceed as if we're satisfied with our geocoding even though you know you have more work to do to fix some of those non-specific geocoding problems. We close this section with a map of the city of Philadelphia and the locations of all officer involved shootings.
```{r comment="", results='hold', message=FALSE, fig.cap="All Philadelphia Officer-involved Shootings", cache=FALSE}
leaflet() |>
addTiles() |>
fitBounds(-75.288486,39.868285,-74.950965,40.138251) |>
addCircles(lng=gcOIS$lon, lat=gcOIS$lat, color="red",
popup = paste(gcOIS$id, gcOIS$addressOrig))
```
# Working with shapefiles and coordinate systems
The Philadelphia Police Department divides the city into Police Service Areas (PSAs). The city provides a *shapefile*, a file containing geographic data, that describes the boundaries of the PSAs at Philadelphia's [open data site](https://www.opendataphilly.org/dataset/police-service-areas). R can read these files using the `st_read()` function provided in the `sf` (simple features) package. Even though `st_read()` appears to only be accessing `Boundaries_PSA.shp`, you should have all of the `Boundaries_PSA` files in your `10_shapefiles_and_data` folder. The other files have information that `st_read()` needs, like the coordinate system stored in `Boundaries_PSA.prj`. If you do not have all `Boundaries_PSA` files in your folder, then in a few lines you will get errors like "the sfc object should have crs set," meaning that the Coordinate Reference System (CRS) is missing.
```{r comment="", results='hold', message=FALSE}
library(sf)
PPDmap <- st_read("10_shapefiles_and_data/Boundaries_PSA.shp")
```
`PPDmap` is an `sf` (simple features) object. It is not unlike a data frame, but it can contain a column containing geographic information associated with a row of other data. Here are the two columns in `PPDmap` that are of primary interest.
```{r comment="", results='hold', message=FALSE}
PPDmap[,c("PSA_NUM","geometry")]
```
The first column shows the PSA number and the second column shows a truncated description of the geometry associated with this row. In this case, `geometry` contains the coordinates of the boundary of the PSA for each row. Use `st_geometry()` to extract the polygons to make a plot.
```{r comment="", results='hold', message=FALSE}
plot(st_geometry(PPDmap))
axis(side=1) # add x-axis
axis(side=2) # add y-axis
# extract the center points of each PSA
a <- st_coordinates(st_centroid(st_geometry(PPDmap)))
# add the PSA number to the plot
text(a[,1], a[,2], PPDmap$PSA_NUM, cex=0.5)
```
We can extract the actual coordinates of one of the polygons if we wish.
```{r comment="", message=FALSE}
a <- st_coordinates(PPDmap$geometry[1])
head(a)
tail(a)
```
And we can use those coordinates to add additional features to our plot.
```{r comment="", message=FALSE}
plot(st_geometry(PPDmap))
axis(side=1)
axis(side=2)
a <- st_coordinates(st_centroid(st_geometry(PPDmap)))
text(a[,1], a[,2], PPDmap$PSA_NUM, cex=0.5)
a <- st_coordinates(PPDmap$geometry[1])
lines(a[,1], a[,2], col="red", lwd=3)
```
This highlights in red PSA 77, at the southern end of Philadelphia. Note that R issued some warnings about our centroid calculations. We will return to those in a moment.
Rather than extracting coordinates to add a feature to a plot, `subset()` provides an easier method.
```{r comment="", message=FALSE}
plot(st_geometry(PPDmap))
axis(side=1)
axis(side=2)
a <- st_coordinates(st_centroid(st_geometry(PPDmap)))
text(a[,1], a[,2], PPDmap$PSA_NUM, cex=0.5)
plot(st_geometry(subset(PPDmap, PSA_NUM=="077")),
add=TRUE, border="red")
plot(st_geometry(subset(PPDmap, PSA_NUM=="183")),
add=TRUE, border="green")
```
Setting `add=TRUE` in the last two calls to `plot()` asks R to overlay the current plot with these additional objects.
Now, back to those warnings we received about calculating centroids with longitude/latitude data. Geographic datasets that describe locations on the surface of the earth have a "coordinate reference system" (CRS). Let's extract the CRS for `PPDmap`.
```{r comment="", results='hold', message=FALSE}
st_crs(PPDmap)
```
The coordinate system used to describe the PPD boundaries is the World Geodetic System 1984 (WGS84) maintained by the United States National Geospatial-Intelligence Agency, one of several standards to aid in navigation and geography. The European Petroleum Survey Group (EPSG) maintains a catalog of different coordinate systems (it should be no surprise that oil exploration has driven the development of high quality geolocation standards). They have assigned the standard longitude/latitude coordinate system to be [EPSG4326](http://spatialreference.org/ref/epsg/4326/). You can find the full collection of coordinate systems at [spatialreference.org](http://spatialreference.org/ref/epsg/). You can see in the output above a reference to EPSG 4326.
Many of us are comfortable with the longitude/latitude angular coordinate system. However, the distance covered by a degree of longitude shrinks as you move towards the poles and only equals the distance covered by a degree of latitude at the equator. In addition, the earth is not perfectly spherical, so the coordinate system used for computing distances on the earth's surface might need to depend on where on the earth you are.
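To make that concrete, here is a quick check with `st_distance()` (the `sf` package is already loaded): at Philadelphia's latitude a degree of longitude spans roughly 85 km, compared with about 111 km at the equator.
```{r comment=""}
# distance covered by one degree of longitude at 40 degrees north versus at the equator
p <- st_sfc(st_point(c(-76, 40)), st_point(c(-75, 40)),
            st_point(c(-76,  0)), st_point(c(-75,  0)), crs=4326)
st_distance(p[1], p[2])   # about 85 km at Philadelphia's latitude
st_distance(p[3], p[4])   # about 111 km at the equator
```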
Almost all web mapping tools (Google Maps, ESRI, OpenStreetMap) use the pseudo-Mercator projection ([EPSG3857](http://spatialreference.org/ref/epsg/3857/)). Let's convert our PPD map to that coordinate system.
```{r comment="", results='hold', message=FALSE}
PPDmap <- st_transform(PPDmap, crs=3857)
st_crs(PPDmap)
```
The CRS now indicates that this is a Mercator projection with distance measured in meters (`UNIT["Meter",1]`). Now if we ask for the centroids of the PSAs, we get more accurate centroids and no warnings from R.
```{r comment="", results='hold', message=FALSE}
st_centroid(st_geometry(PPDmap))
```
We can use the same projection, but modify it so that distances are measured in feet. You should see that the `UNIT[]` has changed to Foot_US.
```{r comment="", results='hold', message=FALSE}
PPDmap <- st_transform(PPDmap, crs=st_crs("+init=epsg:3857 +units=us-ft"))
st_crs(PPDmap)
```
There is a special coordinate system for every part of the world. A useful coordinate system for the Philadelphia area is [EPSG2272](http://spatialreference.org/ref/epsg/2272/). Let's convert our PPD map to that coordinate system.
```{r comment="", results='hold', message=FALSE}
PPDmap <- st_transform(PPDmap, crs=2272)
st_crs(PPDmap)
```
This coordinate system is the Lambert Conic Conformal (LCC). This particular projection of the `PPDmap` is tuned to provide good precision for the southern part of Pennsylvania (note that the standard parallels are at latitudes in southern Pennsylvania and the central meridian is a little west of Philadelphia) and distances are measured in feet (note the `UNIT` tag in the CRS description).
Let's transform back to longitude/latitude. It really is best to work in a projected coordinate system, but I'm going to stick with longitude/latitude so that the values make a little more sense to us. Also, at the scale of Philadelphia, we're just using the centroid calculation to figure out where to put labels. `st_transform()` does allow you to just use the EPSG code by itself.
```{r comment="", results='hold', message=FALSE}
PPDmap <- st_transform(PPDmap, crs=4326)
```
Now both the OIS coordinates and the PPD polygons are on the same scale.
```{r comment="", results='hold', message=FALSE}
plot(st_geometry(PPDmap), axes=TRUE)
points(lat~lon, data=gcOIS, col=rgb(1,0,0,0.5), pch=16)
```
To make the dots a little transparent, I've used the `rgb()` function with which you can mix red, green, and blue colors and set the transparency. The `1` tells `rgb()` to use maximum red. The two `0`s tell `rgb()` to use no green or blue. The 0.5 tells `rgb()` to make the dots halfway transparent.
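You can see the color code that `rgb()` produces directly:
```{r comment=""}
# full red, no green, no blue, 50% opacity
rgb(1, 0, 0, 0.5)
```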
# Spatial joins
A spatial join is the process of linking two data sources by their geography. For the OIS data, we want to know how many OISs occurred in each PSA. To do this we need to drop each OIS point location into the PSA polygons and have R tell us in which polygon each OIS landed.
First we need to convert our `ois` data frame to an `sf` object, communicating to R that the `lon` and `lat` columns are special. At this stage we also have to communicate what coordinate system the `lon` and `lat` values use. `st_as_sf()` converts an R object into an `sf` object.
```{r comment="", results='hold', message=FALSE}
ois <- st_as_sf(ois,
coords=c("lon","lat"),
crs=4326)
# check that everything looks okay
plot(st_geometry(PPDmap), axes=TRUE)
plot(st_geometry(ois), add=TRUE, col=rgb(1,0,0,0.5), pch=16)
```
`st_join()` will match each row in `ois` to the PSA polygon in which it falls. I just want the `PSA_NUM` column out of `PPDmap`.
```{r comment="", results='hold', message=FALSE}
PSAlookup <- st_join(ois, PPDmap[,"PSA_NUM"])
PSAlookup[1:3, c("id","date","location","PSA_NUM","geometry")]
```
Now our `PSAlookup` contains everything from `ois` but also adds a new column `PSA_NUM`.
Let's examine the PSAs with the most OISs and highlight them on the map.
```{r comment="", message=FALSE}
a <- rev(sort(table(PSAlookup$PSA)))
a
plot(st_geometry(PPDmap), axes=TRUE)
# plot the OISs in the PSA with the most OISs
i <- which(PSAlookup$PSA_NUM==names(a)[1])
plot(st_geometry(PSAlookup[i,]), add=TRUE, col=rgb(0,1,0,0.5), pch=16)
# plot the OISs in the PSA with the second most OISs
i <- which(PSAlookup$PSA_NUM==names(a)[2])
plot(st_geometry(PSAlookup[i,]), add=TRUE, col=rgb(0,0,1,0.5), pch=16)
```
Let's identify which OISs occurred in the same PSA as the University of Pennsylvania. We've already geocoded Penn and have its coordinates.
```{r comment="", message=FALSE}
gcPenn
# map them to the spatial coordinates
st_join(st_as_sf(gcPenn,
coords=c("lon","lat"),
crs=4326),
PPDmap)
```
Now we see that Penn is in PSA 183 and we can highlight those points on the map.
```{r comment="", message=FALSE}
plot(st_geometry(PPDmap), axes=TRUE)
i <- which(PSAlookup$PSA_NUM=="183")
plot(st_geometry(PSAlookup[i,]), add=TRUE, col=rgb(1,0,0,0.5), pch=16)
```
Lastly, we will tabulate the number of OISs in each PSA and color the map by the number of OISs.
```{r comment="", message=FALSE}
# how many shootings in each PSA?
a <- table(PSAlookup$PSA_NUM)
a
# merge the shooting count into the PPDmap data
i <- match(PPDmap$PSA_NUM, names(a))
PPDmap$nShoot <- a[i]
PPDmap[1:3,]
```
We can see that `PPDmap` now has a new `nShoot` column. A histogram will show what kinds of counts we observe in the PSAs.
```{r comment="", message=FALSE}
hist(a, xlab="Number of OISs", ylab="Number of PSAs", main="")
```
Let's discretize the OIS counts into a few categories.
```{r comment="", message=FALSE}
a <- cut(PPDmap$nShoot,
breaks=c(0,1,5,10,15,20,25,30),
right=FALSE)
a
```
`cut()` converts all of the individual counts into categories, like [1,5) or [25,30). For each of these categories we will associate a color for the map. `heat.colors()` will generate a sequence of colors in the yellow, orange, red range.
```{r comment="", message=FALSE}
col <- rev(heat.colors(7,1))
col
```
These are eight-digit codes (after the leading #) describing the color. The first two digits correspond to red, digits three and four correspond to green, digits five and six correspond to blue, and the last two digits correspond to transparency. These are hexadecimal numbers (base 16). Hexadecimal numbers use the digits 0-9, like normal decimal system numbers, and then denote 10 as A, 11 as B, on up to 15 as F. So FF as a decimal is $15 \times 16 + 15 = 255$, which is the maximum value for a two-digit hexadecimal number. The hexadecimal 80 as a decimal is $8 \times 16 + 0 = 128$, which is in the middle of the range 0 to 255. So the first color code, FFFF80FF, means maximum red, maximum green, half blue, and not transparent at all. This mixture is known more commonly as "yellow".
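You can check these conversions in R with `strtoi()`, which converts text in a given base to an integer:
```{r comment=""}
strtoi("FF", base=16)   # 255, the maximum two-digit hexadecimal value
strtoi("80", base=16)   # 128, about the middle of the range 0 to 255
```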
Now we need to select the right color for each PSA. If I apply `as.numeric()` to `a`, then all the [0,1) will convert to 1, the [1,5) will convert to 2, and so on up to [25,30) converting to 7. So `col[as.numeric(a)]` will pick out the right color for each PSA. Now create the map coloring each PSA using the right color.
```{r comment="", message=FALSE}
plot(st_geometry(PPDmap), col=col[as.numeric(a)], border=NA)
# add the number of shootings to the map
a <- st_coordinates(st_centroid(PPDmap))
text(a[,1], a[,2], PPDmap$nShoot, cex=0.5)
```
Those PSAs with the fewest shootings are a very pale yellow. As we examine PSAs with a greater number of OISs, their colors get redder and redder.
# Summary
We started with just a web page linking to a collection of pdf files. We used regular expressions to extract everything we could from the web page tables. We had R "read" the pdf files to extract the dates that were not readily available. We geocoded the shooting locations so that we could put them on a map. Finally, we could tabulate by PSA the number of OISs and map those as well.
If you've worked through all of this, then I would recommend that you save your objects, using `save(ois, PSAlookup, gcOIS, file="PPDOIS.RData")`. That way you will not have to scrape everything off the web again or redo any geocoding.
# Exercises
`r .exNum('Revisit the geocoding section discussing geocoding errors. Examine the OISs that have not been geocoded to specific locations. Fix their addresses and redo the geocoding of these OISs to improve the accuracy of the data.')`
`r .exNum('Identify officer-involved shootings that resulted in the offender being transported to the Hospital at the University of Pennsylvania. Create a map marking the location of HUP, the location of officer-involved shootings resulting in the offender being transported to HUP, and the locations of all other shootings.')`
`r .exNum('For each shooting determine which hospital treated the offender. Use @@st_distance()@@ to determine what percentage of those shot in an OIS went to the closest hospital.')`