---
title: "Extracting data from text and geocoding"
author:
- affiliation: University of Pennsylvania
email: [email protected]
name: Greg Ridgeway
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
css: htmlstyle.css
editor_options:
chunk_output_type: console
---
<!-- HTML YAML header Ctrl-Shift-C to comment/uncomment -->
<!-- --- -->
<!-- title: "Extracting data from text and geocoding" -->
<!-- author: -->
<!-- - Greg Ridgeway ([email protected]) -->
<!-- date: "`r format(Sys.time(), '%B %d, %Y')`" -->
<!-- output: -->
<!-- pdf_document: -->
<!-- latex_engine: pdflatex -->
<!-- html_document: default -->
<!-- fontsize: 11pt -->
<!-- fontfamily: mathpazo -->
<!-- --- -->
<!-- PDF YAML header Ctrl-Shift-C to comment/uncomment -->
<!-- Make RMarkdown cache the results -->
```{r echo=FALSE, cache=FALSE}
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, cache.lazy=FALSE, out.width='100%')
library(leaflet)
```
<!-- A function for automating the numbering and wording of the exercise questions -->
```{r echo=FALSE}
.counterExercise <- 0
.exerciseQuestions <- NULL
.exNum <- function(.questionText="")
{
.counterExercise <<- .counterExercise+1
.questionText <- gsub("@@", "`", .questionText)
.exerciseQuestions <<- c(.exerciseQuestions, .questionText)
return(paste0(.counterExercise,". ",.questionText))
}
.exQ <- function(i)
{
return( paste0(i,". ",.exerciseQuestions[i]) )
}
```
# Introduction
In this section, we are going to explore officer-involved shootings (OIS) in Philadelphia. The Philadelphia Police Department posts a lot of information about officer-involved shootings online going back to 2007. Have a look at their [OIS webpage](http://www.phillypolice.com/ois/). While a lot of information has been posted to the webpage, more information is buried in pdf files associated with each of the incidents. In order for us to explore these data, we are going to scrape the basic information from the webpage, have R dig into the pdf files for any incidents missing incident dates, clean up addresses using regular expressions, geocode the addresses to latitude/longitude using the OpenStreetMap geocoder and the ArcGIS geocoder (using JSON), and then make maps describing the shootings.
# Scraping the OIS data
Let's start by grabbing the raw HTML from the PPD OIS webpage
```{r comment="", results='hold'}
a <- scan("http://www.phillypolice.com/ois/", what="", sep="\n")
```
`scan()` is a very simple function that just pulls in text from a file or URL. It does not attempt to do any formatting. `what=""` tells `scan()` to treat what it is reading in as text and `sep="\n"` tells `scan()` to break the text apart whenever it encounters a line feed character.
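Base R's `readLines()` is a close alternative that also returns one element per line of the page; a sketch:
```{r eval=FALSE}
# same idea with readLines(); each element of a is one line of HTML
a <- readLines("http://www.phillypolice.com/ois/")
```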
The first several elements of `a` are just HTML code setting up the page.
```{r comment="", results='hold'}
a[1:4]
```
But further on down you will find HTML code containing the OIS information that we seek. Let's look at one of the 2018 OISs.
```{r comment="", results='hold'}
i <- grep("id=\"2018-2954", a)
a[i + 0:9]
```
For the data from 2013-2020, each table row related to an OIS starts with something like `<tr id="2018-2954">`. The very next line contains the URL of the pdf with more detailed data. The third line contains the date and the fourth line contains the address. There are additional cells indicating injuries and how the shooting was adjudicated, but we will not work with these in this exercise.
Note that if we want the date, it is always two elements after the `<tr>` tag. We can use `gsub()` to strip away the unwanted HTML tags and whitespace.
```{r comment="", results='hold'}
a[i + 2]
gsub("[[:space:]]|<[^>]*>","",a[i + 2])
```
We'll use this same strategy for all of the shootings to extract each OIS's id, date, location, and URL. Start by using `grep()` to find all of the lines of HTML code that start off a row for an OIS from 2013-2020 (the data for OISs before 2013 look a little different).
```{r comment="", results='hold'}
i <- grep("id=\"20(13|14|15|16|17|18|19|20)",a)
ois <- data.frame(
id =gsub("<[^>]*>", "", a[i+1]),
date =gsub("<[^>]*>", "", a[i+2]),
location=gsub("[[:space:]]*<[^>]*>", "", a[i+3]),
url =gsub("[[:space:]]*<td><a href=\"(.*)\" class=.*", "\\1", a[i+1]))
```
For shootings between 2007-2012, the table provides no incident date and the incident location is 2 elements after the `<tr>` tag rather than 3.
```{r comment="", results='hold'}
i <- grep("id=\"20(07|08|09|10|11|12)",a)
temp <- data.frame(
id =gsub("<[^>]*>","",a[i+1]),
date =NA,
location=gsub("<[^>]*>","",a[i+2]),
url =gsub("<td><a href=\"(.*)\" class=.*","\\1",a[i+1]))
```
Now we can stack all the data from 2007-2020 together and clear out some extra spaces and tabs in the `id` and `date` columns.
```{r comment="", results='hold'}
ois <- rbind(ois, temp)
ois$id<- gsub("[[:space:]]*","",ois$id)
ois$date<- gsub("[[:space:]]*","",ois$date)
ois[1:5,]
```
Everything from the PPD OIS page is now neatly stored in an R data frame.
A couple of the entries have problematic `id` values, some of which are easily fixed.
```{r comment="", results='hold'}
grep("^[^-]*$", ois$id, value=TRUE)
ois$id[ois$id=="1630"] <- "16-30"
ois$id[ois$id=="1822"] <- "18-22"
ois$id[ois$id=="1911"] <- "19-11"
ois$id[ois$id=="1914"] <- "19-14"
```
While we are here, we also need to fix the URL for these OISs since they have a space instead of a hyphen in them.
```{r comment="", results='hold'}
ois$url[ois$id=="16-30"] <- "https://ois.phillypolice.com/16-30"
ois$url[ois$id=="18-22"] <- "https://ois.phillypolice.com/18-22"
ois$url[ois$id=="19-11"] <- "https://ois.phillypolice.com/19-11"
ois$url[ois$id=="19-14"] <- "https://ois.phillypolice.com/19-14"
```
A few other OISs with missing `id` are a little harder to figure out.
```{r comment="", results='hold'}
subset(ois, id=="")
```
The Willows Avenue OIS has a missing `id`, but the one before it has `id` 14-12 and the one after it has `id` 14-16.
```{r comment="", results='hold'}
i <- grep("5100 block of Willows Avenue", ois$location)
ois[(i-1):(i+1),]
```
Let's see if there is a pdf for 14-13, 14-14, or 14-15.
```{r comment="", results='hold'}
a <- try(scan("http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/14-13.pdf", what=""))
a <- try(scan("http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/14-14.pdf", what=""))
a <- try(scan("http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/14-15.pdf", what=""))
```
You should receive errors for 14-13 and 14-14, so the OIS's `id` must be 14-15. We'll fix the URL to the associated pdf file too.
```{r comment="", results='hold'}
ois$id[grepl("5100 block of Willows Avenue",ois$location) &
ois$date=="04/22/2014"] <- "14-15"
ois$url[ois$id=="14-15"] <- "http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/14-15.pdf"
```
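As an aside, instead of reading the errors off the console, you could check the candidate ids programmatically. Here is a sketch that assumes the same URL pattern as above:
```{r eval=FALSE}
for(candidate in c("14-13", "14-14", "14-15"))
{
  u <- paste0("http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/",
              candidate, ".pdf")
  res <- try(scan(u, what="", quiet=TRUE), silent=TRUE)
  cat(candidate, if(inherits(res, "try-error")) "no pdf found\n" else "pdf found\n")
}
```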
The problem with the remaining two addresses is the smart quotes around "A" and "B" in the street names.
```{r comment="", results='hold'}
i <- which(ois$id=="")
ois[c((i[1]-1):(i[1]+1),(i[2]-1):(i[2]+1)), 1:3]
```
Before 2013, the names of the pdf files combined the OIS id and the location. But smart quotes can cause problems in URLs and file names. This is what likely caused a problem here. I made some guesses about what the pdf file name must be and found pdfs for ["10-65 B and Ontario St.pdf"](http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/2010/10-65%20B%20and%20Ontario%20St.pdf) and ["10-76 A and Louden St.pdf"](http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/2010/10-76%20A%20and%20Louden%20St.pdf). So, the `id` right after OIS 10-60 is 10-65 and the `id` right after OIS 10-74 is 10-76. We can clean up the locations and the URLs too.
```{r comment="", results='hold'}
ois$id[1 + which(ois$id=="10-60")] <- "10-65"
ois$id[1 + which(ois$id=="10-74")] <- "10-76"
ois$location[ois$id=="10-65"] <- "B St and Ontario St"
ois$location[ois$id=="10-76"] <- "A St and Louden St"
ois$url[ois$id=="10-65"] <- "http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/2010/10-65 B and Ontario St.pdf"
ois$url[ois$id=="10-76"] <- "http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/2010/10-76 A and Louden St.pdf"
```
There are several pdfs that don't appear on the OIS site. With a little more work I found other links to them.
```{r comment="", results='hold'}
# make a list of the OISs with missing pdfs
listPDFmissing <- c(with(ois, id[id>="14-15" & id<="14-32"]),
with(ois, id[id>="15-02" & id<="15-37"]))
i <- match(listPDFmissing, ois$id)
ois$url[i] <- paste0("http://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/",listPDFmissing,".pdf")
# add the other two with URLs we already fixed
listPDFmissing <- c("10-76", "10-65", listPDFmissing)
# shooting 15-35 has a particularly odd URL
ois$url[ois$id=="15-35"] <- "https://www.phillypolice.com/assets/crime-maps-stats/officer-involved-shootings/PS15-35-summary-dj.doc.pdf"
```
Many of the locations have odd looking characters.
```{r comment="", results='hold'}
grep("&amp;", ois$location, value=TRUE)
```
HTML represents certain special characters with special codes. For example, "&amp;" is the ampersand, "&quot;" is a quote, and "&nbsp;" is a non-breaking space, a space that always keeps the word before and after the space on the same line. HTML special characters always start with a & and end with ;. Let's clean up these HTML codes and also remove any extra leading or trailing spaces from the location.
```{r comment="", results='hold'}
ois$location <- gsub("&quot;", "", ois$location)
ois$location <- gsub("&amp;", "and", ois$location)
ois$location <- gsub("^ *| *$", "", ois$location)
```
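If you run into other HTML entities, a more general option is to let an HTML parser decode them for you. Here is a quick sketch, assuming the `xml2` package is installed:
```{r eval=FALSE}
library(xml2)
# read_html() parses the fragment and xml_text() returns the decoded text
xml_text(read_html("<p>B St &amp; Ontario St</p>"))
```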
Lastly, let's reformat the dates using the `lubridate` package.
```{r comment="", results='hold'}
library(lubridate)
ois$date <- mdy(ois$date)
```
Now our data frame `ois` should have all its IDs, with properly formatted dates, clean locations, and correct URLs indicating the location of the pdf files with details about the shootings.
```{r comment="", results='hold'}
head(ois)
```
# Extracting data from pdf files
All of the OISs in 2012 and earlier are missing the incident dates.
```{r comment=""}
# pick the first two OISs in each year
a <- ois$id[ois$id<"13-00"]
a <- sapply(split(a, substr(a, 1, 2)), function(x) x[1:2])
subset(ois, id %in% sort(as.character(a)))[,1:3]
```
However, the pdf document describing each incident contains the date of the incident. For example, you can read in the pdf at [https://ois.sites.phillypolice.com/12-01/](https://ois.sites.phillypolice.com/12-01/) that the shooting was on January 1, 2012. Rather than reading all the pdf files and transcribing the dates, we are going to have R do all the work.
The package `pdftools` includes functions for exploring pdf files. Let's load the library and have a look at one of the pdf files describing incident 07-01.
```{r comment="", results='hold'}
library(pdftools)
a <- scan("https://ois.phillypolice.com/07-01", what="", sep="\n")
i <- grep("\\.pdf",a)[1]
pdfFilename0701 <- gsub('.*src="([^"]*)".*',"\\1", a[i])
pdfText0701 <- pdf_text(pdfFilename0701)
pdfText0701
```
`pdf_text()` extracts all the raw text from the pdf file. Right at the beginning of the file we can see a date, 1/1/07. That's what we want. Note that there are scattered `\r` or `\n` or both throughout. These are carriage return (`\r`) and line feed (`\n`) characters that signal the end of a line. The old printers would look for these characters to move the printer head back to the beginning of a line (carriage return) and advance the page to the next line (line feed). Nowadays, those same characters are still used to denote the end of a line. However, PCs use `\r\n`, Unix systems use `\n`, older Macs used `\r`, but Mac OS X adopted the Unix standard `\n`. Expect any of these combinations in data files.
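Because any of these line-ending conventions can show up, it is safer to split on any of the three. A small sketch of that idea:
```{r comment=""}
# split on \r\n, a lone \n, or a lone \r so any line-ending convention works
strsplit("line one\r\nline two\nline three\rline four", split="\r\n|\n|\r")[[1]]
```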
```{r comment=""}
a <- strsplit(pdfText0701, split = "\n")[[1]]
a
```
Now let's apply the `mdy()` function to `a`. Most of the text looks nothing like dates, so for those lines `mdy()` will just give us an `NA`. But for the properly formatted dates, we will get a date object back.
```{r comment=""}
a <- mdy(a)
a
```
In this case, only one of the lines had a proper date, but it is possible that some document might have more than one. So we will record just the first one using `sort(a)[1]`. By default, `sort()` tosses all the `NA`s.
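A tiny illustration of that behavior:
```{r comment="", warning=FALSE}
# mdy() returns NA for lines that are not dates; sort() drops the NAs,
# so the first element of the sorted result is the earliest valid date
sort(mdy(c("Police Officer #1", "3/15/2007", "1/1/2007")))[1]
```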
Let's put this all together. We will loop through each OIS, read its pdf file, and save the text in a new column in `ois`. OISs from 2015 and earlier have their text descriptions buried in a pdf file. From 2016 onward, the OIS description is in plain text on the web page. So first the hard part... extracting the text from the pdf files.
```{r message=TRUE}
ois$text <- NA
iOISwithPDFs <- grep("(07|08|09|10|11|12|13|14|15)-", ois$id)
for(i in iOISwithPDFs)
{
# first check if the OIS is on the list of missing PDFs
# earlier we had found an alternative URL for these
if(ois$id[i] %in% listPDFmissing)
{
pdfFilename <- ois$url[i]
} else
{
# scan the page to find the download pdf link
a <- scan(ois$url[i], what="", sep="\n")
# pdfs at sitescdn before 2014, switch to ois.sites in 2014
j <- grep("\\.pdf", a)[1]
pdfFilename <- gsub('.*src="([^"]*)".*',"\\1", a[j])
}
a <- pdf_text(pdfFilename)
# if more than one page, collapse all pages together
a <- paste(a, collapse="\n")
ois$text[i] <- a
}
```
The rest of the OISs have their data in plain text on the web pages, so we do not need to use `pdf_text()` to grab that information. Here is a for-loop that goes through all shootings from 2016 to 2020.
```{r comment=""}
iOISwithText <- grep("(16|17|18|19|20)-", ois$id)
for(i in iOISwithText)
{
a <- scan(ois$url[i], what="", sep="\n")
jStart <- grep("clearfix", a) + 1
jEnd <- grep("\\.entry-content", a) - 1
ois$text[i] <- paste(a[jStart:jEnd], collapse="\n")
}
```
Let's check a handful of the text fields to make sure we have what we think we should have there. We will use `substring()` to avoid printing out some of the very long text descriptions.
```{r comment=""}
i <- seq(1, nrow(ois), by=10)
substring(ois$text[i], 1, 40)
```
Now our `ois` data frame has columns: `r paste(names(ois), collapse=", ")`.
Right now the row for OIS 07-01 has a missing date.
```{r comment=""}
subset(ois, id=="07-01")[,1:3]
```
For any OIS that is missing the date, extract the date from the pdf text.
```{r comment="", warning=FALSE}
for(i in which(is.na(ois$date)))
{
# try to find dates, take the first valid date
a <- strsplit(ois$text[i], split = "\n")[[1]]
a <- sort(mdy(a))[1]
ois$date[i] <- as.character(a)
}
```
And let's check what happened to the date for OIS 07-01.
```{r comment=""}
subset(ois, id=="07-01")[,1:3]
```
And let's check that we have all valid dates.
```{r comment=""}
sum(is.na(ymd(ois$date)))
```
As another check, let's make sure that all the OISs with `id` starting with 07 have dates in 2007, OISs with `id` starting with 08 have dates in 2008, and so on.
```{r comment=""}
aggregate(year(date) ~ substr(id,1,2),
data=ois,
FUN=unique)
```
Yes, every incident's `id` matches with the year derived from the date. This does not guarantee that all the dates were extracted correctly, but it is a good check to catch any obvious errors.
Without having to transcribe dates from pdf files, we have now filled in all the dates!
# Geocoding the OIS locations
Our OIS data frame has the address for every incident, but to be more useful we really need the geographic coordinates. If we had the coordinates, then we could put them on a map, tabulate how many incidents occur within an area, calculate distances, and answer geographical questions about these data.
Geocoding is the process of converting a text description of a location (typically an address or intersection) into geographic coordinates (often longitude/latitude, but other coordinate systems are also possible). Google Maps currently reigns supreme in this area. Google Maps understands very general descriptions of locations. You can ask for the coordinates of something like "bobbys burger palace near UPenn" and it will understand that "UPenn" means the University of Pennsylvania and that "bobbys burger palace" is celebrity chef Bobby Flay's fast food burger joint. Unfortunately, as of June 2018 Google Maps requires a credit card in order to access its geocoding service. Previously, anyone could geocode up to 2,500 locations per day without needing to register. Also unfortunately, Bobby's Burger Palace [closed in 2019](https://www.thedp.com/article/2019/11/bobbys-burger-palace-closed-upenn-university-city).
These technologies are still rapidly evolving, so it is most important to learn how to learn to use these tools. We will use the OpenStreetMap geocoder and the ArcGIS geocoder to give you the sense of how to work with them.
Many web data sources use a standardized language for providing data. JSON (JavaScript Object Notation) is quite common and both OpenStreetMap and ArcGIS use JSON.
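Before sending requests to any web service, here is a minimal illustration of how a snippet of JSON maps onto an R object (the JSON below is hand-written and the values are made up):
```{r comment=""}
library(jsonlite)
# fromJSON() turns the JSON tags into named elements of an R object
fromJSON('{"lat": "39.9529", "lon": "-75.1932", "display_name": "McNeil Building (example)"}')
```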
The URL for OpenStreetMap has the form
`http://nominatim.openstreetmap.org/search/3718%20Locust%20Walk,%20Philadelphia,%20PA?format=json&addressdetails=1&limit=1`
You can see the address for Penn's McNeil Building embedded in this URL. Spaces need to be replaced with `%20` (20 is the hexadecimal ASCII code for the space character). Let's see what data we get back from this URL.
```{r comment="", results='hold'}
scan("http://nominatim.openstreetmap.org/search/3718%20Locust%20Walk,%20Philadelphia,%20PA?format=json&addressdetails=1&limit=1",
what="", sep="\n")
```
It is messy, but readable. You can see embedded in this text the `lat` and `lon` for this address. You can also see that it should not be too hard for a machine to extract these coordinates, and the rest of the information here, from this block of text. This is the point of JSON, producing data in a format that a human could understand in a small batch, but a machine could process quickly and easily.
Fortunately, the `jsonlite` R package facilitates the conversion of JSON text like this into convenient R objects.
```{r comment="", results='hold'}
library(jsonlite)
fromJSON("http://nominatim.openstreetmap.org/search/3718%20Locust%20Walk,%20Philadelphia,%20PA?format=json&addressdetails=1&limit=1")
```
`fromJSON()` converts the results from the OpenStreetMap geocoder to a row in a data frame. The JSON tags turn into column names and the values are placed as data in a row.
The major drawback of the OpenStreetMap geocoder is that it does not handle intersections. If we try to geocode "38th St and Walnut St", the results come back empty.
```{r comment="", results='hold'}
fromJSON("http://nominatim.openstreetmap.org/search/38th%20St%20and%20Walnut%20St,%20Philadelphia,%20PA?format=json&addressdetails=1&limit=1")
```
This is where the ArcGIS geocoder comes in handy. The ArcGIS geocoder also works with JSON but the URL construction is a little different. Here we get several results, but clearly the first one is the one that we want. It is also the one that has the highest `score`.
```{r comment="", results='hold'}
fromJSON("https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?f=json&singleLine=38th%20St%20and%20Walnut%20St,%20Philadelphia,%20PA&outFields=Match_addr,Addr_type")
```
To make geocoding using these two services a little more convenient, we can create two functions that automate the process of taking an address, filling in `%20` for spaces in the appropriate URL, and retrieving the JSON results from the geocoding service.
```{r comment="", results='hold'}
geocodeOSM <- function(address)
{
a <- gsub(" +", "\\%20", address)
a <- paste0("http://nominatim.openstreetmap.org/search/",
a,
"?format=json&addressdetails=0&limit=1")
return( fromJSON(a) )
}
geocodeARCGIS <- function(address)
{
a <- gsub(" +", "\\%20", address)
a <- paste0("https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?f=json&singleLine=",
a,
"&outFields=Match_addr,Addr_type")
return( fromJSON(a) )
}
```
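Replacing spaces with a regular expression works here because our addresses are simple. A more general option is base R's `URLencode()`, which percent-encodes other special characters too. Here is a sketch of that variation (the helper name `geocodeOSM2` is just for illustration):
```{r eval=FALSE}
geocodeOSM2 <- function(address)
{
  # URLencode() with reserved=TRUE percent-encodes spaces, commas, and other special characters
  a <- URLencode(address, reserved=TRUE)
  fromJSON(paste0("http://nominatim.openstreetmap.org/search/", a,
                  "?format=json&addressdetails=0&limit=1"))
}
```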
Let's test out `geocodeOSM()` by pulling up a map of the geocoded coordinates. Once we have the latitude and longitude for the McNeil Building, where we typically hold our crime data science courses at Penn, we can use `leaflet()` to show us a map of the area.
```{r comment="", message=FALSE, fig.cap="OSM geocoding result for 3718 Locust Walk"}
library(leaflet)
gcPenn <- geocodeOSM("3718 Locust Walk, Philadelphia, PA")
leaflet() |>
addTiles() |>
setView(lng=gcPenn$lon, lat=gcPenn$lat, zoom=18) |>
addCircleMarkers(lng=as.numeric(gcPenn$lon),
lat=as.numeric(gcPenn$lat))
```
`leaflet()` prepares the mapping process. `addTiles()` pulls in the relevant map image (buildings and streets). `setView()` takes the longitude and latitude from our `gcPenn` object, sets that as the center of the map, and zooms in to level "18," which is a fairly close zoom of about one block. `addCircleMarkers()` creates a circle at the selected point.
The map shows the western part of the Penn campus with 3718 Locust Walk (the square building near the center of the map). The geocoded point is close... but off by one building. Maybe the ArcGIS geocoder will do better.
```{r comment="", message=FALSE, fig.cap="ArcGIS geocoding result for 3718 Locust Walk"}
gcPenn <- geocodeARCGIS("3718 Locust Walk, Philadelphia, PA")
gcPenn <- gcPenn$candidates[1,]
gcPenn$lon <- gcPenn$location$x
gcPenn$lat <- gcPenn$location$y
leaflet() |>
addTiles() |>
setView(lng=gcPenn$lon, lat=gcPenn$lat, zoom=18) |>
addCircleMarkers(lng=gcPenn$lon,
lat=gcPenn$lat)
```
We are almost ready to throw all of our addresses at these geocoders, but let's first make sure the addresses look okay. Several OISs are missing locations.
```{r comment="", results='hold'}
i <- which(ois$location %in% c("","withheld","Withheld"))
ois$text[i]
```
Several of these text descriptions of the incidents contain the location information. Let's fill those in.
```{r comment="", results='hold'}
ois$location[ois$id=="16-18"] <- "3200 block of Wellington Street"
ois$location[ois$id=="10-06"] <- "Howard and Grange Street"
ois$location[ois$id=="08-06"] <- "200 block of Clapier Street"
ois$location[ois$id=="08-18"] <- "900 block of E. Slocum Street"
ois$location[ois$id=="08-30"] <- "700 block of W. Rockland Street"
ois$location[ois$id=="08-40"] <- "5400 Jefferson Street"
ois$location[ois$id=="08-60"] <- "3000 Memphis Street"
ois$location[ois$id=="08-70"] <- "1300 block of S. 29th Street"
ois$location[ois$id=="08-74"] <- "5600 block of N. Mascher Street"
```
While browsing the locations you might come across some that aren't quite right. This one, for example, needs fixing.
```{r comment="", results='hold'}
# This one was just "51st Arch"
ois$location[ois$id=="07-19"] <- "51st St and Arch"
```
The text of this incident indicates that Philadelphia PD officers were not involved in the shooting.
```{r comment="", results='hold'}
ois$text[ois$id=="17-08"]
ois <- subset(ois, id != "17-08")
```
Several of the addresses are of the form "5400 block of Erdick St". I want these to get geocoded to the middle of the block. So I'm going to change addresses like these to be like "5450 Erdick St".
```{r comment="", results='hold'}
# put "blocks" at the midpoint
ois$location <- gsub("00 block( of)?", "50", ois$location, ignore.case=TRUE)
ois$location <- gsub("unit bl(oc)?k( of)?", "50", ois$location, ignore.case=TRUE)
# and one additional cleanup "Rear Alley of 300 block of N. 55th Street"
ois$location <- gsub("Rear Alley of |near ", "", ois$location, ignore.case = TRUE)
```
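A quick spot check of this rewrite on one example address:
```{r comment=""}
# "5400 block of Erdick St" should become "5450 Erdick St"
gsub("00 block( of)?", "50", "5400 block of Erdick St", ignore.case=TRUE)
```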
Now let's run all of the addresses through the OpenStreetMap geocoder. We could have geocoded all these addresses with the simpler code `lapply(a, geocodeOSM)`. However, if the JSON connection to the OpenStreetMap website fails for even one of the addresses (likely if you have a poor internet connection), then the whole `lapply()` call fails. With the for-loop implementation, if the connection fails, then `gcOIS` still keeps all of the prior geocoding results and you can restart the for-loop at the point where it failed.
```{r comment="", results='hold'}
# add city and state to each address to improve geocoding accuracy
a <- paste0(ois$location,", Philadelphia, PA")
# create a list to store the geocoding results
gcOIS <- vector("list", nrow(ois))
for(i in 1:nrow(ois))
{
gcOIS[[i]] <- geocodeOSM(a[i])
if(length(gcOIS[[i]]) == 0)
{
cat("Could not geocode address #",i,":",a[i],"\n")
}
}
```
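If your internet connection is unreliable, a variation on this loop wraps each request in `try()` so a dropped connection is reported but does not halt the loop. Here is a sketch of that idea:
```{r eval=FALSE}
for(i in 1:nrow(ois))
{
  # try() catches a failed connection instead of stopping the loop
  res <- try(geocodeOSM(a[i]), silent=TRUE)
  if(inherits(res, "try-error"))
  {
    cat("Connection failed for address #", i, ":", a[i], "\n")
  } else
  {
    gcOIS[[i]] <- res
  }
}
```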
Geocoding with OSM failed on numerous addresses. Some of them predictably failed as they are formatted poorly (e.g. Harbison Ave -Phila, Philadelphia, PA). Many others failed because they are intersections and OpenStreetMap does not handle intersections (yet). So let's send the addresses that OpenStreetMap could not geocode to the ArcGIS geocoder.
```{r comment="", results='hold'}
for(i in which(lengths(gcOIS)==0))
{
gcOIS[[i]] <- geocodeARCGIS(a[i])
}
```
Now `gcOIS` has some of the geocoding results from OpenStreetMap and some from ArcGIS. The results are in different formats. The ArcGIS results have a component named `candidates`. So we'll use the presence/absence of a `candidates` component to figure out which geocoding service delivered the results. Then we can extract the longitude, latitude, and some additional features. The ArcGIS geocoder gives numerous results, but we will just take the top scoring one, which is the most likely match.
```{r comment="", results='hold'}
b <- lapply(gcOIS, function(x)
{
if(is.null(x$candidates)) # OSM
{
a <- data.frame(lon=as.numeric(x$lon),
lat=as.numeric(x$lat),
score=as.numeric(x$importance),
loctype=paste(x$class, x$type, sep=":"),
method="osm",
addressGeo=x$display_name)
} else # ArcGIS
{
a <- data.frame(lon=x$candidates$location[1,"x"],
lat=x$candidates$location[1,"y"],
score=x$candidates$score[1],
loctype=x$candidates$attributes$Addr_type[1],
method="arcgis",
addressGeo=x$candidates$attributes$Match_addr[1])
}
return(a)
})
gcOIS <- do.call(rbind, b)
# add a column containing the original address
gcOIS <- cbind(gcOIS, addressOrig = a)
# add an ID column
gcOIS <- cbind(id=ois$id, gcOIS)
head(gcOIS)
```
Now it appears that we have longitude and latitude for every incident. We should check that they all look sensible.
```{r comment="", results='hold'}
stem(gcOIS$lat)
stem(gcOIS$lon)
```
All the points have latitude between 39 and 40 and longitude around -75.
There's no perfect way to check the geocoding results aside from checking each individual point and verifying with a map that it lands on the right spot. We can do things like check for strange outliers, and we can spot check a few locations on a map.
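Before the spot checks, a quick sanity check is to flag any point that falls outside a rough bounding box around Philadelphia (the box coordinates below are approximate):
```{r comment=""}
# flag any geocoded point outside a rough bounding box around Philadelphia
iOut <- which(gcOIS$lon < -75.29 | gcOIS$lon > -74.95 |
              gcOIS$lat <  39.85 | gcOIS$lat >  40.15)
gcOIS[iOut, c("id","addressOrig","lon","lat")]
```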
```{r comment="", message=FALSE, fig.cap=gcOIS$addressOrig[i]}
i <- which(ois$id=="18-01")
leaflet() |>
addTiles() |>
setView(lng=gcOIS$lon[i], lat=gcOIS$lat[i], zoom=18) |>
addCircleMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i]) |>
addLabelOnlyMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i],
label = gcOIS$addressOrig[i],
labelOptions = labelOptions(noHide = TRUE,
direction = 'top',
textOnly = TRUE))
```
```{r comment="", message=FALSE, fig.cap=gcOIS$addressOrig[i]}
i <- which(ois$id=="13-14")
leaflet() |>
addTiles() |>
setView(lng=gcOIS$lon[i], lat=gcOIS$lat[i], zoom=18) |>
addCircleMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i]) |>
addLabelOnlyMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i],
label = gcOIS$addressOrig[i],
labelOptions = labelOptions(noHide = TRUE,
direction = 'top',
textOnly = TRUE))
```
```{r comment="", message=FALSE, fig.cap=gcOIS$addressOrig[i]}
i <- which(ois$id=="07-13")
leaflet() |>
addTiles() |>
setView(lng=gcOIS$lon[i], lat=gcOIS$lat[i], zoom=18) |>
addCircleMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i]) |>
addLabelOnlyMarkers(lng=gcOIS$lon[i], lat=gcOIS$lat[i],
label = gcOIS$addressOrig[i],
labelOptions = labelOptions(noHide = TRUE,
direction = 'top',
textOnly = TRUE))
```
The first of these was geocoded correctly. I looked up 2850 Kensington in Google Maps to verify its location. In the second map you can see Arrott St and Frankford Ave, but the dot is not at their intersection. The address shown in the map title is missing an "and" between "Arrott Street" and "Frankford Avenue". So where did the geocoder place these points?
```{r comment="", results='hold'}
subset(gcOIS, id %in% c("18-01","13-14","07-13"))
```
Note that the `loctype` for the first address is "place:house" and `loctype` for the other two are "railway:station" and "landuse:industrial". "place:house" signals that the geocoding was accurate to a very specific location. But the others are problematic. OpenStreetMap did not understand that we were looking for intersections and instead geocoded to a transit center and an industrial location.
Ideally we want geocoding to get us to a house, intersection, or address. If `loctype` is residential, locality, city, or streetname, then this indicates that the geocoding did not get us to a very specific location. The 13th and Pike Streets geocoding seemed to fail completely.
```{r comment="", results='hold'}
sort(table(gcOIS$loctype))
```
Several of these are very specific locations (office, shop, building, station, house, PointAddress, StreetAddress, StreetInt). Many others are not specific at all (highway, neighborhood, city, Locality, StreetName). Each of these needs to be revisited. Let's examine one group here: those geocoded only down to `loctype="StreetName"`.
```{r comment="", results='hold'}
subset(gcOIS, loctype=="StreetName")
```
We can see that the problem for many of these is simply bad addresses. Some are intersections that are missing the "and" between the two streets. Let's dig into the incident at Harbison Ave to see if we can learn more.
```{r comment="", results='hold'}
ois$text[ois$location=="Harbison Ave -Phila"]
```
The text describes the path of the suspect moving south on Torresdale and turning west on Bridge. Let's pull up a map of this area. I've overlaid the path of the suspect on top of the map. The officers describe the suspect moving "south" on Torresdale and "west" on Bridge because I-95 and the Delaware River are just south of this area and run more east-west here. So, thinking that I-95 runs north-south and that the Delaware River is the east boundary of Philadelphia, officers may describe someone running "west" on Bridge St as running away from I-95 and the Delaware River.
```{r comment="", results='hold', message=FALSE, fig.cap="Path of individual involved in OIS 12-20"}
leaflet() |>
addTiles() |>
setView(lng=-75.0691987, lat=40.0138755, zoom=18) |>
addPolylines(lng=c(-75.068621,-75.070423,-75.070594),
lat=c( 40.013570, 40.013184, 40.013710),
color = "red")
```
It took some investigation, but we can fix this address too. Here's how we can fix the addresses that we've identified fixes for so far (the rest are left as an exercise). `textConnection()` is a nice trick for making a small data frame right inside a script. I'm going to make two columns, one with the original address and one with the corrected or improved address.
```{r comment="", message=FALSE}
a <- textConnection(
"addressOrig,addressFix
Arrott Street Frankford Avenue,Arrott Street and Frankford Avenue
13th and Pike Streets,13th and Pike St
Harbison Ave -Phila,Bridge St and Ditman St")
a <- read.csv(a)
# make sure the original addresses match, no NAs!
i <- match(a$addressOrig, ois$location)
i
b <- lapply(paste0(a$addressFix,", Philadelphia, PA"), geocodeARCGIS)
b <- lapply(b, function(x)
{
data.frame(lon=x$candidates$location[1,"x"],
lat=x$candidates$location[1,"y"],
score=x$candidates$score[1],
loctype=x$candidates$attributes$Addr_type[1],
method="arcgis",
addressGeo=x$candidates$attributes$Match_addr[1],
stringsAsFactors = FALSE)
})
b <- do.call(rbind, b)
gcOIS[i,names(b)] <- b
gcOIS[i,]
```
These appear to be fixed. `addressGeo` appears correct and the `loctype` is now `StreetInt`. All good signs.
Let's create new longitude/latitude columns in our `ois` data frame so that one object contains all of our essential data.
```{r comment="", results='hold', message=FALSE}
ois$lon <- gcOIS$lon
ois$lat <- gcOIS$lat
```
We'll proceed as if we're satisfied with our geocoding even though you know you have more work to do to fix some of those non-specific geocoding problems. We close this section with a map of the city of Philadelphia and the locations of all officer involved shootings.
```{r comment="", results='hold', message=FALSE, fig.cap="All Philadelphia Officer-involved Shootings", cache=FALSE}
leaflet() |>
addTiles() |>
fitBounds(-75.288486,39.868285,-74.950965,40.138251) |>
addCircles(lng=gcOIS$lon, lat=gcOIS$lat, color="red",
popup = paste(gcOIS$id, gcOIS$addressOrig))
```
# Working with shapefiles and coordinate systems
The Philadelphia Police Department divides the city into Police Service Areas (PSAs). The city provides a *shapefile*, a file containing geographic data, that describes the boundaries of the PSAs at Philadelphia's [open data site](https://www.opendataphilly.org/dataset/police-service-areas). R can read these files using the `st_read()` function provided in the `sf` (simple features) package. Even though `st_read()` appears to only be accessing `Boundaries_PSA.shp`, you should have all of the `Boundaries_PSA` files in your `10_shapefiles_and_data` folder. The other files have information that `st_read()` needs, like the coordinate system stored in `Boundaries_PSA.prj`. If you do not have all `Boundaries_PSA` files in your folder, then in a few lines you will get errors like "the sfc object should have crs set," meaning that the Coordinate Reference System (CRS) is missing.
```{r comment="", results='hold', message=FALSE}
library(sf)
PPDmap <- st_read("10_shapefiles_and_data/Boundaries_PSA.shp")
```
`PPDmap` is an `sf` (simple features) object. It is not unlike a data frame, but it can contain a column containing geographic information associated with a row of other data. Here are the two columns in `PPDmap` that are of primary interest.
```{r comment="", results='hold', message=FALSE}
PPDmap[,c("PSA_NUM","geometry")]
```
The first column shows the PSA number and the second column shows a truncated description of the geometry associated with this row. In this case, `geometry` contains the coordinates of the boundary of the PSA for each row. Use `st_geometry()` to extract the polygons to make a plot.
```{r comment="", results='hold', message=FALSE}
plot(st_geometry(PPDmap))
axis(side=1) # add x-axis
axis(side=2) # add y-axis
# extract the center points of each PSA
a <- st_coordinates(st_centroid(st_geometry(PPDmap)))
# add the PSA number to the plot
text(a[,1], a[,2], PPDmap$PSA_NUM, cex=0.5)
```
We can extract the actual coordinates of one of the polygons if we wish.
```{r comment="", message=FALSE}
a <- st_coordinates(PPDmap$geometry[1])
head(a)
tail(a)
```
And we can use those coordinates to add additional features to our plot.
```{r comment="", message=FALSE}
plot(st_geometry(PPDmap))
axis(side=1)
axis(side=2)
a <- st_coordinates(st_centroid(st_geometry(PPDmap)))
text(a[,1], a[,2], PPDmap$PSA_NUM, cex=0.5)
a <- st_coordinates(PPDmap$geometry[1])
lines(a[,1], a[,2], col="red", lwd=3)
```
This highlights in red PSA 77, at the southern end of Philadelphia. Note that R issued some warnings about our centroid calculations. We will return to those in a moment.
Rather than extracting coordinates to add a feature to a plot, `subset()` provides an easier method.
```{r comment="", message=FALSE}
plot(st_geometry(PPDmap))
axis(side=1)
axis(side=2)
a <- st_coordinates(st_centroid(st_geometry(PPDmap)))
text(a[,1], a[,2], PPDmap$PSA_NUM, cex=0.5)
plot(st_geometry(subset(PPDmap, PSA_NUM=="077")),
add=TRUE, border="red")
plot(st_geometry(subset(PPDmap, PSA_NUM=="183")),
add=TRUE, border="green")
```
Setting `add=TRUE` in the last two calls to `plot()` asks R to overlay the current plot with these additional objects.
Now, back to those warnings we received about calculating centroids with longitude/latitude data. Geographic datasets that describe locations on the surface of the earth have a "coordinate reference system" (CRS). Let's extract the CRS for `PPDmap`.
```{r comment="", results='hold', message=FALSE}
st_crs(PPDmap)
```
The coordinate system used to describe the PPD boundaries is the World Geodetic System 1984 (WGS84) maintained by the United States National Geospatial-Intelligence Agency, one of several standards to aid in navigation and geography. The European Petroleum Survey Group (EPSG) maintains a catalog of different coordinate systems (it should be no surprise that oil exploration has driven the development of high quality geolocation standards). They have assigned the standard longitude/latitude coordinate system to be [EPSG4326](http://spatialreference.org/ref/epsg/4326/). You can find the full collection of coordinate systems at [spatialreference.org](http://spatialreference.org/ref/epsg/). You can see in the output above a reference to EPSG 4326.
Many of us are comfortable with the longitude/latitude angular coordinate system. However, the distance covered by a degree of longitude shrinks as you move towards the poles and only equals the distance covered by a degree of latitude at the equator. In addition, the earth is not perfectly spherical, so the coordinate system used for computing distances on the earth's surface might need to depend on where on the earth you are.
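To make that concrete, here is a quick check with `st_distance()` (the `sf` package is already loaded): at Philadelphia's latitude a degree of longitude spans roughly 85 km, compared with about 111 km at the equator.
```{r comment=""}
# distance covered by one degree of longitude at 40 degrees north versus at the equator
p <- st_sfc(st_point(c(-76, 40)), st_point(c(-75, 40)),
            st_point(c(-76,  0)), st_point(c(-75,  0)), crs=4326)
st_distance(p[1], p[2])   # about 85 km at Philadelphia's latitude
st_distance(p[3], p[4])   # about 111 km at the equator
```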
Almost all web mapping tools (Google Maps, ESRI, OpenStreetMap) use the pseudo-Mercator projection ([EPSG3857](http://spatialreference.org/ref/epsg/3857/)). Let's convert our PPD map to that coordinate system.
```{r comment="", results='hold', message=FALSE}
PPDmap <- st_transform(PPDmap, crs=3857)
st_crs(PPDmap)
```
The CRS now indicates that this is a Mercator projection with distance measured in meters (`UNIT["Meter",1]`). Now if we ask for the centroids of the PSAs, we get more accurate centroids and no warnings from R.
```{r comment="", results='hold', message=FALSE}
st_centroid(st_geometry(PPDmap))
```
We can use the same projection, but modify it so that distances are measured in feet. You should see that the `UNIT[]` has changed to Foot_US.
```{r comment="", results='hold', message=FALSE}
PPDmap <- st_transform(PPDmap, crs=st_crs("+init=epsg:3857 +units=us-ft"))
st_crs(PPDmap)
```
There is a special coordinate system for every part of the world. A useful coordinate system for the Philadelphia area is [EPSG2272](http://spatialreference.org/ref/epsg/2272/). Let's convert our PPD map to that coordinate system.
```{r comment="", results='hold', message=FALSE}
PPDmap <- st_transform(PPDmap, crs=2272)
st_crs(PPDmap)
```
This coordinate system is the Lambert Conic Conformal (LCC). This particular projection of the `PPDmap` is tuned to provide good precision for the southern part of Pennsylvania (note that the standard parallels are at latitudes in southern Pennsylvania and the central meridian is a little west of Philadelphia) and distances are measured in feet (note the `UNIT` tag in the CRS description).
Let's transform back to longitude/latitude. It really is best to work in a projected coordinate system, but I'm going to stick with longitude/latitude so that the values make a little more sense to us. Also, at the scale of Philadelphia, we're just using the centroid calculation to figure out where to put labels. `st_transform()` does allow you to just use the EPSG code by itself.
```{r comment="", results='hold', message=FALSE}
PPDmap <- st_transform(PPDmap, crs=4326)
```
Now both the OIS coordinates and the PPD polygons are on the same scale.
```{r comment="", results='hold', message=FALSE}
plot(st_geometry(PPDmap), axes=TRUE)
points(lat~lon, data=gcOIS, col=rgb(1,0,0,0.5), pch=16)
```
To make the dots a little transparent, I've used the `rgb()` function with which you can mix red, green, and blue colors and set the transparency. The `1` tells `rgb()` to use maximum red. The two `0`s tell `rgb()` to use no green or blue. The 0.5 tells `rgb()` to make the dots halfway transparent.
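You can see the color code that `rgb()` produces directly:
```{r comment=""}
# full red, no green, no blue, 50% opacity
rgb(1, 0, 0, 0.5)
```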
# Spatial joins
A spatial join is the process of linking two data sources by their geography. For the OIS data, we want to know how many OISs occurred in each PSA. To do this we need to drop each OIS point location into the PSA polygons and have R tell us in which polygon each OIS landed.
First we need to convert our `ois` data frame to an `sf` object, communicating to R that the `lon` and `lat` columns are special. At this stage we also have to communicate what coordinate system the `lon` and `lat` values use. `st_as_sf()` converts an R object into an `sf` object.
```{r comment="", results='hold', message=FALSE}
ois <- st_as_sf(ois,
coords=c("lon","lat"),
crs=4326)
# check that everything looks okay
plot(st_geometry(PPDmap), axes=TRUE)
plot(st_geometry(ois), add=TRUE, col=rgb(1,0,0,0.5), pch=16)
```
`st_join()` will match each row in `ois` to the PSA polygon in which it falls. I just want the `PSA_NUM` column out of `PPDmap`.
```{r comment="", results='hold', message=FALSE}
PSAlookup <- st_join(ois, PPDmap[,"PSA_NUM"])
PSAlookup[1:3, c("id","date","location","PSA_NUM","geometry")]
```
Now our `PSAlookup` contains everything from `ois` but also adds a new column `PSA_NUM`.
Let's examine the PSAs with the most OISs and highlight them on the map.
```{r comment="", message=FALSE}
a <- rev(sort(table(PSAlookup$PSA)))
a
plot(st_geometry(PPDmap), axes=TRUE)
# plot the OISs in the PSA with the most OISs
i <- which(PSAlookup$PSA_NUM==names(a)[1])
plot(st_geometry(PSAlookup[i,]), add=TRUE, col=rgb(0,1,0,0.5), pch=16)
# plot the OISs in the PSA with the second most OISs
i <- which(PSAlookup$PSA_NUM==names(a)[2])
plot(st_geometry(PSAlookup[i,]), add=TRUE, col=rgb(0,0,1,0.5), pch=16)
```
Let's identify which OISs occurred in the same PSA as the University of Pennsylvania. We've already geocoded Penn and have its coordinates.
```{r comment="", message=FALSE}
gcPenn
# map them to the spatial coordinates
st_join(st_as_sf(gcPenn,
coords=c("lon","lat"),
crs=4326),
PPDmap)
```
Now we see that Penn is in PSA 183 and we can highlight those points on the map.
```{r comment="", message=FALSE}
plot(st_geometry(PPDmap), axes=TRUE)
i <- which(PSAlookup$PSA_NUM=="183")
plot(st_geometry(PSAlookup[i,]), add=TRUE, col=rgb(1,0,0,0.5), pch=16)
```
Lastly, we will tabulate the number of OISs in each PSA and color the map by the number of OISs.
```{r comment="", message=FALSE}
# how many shootings in each PSA?
a <- table(PSAlookup$PSA_NUM)
a
# merge the shooting count into the PPDmap data
i <- match(PPDmap$PSA_NUM, names(a))
PPDmap$nShoot <- a[i]
PPDmap[1:3,]
```
We can see that `PPDmap` now has a new `nShoot` column. A histogram will show what kinds of counts we observe in the PSAs.
```{r comment="", message=FALSE}
hist(a, xlab="Number of OISs", ylab="Number of PSAs", main="")
```
Let's discretize the OIS counts into a few categories.
```{r comment="", message=FALSE}
a <- cut(PPDmap$nShoot,
breaks=c(0,1,5,10,15,20,25,30),
right=FALSE)
a
```
`cut()` converts all of the individual counts into categories, like [1,5) or [25,30). For each of these categories we will associate a color for the map. `heat.colors()` will generate a sequence of colors in the yellow, orange, red range.
```{r comment="", message=FALSE}
col <- rev(heat.colors(7,1))
col
```
These are eight-digit codes (after the leading #) describing the color. The first two digits correspond to red, digits three and four correspond to green, digits five and six correspond to blue, and the last two digits correspond to transparency. These are hexadecimal numbers (base 16). Hexadecimal numbers use the digits 0-9, like normal decimal system numbers, and then denote 10 as A, 11 as B, on up to 15 as F. So FF as a decimal is $15 \times 16 + 15 = 255$, which is the maximum value for a two-digit hexadecimal number. The hexadecimal 80 as a decimal is $8 \times 16 + 0 = 128$, which is in the middle of the range 0 to 255. So the first color code, FFFF80FF, means maximum red, maximum green, half blue, and not transparent at all. This mixture is known more commonly as "yellow".
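You can check these conversions in R with `strtoi()`, which converts text in a given base to an integer:
```{r comment=""}
strtoi("FF", base=16)   # 255, the maximum two-digit hexadecimal value
strtoi("80", base=16)   # 128, about the middle of the range 0 to 255
```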
Now we need to select the right color for each PSA. If I apply `as.numeric()` to `a`, then all the [0,1) will convert to 1, the [1,5) will convert to 2, and so on up to [25,30) converting to 7. So `col[as.numeric(a)]` will pick out the right color for each PSA. Now create the map coloring each PSA using the right color.
```{r comment="", message=FALSE}
plot(st_geometry(PPDmap), col=col[as.numeric(a)], border=NA)
# add the number of shootings to the map
a <- st_coordinates(st_centroid(PPDmap))
text(a[,1], a[,2], PPDmap$nShoot, cex=0.5)
```
Those PSAs with the fewest shootings are a very pale yellow. As we examine PSAs with a greater number of OISs, their colors get redder and redder.
# Summary
We started with just a web page linking to a collection of pdf files. We used regular expressions to extract everything we could from the web page tables. We had R "read" the pdf files to extract the dates that were not readily available. We geocoded the shooting locations so that we could put them on a map. Finally, we could tabulate by PSA the number of OISs and map those as well.
If you've worked through all of this, then I would recommend that you save your objects, using `save(ois, PSAlookup, gcOIS, file="PPDOIS.RData")`. That way you will not have to scrape everything off the web again or redo any geocoding.
# Exercises
`r .exNum('Revisit the geocoding section discussing geocoding errors. Examine the OISs that have not been geocoded to specific locations. Fix their addresses and redo the geocoding of these OISs to improve the accuracy of the data.')`
`r .exNum('Identify officer-involved shootings that resulted in the offender being transported to the Hospital at the University of Pennsylvania. Create a map marking the location of HUP, the location of officer-involved shootings resulting in the offender being transported to HUP, and the locations of all other shootings.')`
`r .exNum('For each shooting determine which hospital treated the offender. Use @@st_distance()@@ to determine what percentage of those shot in an OIS went to the closest hospital.')`