---
title: "Introduction to R"
author:
- affiliation: University of Pennsylvania
  email: [email protected]
  name: Greg Ridgeway
- affiliation: University of Pennsylvania
  email: [email protected]
  name: Ruth Moyer
- affiliation: University of Pennsylvania
  email: [email protected]
  name: Li Sian Goh
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    css: htmlstyle.css
---
<!-- HTML YAML header Ctrl-Shift-C to comment/uncomment -->
<!-- --- -->
<!-- title: "Introduction to R" -->
<!-- author: -->
<!-- - Greg Ridgeway ([email protected]) -->
<!-- - Ruth Moyer ([email protected]) -->
<!-- date: "`r format(Sys.time(), '%B %d, %Y')`" -->
<!-- output: -->
<!-- pdf_document: -->
<!-- latex_engine: pdflatex -->
<!-- html_document: default -->
<!-- fontsize: 11pt -->
<!-- fontfamily: mathpazo -->
<!-- --- -->
<!-- PDF YAML header Ctrl-Shift-C to comment/uncomment -->
<!-- A function for automating the numbering and wording of the exercise questions -->
```{r echo=FALSE}
.counterExercise <- 0
.exerciseQuestions <- NULL
.exNum <- function(.questionText="")
{
   .counterExercise <<- .counterExercise+1
   .exerciseQuestions <<- c(.exerciseQuestions, .questionText)
   .questionText <- gsub("@@", "`", .questionText)
   return(paste0(.counterExercise,". ",.questionText))
}
```
# Introduction
This is the first set of notes for an introduction to R programming for criminology and criminal justice. These notes assume that you have the latest version of R and R Studio installed. We also assume that you know how to start a new script file and submit code to the R console. From that basic knowledge about using R, we are going to start with `2+2`, and by the end of this set of notes you will load in a small Chicago crime dataset, create a few plots, count some crimes, and be able to subset the data. Our aim is to lay a firm foundation that we will build on throughout these notes.
R's built-in help can often show you how to do something, such as choosing the right function or figuring out what the syntax of a line of code should be. Let's say we're stumped as to what the `sqrt()` function does. Just type `?sqrt` at the R prompt to read the documentation on `sqrt()`. Most help pages have examples at the bottom that can give you a better idea of how the function works. R has over 7,000 functions and an often seemingly inconsistent syntax. As you do more complex work with R (such as using new packages), the Help tab can be useful.
# Basic Math and Functions in R
R, on a very unsophisticated level, is like a calculator.
```{r comment="", results='hold'}
2+2
1*2*3*4
(1+2+3-4)/(5*7)
sqrt(2)
(1+sqrt(5))/2 # golden ratio
2^3
log(2.718281828)
round(2.718281828,3)
12^2
factorial(4)
abs(-4)
```
# Combining values together into a collection (or vector)
We will use the `c()` function a lot. `c()` *c*ombines elements, like numbers and text to form a vector or a collection of values. If we wanted to combine the numbers 1 to 5 we could do
```{r comment=""}
c(1,2,3,4,5)
```
With the `c()` function, it's important to separate all of the items with commas.
Conveniently, if you want to add 1 to each item in this collection, there's no need to add 1 like `c(1+1,2+1,3+1,4+1,5+1)`... that's a lot of typing. Instead R offers the shortcut
```{r comment=""}
c(1,2,3,4,5)+1
```
In fact, you can apply any mathematical operation to each value in the same way.
```{r comment="", results='hold'}
c(1,2,3,4,5)*2
sqrt(c(1,2,3,4,5))
(c(1,2,3,4,5)-3)^2
abs(c(-1,1,-2,2,-3,3))
```
Note in the examples below that you can also have a collection of non-numerical items. When combining text items, remember to use quotes around each item.
```{r comment="", results='hold'}
c("CRIM600","CRIM601","CRIM602","CRIM603")
c("yes","no","no",NA,NA,"yes")
```
In R, `NA` means a missing value. We'll do more exercises later using data containing some `NA` values. In any dataset, you're virtually guaranteed to find some NAs. The function `is.na()` helps determine whether there are any missing values (any NAs). In some of the problems below, we'll use `is.na()`.
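As a quick sketch of how `is.na()` behaves, here it is applied to the yes/no collection from above. The object name `responses` is made up for this illustration.

```r
# responses is a hypothetical name; the values are the yes/no example from above
responses <- c("yes","no","no",NA,NA,"yes")
is.na(responses)       # TRUE in each position holding a missing value
any(is.na(responses))  # is at least one value missing?
sum(is.na(responses))  # how many are missing? (each TRUE counts as 1)
```

Here `is.na()` flags the fourth and fifth positions, so the count of missing values is 2.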
You can use double quotes or single quotes in R as long as you are consistent. When you have quotes inside the text you need to be particularly careful.
```{r comment="", results='hold'}
"Lou Gehrig's disease"
'The officer shouted "halt!"'
```
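When R prints the second example, it displays the text wrapped in double quotes with a backslash in front of each inner double quote, as in `"The officer shouted \"halt!\""`. Those backslashes "protect" the double quote, communicating to you and to R that the next double quote is not the end of the text, but a character that is actually part of the text you want to keep.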
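To see that the backslashes are only a way of writing and displaying the text, here is a small sketch that builds the same text two ways. The names `shout1` and `shout2` are made up for this example.

```r
# Two ways to write the same text; shout1 and shout2 are made-up names
shout1 <- 'The officer shouted "halt!"'    # single quotes outside, no backslashes needed
shout2 <- "The officer shouted \"halt!\""  # double quotes outside, inner quotes protected
identical(shout1, shout2)  # TRUE, the stored text is the same
print(shout2)              # print() displays the backslashes
cat(shout2, "\n")          # cat() shows the text without them
nchar(shout2)              # 27, the backslashes are not counted as characters
```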
The `c()` function isn't the only way to make a collection of values in R. For example, placing a `:` between two numbers can return a collection of numbers in sequence. The functions `rep()` and `seq()` produce repeated values or sequences.
```{r comment="", results='hold'}
1:10
5:-5
c(1,1,1,1,1,1,1,1,1,1)
rep(1,10)
rep(c(1,2),each=5)
seq(1, 5)
seq(1, 5, 2)
```
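Both `seq()` and `rep()` accept named options that change what they produce. The sketch below shows a few that come up often. The particular numbers and letters are arbitrary choices for illustration.

```r
seq(0, 1, by=0.25)          # step size given by name: 0.00 0.25 0.50 0.75 1.00
seq(0, 100, length.out=5)   # ask for 5 evenly spaced values: 0 25 50 75 100
rep(c("a","b"), times=3)    # repeat the whole collection: a b a b a b
rep(c("a","b"), each=3)     # repeat each value in turn:   a a a b b b
```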
R will also do arithmetic with two vectors, doing the calculation pairwise. The following will compute 1+11 and 2+12 up to 10+20.
```{r comment=""}
1:10 + 11:20
```
Yet, other functions operate on the whole collection of values in a vector. See the following examples:
```{r comment="", results='hold'}
sum(c(1,10,3,6,2,5,8,4,7,9)) # sum
length(c(1,10,3,6,2,5,8,4,7,9)) # how many?
cumsum(c(1,10,3,6,2,5,8,4,7,9)) # cumulative sum
mean(c(1,10,3,6,2,5,8,4,7,9)) # mean of collection of 10 numbers
median(c(1,10,3,6,2,5,8,4,7,9)) # median of same population
```
There are also some functions in R that help us find the biggest and smallest values. For example:
```{r comment="", results='hold'}
max(c(1,10,3,6,2,5,8,4,7,9)) # what is the biggest value in vector?
which.max(c(1,10,3,6,2,5,8,4,7,9)) # in which "spot" would we find it?
min(c(1,10,3,6,2,5,8,4,7,9)) # what is the smallest value in vector?
which.min(c(1,10,3,6,2,5,8,4,7,9)) # in which "spot" would we find it?
```
Many functions in R help you see and understand what's in a dataset. For example, we can rearrange a collection of values in ascending or descending order with `sort()`. Also note the `order()` function. How is it similar to the `which.max()` or `which.min()` function?
```{r comment="", results='asis'}
sort(c(1,10,3,6,2,5,8,4,7,9))
rev(c(1,10,3,6,2,5,8,4,7,9))
rev(sort(c(1,10,3,6,2,5,8,4,7,9)))
sort(c(1,10,3,6,2,5,8,4,7,9),decreasing=TRUE)
order(c(1,10,3,6,2,5,8,4,7,9)) # where is the ith smallest number?
rank(c(1,100,3,20)) # how does each value rank compared to others?
```
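One way to see how `order()` relates to `sort()`, `which.min()`, and `which.max()` is the sketch below. It uses square brackets to pick out values by position, which the Indexing section later in these notes covers in detail. The names `x` and `pos` are made up for this example.

```r
x <- c(1,10,3,6,2,5,8,4,7,9)    # same numbers as above
pos <- order(x)                 # positions of the smallest, 2nd smallest, ... values
pos
x[pos]                          # picking out values in that order reproduces sort(x)
identical(x[pos], sort(x))      # TRUE
pos[1] == which.min(x)          # TRUE, the first entry is where the minimum sits
pos[length(x)] == which.max(x)  # TRUE, the last entry is where the maximum sits
```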
The above examples have involved mostly numerical values in a vector. Here are some examples involving non-numerical "character" values. Let's create an object called `my.states` (a name I made up) that will contain the postal codes of places in which I've lived or worked.
```{r comment="", results='hold'}
my.states <- c("WA","DC","CA","PA","MD","VA","OH")
```
Take a look at the arrow `<-` (pronounced 'gets'). This is how you tell R to take the result of what is on the right and store it in an object named on the left. We're going to talk more about this arrow soon. Now let's run some new functions on this collection of postal codes.
```{r comment="", results='hold'}
nchar(my.states)
paste(my.states, ", USA")
paste(my.states, ", USA", sep="")
paste0(my.states, ", USA")
paste(my.states, collapse=",")
```
The `nchar()` function counts how many characters are in each character string. The `paste()` function pastes character strings together. By default, `paste()` puts a space between the strings being pasted together. It looks strange with that space after WA in "WA , USA". We can set the separator to be nothing (the empty string) by setting `sep=""`. `paste0()` is a shortcut for pasting with `sep=""`. Setting `collapse=","` combines all the text together, collapsing it into one string with a comma as a separator.
## Exercises
`r .exNum("Print all even numbers less than 100")`
`r .exNum("What is the mean of even numbers less than 100")`
`r .exNum('Have R put in alphabetical order \x60c("WA","DC","CA","PA","MD","VA","OH")\x60')`
# Assignment of values to variables
The left-facing arrow symbol is an extremely important tool in R. Try the following:
```{r comment="", results='hold'}
a <- 1
```
Now type:
```{r comment="", results='hold'}
a
```
R has assigned `a` the value 1. Here are more examples:
```{r comment="", results='hold'}
b <- 2+2
a <- a+b
a <- 1:10
b <- 2*a
a+b
sd(a)
state.names <- c("WV","OH","OK","NV","CA","IN","MA","MI","IL","IA","SC","NH",
"LA","GA","CT","WI","CO","NY","UT","AK","MS","AL","OR","MT",
"ND","WY","FL","ME","AZ","TN","PA","MN","NM","SD","MO","RI",
"HI","WA","DE","NJ","NE","KY","AR","TX","NC","MD","VA","VT",
"KS","ID","DC")
```
R programmers typically pronounce the `<-` as "gets". So we would read `a <- 1` as "a gets one".
# Indexing
We can extract items from a vector, matrix, or data frame using indexing. In R, we use square brackets to index.
```{r comment="", results='hold'}
state.names[1] # get the first state
state.names[1:3] # get the first three states
state.names[c(1,5,9)] # get states 1, 5, and 9
state.names[2*(1:25)] # get the states in even-numbered positions
```
If you put a negative number inside the `[]`, this will communicate to R to remove that item from the collection. Let's remove DC from `state.names` since it is not one of the 50 states. Since it is the 51st item in `state.names` we can remove it like this
```{r comment="", results='hold'}
state.names[-51]
```
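Note that a negative index only shows the collection without that item. The stored object does not change unless the result is assigned back. Here is a minimal sketch using a short made-up collection called `codes`.

```r
codes <- c("PA","NJ","DC","DE")   # made-up short collection
codes[-3]                         # shows the collection without the third item
codes                             # unchanged, "DC" is still there
codes <- codes[-3]                # assign the result back to actually drop it
codes                             # "PA" "NJ" "DE"
codes[-c(1,2)]                    # several negative indices drop several items
```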
Let's combine the sort and order functions from above (along with variable assignment) with the concept of indexing.
```{r comment="", results='hold'}
sort(state.names)[1] # sort, then give the first value
i <- order(state.names) # index the states in order
i[1:3] # which positions are the first three
state.names[i[1:3]] # show me those three states
```
Note that in the last example we used square brackets within square brackets. First, we asked R to give us the indices of the first three states in alphabetical order and that was `r i[1:3]`. Then R took those three values and plugged them into the second set of square brackets to show you the state names in those positions in the collection.
## Exercises
`r .exNum("What's the last state in \x60state.names\x60?")`
`r .exNum('Pick out states that begin with "M" using their indices')`
`r .exNum("Pick out states where you have lived")`
`r .exNum("What's the last state in alphabetical order?")`
`r .exNum("What are the last three states in alphabetical order?")`
# Logical values and operations
Logical values in R are the two values `TRUE` and `FALSE`, always written in all capital letters in R. You can also combine a bunch of `TRUE` and `FALSE` values into a collection.
```{r comment="", results='hold'}
TRUE
FALSE
c(TRUE,FALSE,TRUE,FALSE)
```
We use logical operators to create logical expressions and R can evaluate them as either `TRUE` or `FALSE`. For example, `&` represents the logical "and" and `|` represents the logical "or."
```{r comment="", results='hold'}
TRUE & TRUE
FALSE & TRUE
FALSE | TRUE
FALSE | FALSE
```
We can use R to compare values using greater than or less than symbols. We can also express "greater than or equal to" or "less than or equal to." These will evaluate to `TRUE` or `FALSE` depending, of course, on whether the statement is true or false.
```{r comment="", results='hold'}
6>5
6<5
6>=5
5<=5
```
We can combine logical operators into more complicated expressions.
```{r comment="", results='hold'}
(6>5) | (100<3)
(6>5) & (100<3)
```
Here are some additional examples. We are going to make `a` be the values 1 to 10 and then use logical operators to ask a question (like "are you equal to?" or "are you smaller than?") of each of those values. Note that the double equal sign `==` asks the question whether the two values are the same.
```{r comment="", results='hold'}
a <- 1:10
a==5
a!=5 # ! means "not"
a<5
a>=5
a>5 & a<8
a<3 | a>=7
```
The `%%` operator computes the remainder after dividing the left side by the right side.
```{r comment="", results='hold'}
13 %% 5 # = 3, 13/5 = 2 with remainder 3
a %% 2 == 0 # here's a way to ask each number if it's even
```
There are special functions `any()` and `all()` that check whether any or all of the values are true.
```{r comment="", results='hold'}
all(a<11)
all(a>5 & a<8)
any(a>5 & a<8)
```
Logical values may be used inside square brackets too. R will show you the values corresponding to `TRUE`s inside the square brackets and will eliminate any values corresponding to `FALSE`s. For example, let's store in `i` `TRUE` for even numbers and `FALSE` for odd numbers. So `i` will consist of ten logical values. Putting `i` inside the square brackets will extract just the values of `a` for which `i` has a `TRUE`.
```{r comment="", results='hold'}
i <- a%%2==0
i
a[i]
```
We can use `!`, which means "not," to reverse all the logical values and get the values of `a` that are not even.
```{r comment="", results='hold'}
a[!i]
```
Before, we removed DC from the list of states by noticing that it was in position #51. This time, let's have R do the work of locating DC in the collection of states. We'll have R ask each element in `state.names` whether or not it equals "DC".
```{r comment="", results='hold'}
i <- state.names!="DC"
state.names[i]
state.names[state.names!="DC"] # can also put directly inside []
```
The R operator `%in%` asks each value on the left whether or not it is a member of the set on the right.
```{r comment="", results='hold'}
a %in% c(3,7,10)
my.states <- c("MD","OH","VA","CA","WA","DC")
# do the above states touch the Pacific Ocean? (Make a list of states that touch the Pacific Ocean and compare with my.states)
my.states %in% c("CA","OR","WA","AK","HI")
# how many of these states touch the Pacific Ocean?
sum(my.states %in% c("CA","OR","WA","AK","HI"))
```
Note in the last line we used `sum()` to count for how many of the elements in `my.states` did `%in%` evaluate to be `TRUE`.
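Counting works because R treats `TRUE` as 1 and `FALSE` as 0 when doing arithmetic on logical values. The sketch below repeats the Pacific Ocean example with made-up object names (`lived`, `pacific`, `touches`) and also uses `mean()` to get a proportion.

```r
lived <- c("MD","OH","VA","CA","WA","DC")  # same codes as my.states above
pacific <- c("CA","OR","WA","AK","HI")
touches <- lived %in% pacific
touches
sum(touches)    # 2, the number of TRUE values
mean(touches)   # the fraction that are TRUE, here 2 out of 6
sum(!touches)   # 4, the number of FALSE values
```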
## Exercises
`r .exNum("Report \x60TRUE\x60 or \x60FALSE\x60 for each state depending on if you have lived there")`
`r .exNum("With \x60a <- 1:100\x60, pick out odd numbers between 50 and 75")`
`r .exNum("Use greater than and less than signs to get all state names that begin with M")`
# Sampling
The function `sample()` randomly shuffles a collection of values.
```{r comment="", results='hold'}
sample(1:10) # each time different values will appear
sample(1:10)
sample(1:10)
a <- sample(1:1000,size=10) # pick 10 numbers between 1-1000
a <- sample(1:6,size=1000,replace=TRUE) # roll a die 1000 times
```
Notice that `sample()` has several options including `size=` to indicate how many to select and `replace=` to indicate whether to sample with or without replacement. You can access the help on the `sample()` function by typing `?sample` at the R prompt.
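Because `sample()` gives different results each time, it can be handy to make a script reproducible. One common approach, sketched here, is to call `set.seed()` first. The seed value below is an arbitrary choice, and `first` and `second` are made-up names.

```r
set.seed(20141124)        # any whole number works; this one is arbitrary
first <- sample(1:10)
set.seed(20141124)        # reset to the same starting point
second <- sample(1:10)
identical(first, second)  # TRUE, the same seed gives the same shuffle
sort(first)               # still just the numbers 1 to 10, rearranged
```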
# Tabulating
The `table()` function counts how many of each value appear in a collection. We just set `a` to be a random collection of numbers 1 to 6, simulating rolling a die. With `table()` we can see how often each number appeared.
```{r comment="", results='hold'}
table(a)
max(table(a)) # how many times did the most frequent value appear?
```
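Keep in mind that `max()` applied to a table returns the largest count, not the value that produced it. To find out which value that count belongs to, one option is `which.max()`. Here is a sketch with a small made-up set of rolls so the answer is easy to check by eye.

```r
rolls <- c(2,6,3,6,1,6,2)   # a small made-up set of die rolls
counts <- table(rolls)
counts
max(counts)                 # 3, the largest count
which.max(counts)           # the table entry holding that count
names(which.max(counts))    # "6", the value that appeared most often
```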
## Exercises
`r .exNum("Use \x60sample()\x60 to estimate the probability of rolling a 6")`
`r .exNum("Use \x60sample()\x60 to estimate the probability that the sum of two dice equals 7")`
`r .exNum("Use \x60sample()\x60 to select randomly five states without replacement")`
`r .exNum("Use \x60sample()\x60 to select randomly 1000 states with replacement")`
+ Tabulate how often each state was selected
+ Which state was selected the least? Make R do this for you
# Lists
So far we have worked with very simple collections of numbers or text or logical values. Eventually we will need to work with more complicated kinds of data, like datasets, maps, and other objects. R stores these more complex objects in a list. A list is essentially a collection of objects, potentially of different types. Let's start with a simple list.
```{r comment="", results='hold'}
a <- list(1:3,5:1,1:10)
a
```
The list `a` has three components, each of which is a collection of values and each has a different length. Here's another list consisting of three components, each of which is a collection of a different type: numeric, text, and logical values.
```{r comment="", results='hold'}
b <- list(0:9, c("A","B","C"),c(TRUE,FALSE,NA))
b
```
We use a double set of square brackets to access the components of a list. Let's say we just want the first component of `a`, just the part with the numbers 1, 2, and 3.
```{r comment="", results='hold'}
a[[1]]
```
We can even grab the first element in the first component of the list `a`.
```{r comment="", results='hold'}
a[[1]][1]
```
Or we can select just the first and third components of the list `a`. This returns a new list without the second component.
```{r comment="", results='hold'}
a[c(1,3)]
```
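The difference between single and double square brackets on a list is easy to miss. The following sketch uses `class()` and `length()` to show what each one returns. The name `lst` is made up, and it holds the same contents as `a`.

```r
lst <- list(1:3, 5:1, 1:10)  # same contents as the list a above
class(lst[1])     # "list": single brackets return a shorter list
class(lst[[1]])   # "integer": double brackets return the component itself
length(lst[1])    # 1, a list holding one component
length(lst[[1]])  # 3, the three numbers inside that component
sum(lst[[1]])     # 6, works because lst[[1]] is a plain collection of numbers
```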
`lapply()` means "list apply" and lets us apply a given function to every item in a list and obtain a list in return. Let's say we want to sort each of the components in `a`. It would take too much typing to run `sort(a[[1]])` and `sort(a[[2]])` and `sort(a[[3]])`. Instead, `lapply()` can apply the sort function to each of the three components in `a`.
```{r comment="", results='hold'}
lapply(a,sort)
```
There is also a function `sapply()` that works in a manner quite similar to `lapply()`. The only difference is that `sapply()` will try to simplify the results. Think about the "s" meaning "simplified". Let's compute the number of elements in each component and the average of the numbers in each component.
```{r comment="", results='hold'}
sapply(a,length)
sapply(a,mean)
```
Since `length()` and `mean()` will return a single number for each component, the result can be simplified into a collection of three values, one for each component of the list.
Let's find the component that has the most values in it.
```{r comment="", results='hold'}
i <- which.max(sapply(a,length))
a[[i]]
```
If `sapply()` is not able to simplify the result, then the result is just like `lapply()`.
```{r comment="", results='hold'}
sapply(a,sort)
```
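To check whether `sapply()` managed to simplify, you can ask whether the result is still a list. This is only a sketch, and `lst` is a made-up name holding the same contents as `a`.

```r
lst <- list(1:3, 5:1, 1:10)   # same contents as the list a above
is.list(sapply(lst, length))  # FALSE, simplified to a plain collection
is.list(sapply(lst, sort))    # TRUE, could not simplify, so it stays a list
identical(sapply(lst, sort), lapply(lst, sort))  # TRUE
sapply(lst, range)            # two values per component gives a 2 by 3 matrix
```

Simplification depends on every component producing a result of the same length. `sort()` returns 3, 5, and 10 values here, so there is no tidy shape to simplify into.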
Let's return to our state example. Before we just had a collection of 51 postal codes. Instead, let's create a list that separates them into three components depending on whether they are in the west, east, or central United States.
```{r comment="", results='hold'}
state.list <- list(
west=c("AK","HI","WA","NV","CA","CO","UT","OR","AZ","NM","ID"),
east=c("KY","RI","PA","DE","DC","NJ","WV","MA","SC","NH","GA","CT","NY","IN",
"MS","AL","OH","NC","MD","VA","VT","FL","ME","TN"),
central=c("SD","MO","MN","ND","WY","OK","MI","IL","IA","LA","WI","MT","NE",
"AR","TX","KS"))
```
We can now use `lapply()` to ask R to sort each region, sample three states from each region, and tell us how many states are in each region.
```{r comment="", results='markup'}
lapply(state.list,sort)
lapply(state.list,sample,size=3,replace=FALSE)
sapply(state.list,length)
```
Notice here that we have given names (west, east, and central) to each of the three components of `state.list`. We can ask R to tell us what the names of the `state.list` components are.
```{r comment="", results='hold'}
names(state.list)
```
We can use the double square brackets to extract the western states. Since they are first in the list we use `[[1]]`
```{r comment="", results='hold'}
state.list[[1]]
```
However, this can be dangerous. Are we sure the first component has the western states? A safer approach is to call it by name inside the square brackets.
```{r comment="", results='hold'}
state.list[["west"]]
```
We can also use the `$` to extract a named component from a list.
```{r comment="", results='hold'}
state.list$west
```
The dollar sign in R is going to be extremely important. We will be using it a lot to extract variables, map components, and other values from lists.
You can use the `$` to add new components to a list. Let's add the postal codes for the United States territories and the freely associated states.
```{r comment="", results='hold'}
state.list$other <- c("AS","GU","MP","PR","VI","UM","FM","MH","PW")
```
What happens if we ran just the following?
```
other <- c("AS","GU","MP","PR","VI","UM","FM","MH","PW")
```
This creates a separate object called `other`, unconnected to our `state.list`. By using the `$` we add our new collection of states (other) to `state.list`.
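Here is the same distinction on a small made-up list, so you can watch the names change. The objects `regions` and `extra` and the component name `south` are hypothetical.

```r
regions <- list(west=c("CA","OR"), east=c("PA","NY"))  # small made-up list
extra <- c("PR","GU")               # a separate object; regions is unchanged
names(regions)                      # "west" "east"
regions$other <- c("PR","GU")       # now the list itself gains a third component
names(regions)                      # "west" "east" "other"
regions[["south"]] <- c("TX","FL")  # double brackets with a name work the same way
length(regions)                     # 4
```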
We have now created a lot of objects. At any time you can run `ls()` to list all the objects that R has in memory.
```{r comment="", results='hold'}
ls()
```
Assuming you are using R Studio, you can also see the objects stored in memory by clicking on the Environment tab.
## Exercises
`r .exNum('Fix \x60state.list\x60 so that "DC" is in "other" rather than "east"')`. Here are a few hints
+ access "other" using `$`
+ combine things using `c()`
+ assign values using `<-`
+ remove values using `[]` with a negative index or using a logical statement
`r .exNum("Print out east and central states together sorted")`
# Functions
So far you have seen several built-in functions in R, like `max()`, `sample()`, `is.na()`, and `table()`. These functions help us complete tasks that normally would take several lines of R code. They also make it easy to read R code... it's easy to know what `max(c(1,3,5,7,9))` means. In R you can also write your own functions. Let's say we want to just extract the first and last state from each component of `state.list`. Now this is not a particularly useful function, but we're going to use it just for demonstration.
```{r comment="", results='hold'}
give.first.and.last <- function(x)
{
   i <- c(1,length(x))
   return(x[i])
}
```
As you can see, the basic template of an R function is to give it a new name (here `give.first.and.last()`), followed by the syntax `<- function` (this tells R that what comes next is a function), followed by parentheses containing the names of arguments (you choose what to call them) that will be sent to this function (here we use the not very creative `x`), followed by squiggly braces containing R code to do calculations on `x`, with the last line being `return()` containing whatever final result the function calculates. Our function here creates `i` to contain the number 1 and the length of `x` so that it can figure out where the last value is. Then it simply returns `x[i]`, using the square brackets to pick out the values of `x` indexed by `i`, the first and last values in `x`. Let's try our new function out on the numbers 1 to 100.
```{r comment="", results='hold'}
give.first.and.last(1:100)
```
The primary benefit of writing a function is to simplify the reading of a script. It is much easier to comprehend what a script is doing if you have code that says something like `give.first.and.last()` rather than a bunch of square brackets picking out values. A secondary benefit is that you can use this function again and again to help solve other problems.
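Functions can take more than one argument, and arguments can have default values. As a sketch following the same template, here is a made-up function `give.first.n()` (not part of R) that returns the first `n` values of `x`, with `n` defaulting to 2.

```r
# give.first.n() is a made-up function for illustration, not part of R
give.first.n <- function(x, n=2)
{
   i <- seq_len(min(n, length(x)))  # 1, 2, ..., but never more than x has
   return(x[i])
}
give.first.n(1:100)               # 1 2, uses the default n=2
give.first.n(1:100, n=5)          # 1 2 3 4 5
give.first.n(c("PA","NJ"), n=10)  # "PA" "NJ", only two values available
```

Using `seq_len()` rather than `1:n` is a deliberate choice here: it gives an empty index when `x` is empty, so the function returns an empty result instead of `NA`.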
Let's combine `give.first.and.last()` with `lapply()` and `sapply()` to extract the first and last state in each component of our list.
```{r comment="", results='markup'}
lapply(state.list, give.first.and.last)
sapply(state.list, give.first.and.last)
```
Note how `sapply()` noticed that `give.first.and.last()` produces exactly two values for each component of the list and went ahead and simplified the result into a 2 by 4 table. Let's first sort the states within each region and then extract the first and last states. This will give us the first and last state in alphabetical order.
```{r comment="", results='markup'}
sapply(lapply(state.list,sort), give.first.and.last)
```
For many functions built into R you can see what they do by typing the name of the function. Here's how R computes the interquartile range of a collection of values.
```{r comment="", results='markup'}
IQR
```
You can see that it computes the 0.25 quantile and the 0.75 quantile and uses `diff()` to compute their difference.
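You can reproduce that calculation by hand. This sketch uses the made-up values 1 to 9, for which the 0.25 and 0.75 quantiles (under R's default quantile method) are 3 and 7.

```r
x <- 1:9                            # made-up values
qs <- quantile(x, c(0.25, 0.75))    # the 0.25 and 0.75 quantiles
qs
diff(qs)                            # 7 - 3 = 4, printed with the label "75%"
IQR(x)                              # 4, the same number without the label
```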
## Exercises
`r .exNum('Make a function \x60is.island(x)\x60 that returns \x60TRUE\x60 if \x60x\x60 is an island')`. Islands are "HI", "FM", "MH", "PW", "AS", "GU", "MP", "PR", "VI", "UM". Borrow the template I used for `give.first.and.last()`. Then try using the `%in%` operator.
`r .exNum("Count how many islands are within each region. Use an \x60sapply()\x60 (or two) and your new \x60is.island()\x60 function")`
`r .exNum("Which components of \x60b\x60 have missing values? Use \x60is.na()\x60")`. `b` was defined earlier
# Matrices and apply()
A matrix is a collection of values of the same type (all numbers or all text or all logical values) with one or more rows and one or more columns. Let's create a matrix with some random numbers.
```{r comment="", results='hold'}
a <- matrix(sample(1:5,size=12,replace=TRUE),nrow=4)
a
```
This matrix has two dimensions, 4 rows and 3 columns. You can use square brackets to select elements from the matrix.
```{r comment="", results='hold'}
a[1,2] # element in first row, second column
a[1,] # the entire first row
a[,2] # the entire second column
a[-1,-1] # dropping the first row and first column
a[3:4,2:3] # rows 3 & 4, columns 2 & 3
```
The numbers to the left of the comma index rows and the numbers to the right of the comma index columns. The `apply()` function, like the `lapply()` and `sapply()` functions, allows you to apply a function to all the rows or all the columns of a matrix. `apply()` needs the name of the matrix, whether you want to apply the function to the first dimension (rows) or the second dimension (columns), and the name of the function to apply.
```{r comment="", results='hold'}
apply(a, 1, sum) # compute sum of each row
apply(a, 2, sum) # compute sum of each column
apply(a, 1, mean) # compute mean of each row
apply(a, 1, summary) # summarize each row
```
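Since the matrix above is random, the results differ from run to run. Here is a sketch on a fixed, made-up matrix `m` so that the row and column sums can be checked by hand. It also shows `rowSums()` and `colSums()`, which are built-in shortcuts for these two common cases.

```r
m <- matrix(1:12, nrow=4)   # made-up matrix, filled column by column
m
apply(m, 1, sum)   # row sums: 15 18 21 24
apply(m, 2, sum)   # column sums: 10 26 42
rowSums(m)         # built-in shortcut, same numbers as apply(m, 1, sum)
colSums(m)         # same numbers as apply(m, 2, sum)
```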
We can also create a new function right on the spot to compute something on each row or column. Let's find the minimum and maximum values in each row and find out if all the values are greater than 1.
```{r comment="", results='hold'}
apply(a, 1, function(x) {c(min(x),max(x))}) # there is also a function range()
apply(a, 1, function(x) {all(x>1)})
```
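When the function returns more than one value per row, `apply()` arranges the results with one column per row of the original matrix, which can be surprising at first. A sketch with a fixed, made-up matrix `m`:

```r
m <- matrix(1:12, nrow=4)   # made-up 4 by 3 matrix
r <- apply(m, 1, range)     # min and max of each row
r
dim(r)                      # 2 4: one column for each row of m
t(r)                        # transpose to get one row for each row of m
dim(t(r))                   # 4 2
apply(m, 1, function(x) {all(x>1)})  # FALSE TRUE TRUE TRUE, only row 1 contains a 1
```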
# Setting the working directory
Now that we have covered a lot of fundamental R features, it is time to load in a real dataset. However, before we do that, R needs to know where to find the data file. So we first need to talk about "the working directory". When you start R, it has a default folder or directory on your computer where it will retrieve or save any files. You can run `getwd()` to get the current working directory. Here's our current working directory, which will not be the same as yours.
```{r comment=""}
getwd()
```
Almost certainly this default directory is *not* where you plan to have all of your datasets and files stored. Instead, you probably have an "analysis" or "project" or "R4crim" folder somewhere on your computer where you would like to store your data and work.
Use `setwd()` to tell R what folder you want it to use as the working directory. If you do not set the working directory, R will not know where to find the data you wish to import and will save your results in a location in which you would probably never look. Make it a habit to have `setwd()` as the first line of every script you write. If you know the working directory you want to use, then you can just put it inside the `setwd()` function.
```
setwd("C:/Users/gridge/Google Drive/R4crim")
```
Note that for all platforms, Windows, Macs, and Linux, the working directory only uses forward slashes. So Windows users be careful... most Windows applications use backslashes, but in an effort to make R scripts work across all platforms, R requires forward slashes. Backslashes have a different use in R that you will meet later.
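If you would rather not type the slashes yourself, `file.path()` can assemble a path with forward slashes for you. The sketch below uses placeholder folder names, so substitute your own. It also shows a safe way to try `setwd()` by switching to R's temporary folder and back.

```r
# "yourname" and the folder names are placeholders, not a real path
folder <- file.path("C:", "Users", "yourname", "R4crim")
folder                        # "C:/Users/yourname/R4crim"
# setwd(folder)               # uncomment once folder points to a real directory
old <- getwd()                # remember where we started
setwd(tempdir())              # tempdir() always exists, so this is safe to try
getwd()
setwd(old)                    # go back to the original working directory
```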
If you do not know how to write your working directory, here comes R Studio to the rescue. In R Studio click Session -> Set Working Directory -> Choose Directory. Then click through to navigate to the working directory that you want to use. When you find it click "Select Folder". Then look over at the console. R Studio will construct the right `setwd()` syntax for you. Copy and paste that into your script for use later. No need to have to click through the Session menu again now that you have your `setwd()` set up.
Now you can use R functions to load in any datasets that are in your working folder. If you have done your `setwd()` correctly, you shouldn't get any errors because R will know exactly where to look for the data files. If the working directory that you've given in the `setwd()` isn't right, R will think the file doesn't even exist. For example, if you give the path for, say, your R4econ folder, R won't be able to load data because the file isn't stored in what R thinks is your working directory. With that out of the way, let's load a dataset.
# Data frames
A data frame is a special case of a list where all the components of the list have the same number of elements. Think about each component of the list being a "column" in your dataset. R can load in datasets from numerous sources (plain text, Excel files, databases, websites, etc.) including .RData format, R's unique data format. There is an extensive guide to [importing and exporting datasets](https://cran.r-project.org/doc/manuals/r-release/R-data.pdf).
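To make the "data frame is a list" idea concrete, here is a small sketch using a made-up data frame (`toy.df` and its columns are invented for illustration and are not part of the Chicago data).

```r
# a small made-up data frame; every column has the same number of elements
toy.df <- data.frame(id       = 1:3,
                     offense  = c("THEFT", "BATTERY", "THEFT"),
                     arrested = c(FALSE, TRUE, FALSE))

is.list(toy.df)          # a data frame is also a list
length(toy.df)           # the number of components, that is, columns
sapply(toy.df, length)   # each component has the same number of elements
sapply(toy.df, class)    # list tools such as sapply() work column by column
```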
To import data in the .RData format use `load()`. A [sample of Chicago crime data](https://github.com/gregridgeway/R4criminology/blob/master/chicago%20crime%2020141124-20141209.RData) is available on the [R4Crim github site](https://github.com/gregridgeway/R4crim).
```{r comment="", results='hold'}
load("chicago crime 20141124-20141209.RData")
```
List the objects R now has in memory and you will see that there is a new object, `chicagoCrime`.
```{r comment="", results='hold'}
ls()
```
If you did not spell the name of the .RData file exactly correctly, then R will give you an error. A common occurrence when downloading the same file from the web multiple times is for your web browser to add numbers to the multiple versions you've downloaded. So check the file name carefully. Here's what happens when I request a file that doesn't exist.
```{r comment="", results=TRUE, warning=TRUE, error=TRUE}
load("chicago crime.RData")
```
If you get an error like this, then go double check that the file name is spelled exactly and you have correctly set the working directory.
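One way to do that double checking from within R is sketched below. The object `toy.crime` and the file name `toy crime.RData` are made up for illustration, and the file is written to R's temporary folder so that the sketch runs anywhere.

```r
# made-up object and file name, saved in R's temporary folder
toy.crime <- data.frame(id=1:3, offense=c("THEFT","BATTERY","THEFT"))
toy.file  <- file.path(tempdir(), "toy crime.RData")
save(toy.crime, file=toy.file)
rm(toy.crime)                     # remove the object from memory

# list the .RData files in that folder to check the exact spelling
list.files(tempdir(), pattern="\\.RData$")

# only try to load the file if R can actually see it
if(file.exists(toy.file))
{
   loaded <- load(toy.file)       # load() returns the names of the restored objects
   print(loaded)
} else
{
   cat("Cannot find", toy.file, "\n")
}
```

With your own data you would call `list.files()` with no folder to list the contents of the current working directory.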
Once you successfully load in a dataset, we can begin to explore it. Let's check that this is indeed a dataset. You can use the `is()` function on any R object to ask it to identify itself.
```{r comment="", results='hold'}
is(chicagoCrime)
```
You can see that `chicagoCrime` is of type `data.frame`... and it is also of type `list`. That means that anything that you can do to lists, like `lapply()` and `sapply()`, you can use on `chicagoCrime` too.
What are the names of the variables in the dataset?
```{r comment="", results='hold'}
names(chicagoCrime)
```
As expected, the data have information on the crime date, crime type, location (including latitude and longitude), whether an arrest occurred, and more.
Let's look at some parts of the dataset.
```{r comment="", results='markup'}
# look at the first three rows
chicagoCrime[1:3,]
# look at the first three rows and first three columns
chicagoCrime[1:3,1:3]
# look up the columns by name
chicagoCrime[1:3,c("Latitude","Longitude")]
```
Ask R what type of values each of the crime features contains.
```{r comment="", results='hold'}
# look at the types of each variable
sapply(chicagoCrime, is)
```
That gives a lot of detailed information. Here's a trick to just get the first value for each one.
```{r comment="", results='hold'}
sapply(chicagoCrime, function(x) is(x)[1])
```
Use `table()` and `sort()` to see what kinds of crimes are in this dataset.
```{r comment="", results='hold'}
# tabulate crimes
sort(table(chicagoCrime$Primary.Type))
sort(table(chicagoCrime$Description))
```
Note how we can use the `$` to extract just the `Primary.Type` and just the `Description` components of the dataset.
Just using `chicagoCrime$District` will give us all the values in that column.
```{r comment="",results='hold'}
chicagoCrime$District
```
If we want only those rows for crimes which happen in Chicago's District 10:
```{r comment="",results='hold'}
chicagoCrime[chicagoCrime$District==10,]
```
If we want only those rows for crimes which happen in Chicago's District 10, but only look at the values in the column `Primary.Type`:
```{r comment="",results='hold'}
chicagoCrime$Primary.Type[chicagoCrime$District==10]
```
What kinds of crimes occur in Chicago's District 10?
```{r comment="", results='hold'}
sort(table(chicagoCrime$Primary.Type[chicagoCrime$District==10]))
```
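One detail worth knowing about this style of row selection is how it treats missing values. The sketch below uses a made-up data frame with one missing district; whether this matters for `chicagoCrime` depends on whether `District` has any `NA`s, which is not assumed here.

```r
# made-up data with one missing district
toy.df <- data.frame(District     = c(10, 4, NA, 10),
                     Primary.Type = c("THEFT", "BATTERY", "ASSAULT", "NARCOTICS"))

# the comparison produces one TRUE, FALSE, or NA per row
toy.df$District==10

# a row where the comparison is NA comes back as a row filled with NAs
toy.df[toy.df$District==10,]

# which() keeps only the positions that are TRUE; subset() also drops the NAs
toy.df[which(toy.df$District==10),]
subset(toy.df, District==10)
```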
All these `chicagoCrime$`s are making our code long and harder to read. But we need to tell R to look inside `chicagoCrime` to find `Primary.Type` and `District`. `with()` can greatly simplify R code. Tell R to sort the table as before, but tell R that it can find all of the variables it is looking for in the `chicagoCrime` data frame.
```{r comment="", results='hold'}
with(chicagoCrime, sort(table(Primary.Type[District==10])))
```
Much easier to read and understand!
## Exercises
`r .exNum("Display three randomly selected rows")`
`r .exNum("Count \x60NA\x60s in each column")`
`r .exNum("Look up \x60Location.Description\x60, \x60Block\x60, \x60Beat\x60, and \x60Ward\x60 for those missing \x60Latitude\x60")`
# For loops
Sometimes we need to have R repeat certain tasks multiple times, such as marching through each row of a dataset and modifying values. For loops accomplish this. Later in this course we will be using Google Maps to extract information about addresses. So we might need to iterate through every row in the dataset, check whether the latitude and longitude are missing, and if missing try to retrieve the latitude and longitude from Google Maps. The last crime in the dataset missing coordinates is in row 9954.
```{r comment="", results='hold'}
chicagoCrime[9954,]
```
While the coordinates are missing, the street address, 081XX S THROOP ST, is (mostly) there. Chicago PD has masked the last two digits of the address so that we really only know the location down to the nearest block. Let's look up 8150 S Throop St, likely near the middle of the block, to see where this is. The Google Maps URL is [https://www.google.com/maps/place/8150+S+Throop+St,+Chicago,+IL](https://www.google.com/maps/place/8150+S+Throop+St,+Chicago,+IL). It would be a pain to have to type out each of these URLs for every address that we wanted to look up. So let's learn a little bit about for loops to see how this might work.
Here is a basic for loop that runs through the numbers 1 to 10 and prints them out one at a time.
```{r comment="", results='hold'}
for(i in 1:10)
{
print(i)
}
```
Note the basic structure. There's the keyword `for`. Inside the parentheses is a variable `i` (but you can use any variable name you want), the keyword `in`, and finally a collection of values, in this case the numbers 1 to 10. The for loop will march through this collection of values, assigning `i` each value in turn, and running the code inside the squiggly braces. So first `i` will be set to 1 and the `print()` function will print the value 1 to the screen. When that is done, `i` will take the next value in the collection, a 2, and the for loop will run the `print()` function again, printing the number 2. This continues until `i` takes the value 10 and `print()` prints that 10 to the screen.
Let's loop through all the states, printing out which number they are in the collection along with the state postal code.
```{r comment="", results='hold'}
for(i.state in 1:length(state.names))
{
print(c(i.state,state.names[i.state]))
}
```
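Looping over positions with `1:length()` works whenever the collection has at least one element. As a hedged aside, here is a sketch of `seq_along()`, which generates the same positions but also behaves sensibly when the collection is empty. The collection `toy.codes` is made up for illustration.

```r
toy.codes <- c("PA", "NJ", "NY")      # made-up short collection
for(i.code in seq_along(toy.codes))
{
   print(c(i.code, toy.codes[i.code]))
}

# with an empty collection, 1:length() counts down from 1 to 0,
# while seq_along() gives nothing to loop over
empty <- character(0)
1:length(empty)
seq_along(empty)
```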
Let's loop through all the letters of the alphabet and see if that letter is in the word "CRIME". `cat()` is like `print()`, but just dumps to the screen exactly what you give it^[Why "cat" you ask? Programmers in the early 1970s created a program called "cat" to concatenate files together, but most uses of "cat" were to just dump file contents to the screen or to some other program.]. `print()` will do some formatting to try to present the results a little nicer.
```{r comment="", results='hold'}
for(letter in c("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O",
"P","Q","R","S","T","U","V","W","X","Y","Z"))
{
print(letter)
if(letter %in% c("C","R","I","M","E"))
cat("The letter",letter,"is in the word 'CRIME'\n")
}
```
Actually, R has a built in collection, `LETTERS`, that contains all of the capital letters. There really was no need to type them all out. This works too.
```{r comment="", results='hold'}
for(letter in LETTERS)
{
print(letter)
if(letter %in% c("C","R","I","M","E"))
cat("The letter",letter,"is in the word 'CRIME'\n")
}
```
Let's loop through the states and check whether each one is an island or not.
```{r comment="", results='hide', echo=FALSE}
is.island <- function(x)
{
islands <- c("HI","FM","MH","PW","AS","GU","MP","PR","VI","UM")
return(x %in% islands)
}
```
```{r comment="", results='hold'}
for(nm.state in state.names)
{
print(nm.state)
if(is.island(nm.state))
cat(nm.state,"is an island\n")
}
```
Let's get back to our original problem of having R construct all the Google Map URLs that we need. First, we will create a new variable in the dataset called `google.maps.url` and fill it with empty text.
```{r comment="", results='hold'}
chicagoCrime$google.maps.url <- ""
```
Now let's loop through all 10,000 rows in the dataset. First, R will use `gsub()` to replace the XX in the house number with 50, so we get the location in the middle of the block. `gsub()` is like a Find-and-Replace function, but way more powerful and flexible. We will use it extensively when covering regular expressions. After fixing the house number, we use `paste()` to assemble a URL suitable for looking up addresses on Google Maps.
```{r comment="", results='hold'}
time4ForLoop <- system.time( # system.time() is like a stop watch
for(i in 1:nrow(chicagoCrime))
{
a <- gsub("XX", "50", chicagoCrime$Block[i])
chicagoCrime$google.maps.url[i] <- paste("https://www.google.com/maps/place/",
a,
",+Chicago,+IL",sep="")
}
)
```
Note that we've wrapped the for loop with a call to `system.time()`. This records how long the for loop takes to run. When creating these notes on a laptop it took `r time4ForLoop[3]` seconds. Not bad. Much faster than having to type out these 10,000 URLs. However, if we had one million addresses, then this code would take much more time.
In fact, for loops in R can be *very* slow compared with functions that work on a whole collection at once. They are slow enough that R programmers attempt to avoid them whenever possible. We can actually accomplish the same task without using a for loop. `gsub()` will accept a whole collection of addresses and modify them all at once. `paste()` also will accept a collection of text values and paste them together with the other parts.
```{r comment="", results='hold'}
timeWithoutForLoop <- system.time(
{
a <- gsub("XX","50",chicagoCrime$Block)
chicagoCrime$google.maps.url <- paste("https://www.google.com/maps/place/",
a,
",+Chicago,+IL",sep="")
}
)
```
This took `r timeWithoutForLoop[3]` seconds. That's `r round(time4ForLoop[3]/timeWithoutForLoop[3],1)` times faster than the for loop.
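To see on a small scale that the two approaches give the same answer, here is a sketch using two made-up addresses written in the same style as the `Block` column.

```r
# made-up masked addresses in the style of the Block column
toy.block <- c("081XX S THROOP ST", "012XX W MADISON ST")

# version 1: one address at a time
url.loop <- rep("", length(toy.block))
for(j in seq_along(toy.block))
{
   addr <- gsub("XX", "50", toy.block[j])
   url.loop[j] <- paste("https://www.google.com/maps/place/", addr,
                        ",+Chicago,+IL", sep="")
}

# version 2: gsub() and paste() applied to the whole collection at once
url.vec <- paste("https://www.google.com/maps/place/",
                 gsub("XX", "50", toy.block),
                 ",+Chicago,+IL", sep="")

identical(url.loop, url.vec)   # both approaches build the same URLs
url.vec[1]
```

Note that the spaces inside the address are left as they are. If you wanted the result to look like the hand-typed URL with `+` between the words, one more `gsub(" ", "+", ...)` on the address would do it.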
## Exercises
`r .exNum('Use a for loop to create a variable \x60Coordinates\x60 that looks like "(X.Coordinate,Y.Coordinate)"')`
+ Use `paste()` with the `X.Coordinate` and `Y.Coordinate` variables
+ Remember the `sep=` option in `paste()`
+ You might find the `with()` function useful for simplifying your code and avoiding a lot of `chicagoCrime$`s
`r .exNum("Redo the previous exercise without using a for loop and compare computation time")`
# More tabulating, aggregating, and breaking statistics down by group
The variable `Arrest` indicates whether someone was arrested for the crime. Here are the first 10 values.
```{r comment="", results='hold'}
chicagoCrime$Arrest[1:10]
```
We can compute the percentage of crimes with an arrest by calculating how often on average `Arrest=="true"`.
```{r comment="", results='hold'}
mean(chicagoCrime$Arrest=="true")
```
The `aggregate()` function will do this same calculation, but has options for breaking it down by some other crime feature. Let's use `aggregate()` to compute the percentage of crimes with an arrest by ward. We store the result in `a`.
```{r comment="", results='hold'}
a <- aggregate((Arrest=="true")~Ward, data=chicagoCrime, mean)
a
```
The first part of `aggregate()` gives an R formula for how we want the data broken up. On the left of the `~` is the outcome or feature that we want to study. Here it is whether or not `Arrest` has value true. To the right of the `~` is the feature by which we want to break down the arrests, ward in this case. Then we need to tell `aggregate()` in which data frame it can find `Arrest` and `Ward`. Lastly, we need to tell `aggregate()` what to do with the outcome we are studying. Here we are asking `aggregate()` to compute the mean so that we get an arrest percentage.
As a result, we have a data frame of two columns. In the left column, we have the ward number. In the right column, we have the fraction of crimes that result in an arrest: `Arrest=="true"`.
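Here is a small sketch of the same formula at work on made-up data, where the answers are easy to check by hand. The data frame `toy.crimes` is invented for illustration, and `tapply()` is included only as a cross-check on the group means.

```r
# made-up data: ward 1 has 1 arrest in 2 crimes, ward 2 has 1 arrest in 3 crimes
toy.crimes <- data.frame(Ward   = c(1, 1, 2, 2, 2),
                         Arrest = c("true", "false", "false", "false", "true"))

# the mean of TRUE/FALSE values is the fraction that are TRUE
mean(toy.crimes$Arrest=="true")

# the same calculation broken down by ward
toy.agg <- aggregate((Arrest=="true")~Ward, data=toy.crimes, mean)
toy.agg

# tapply() computes the same group means
toy.tapply <- with(toy.crimes, tapply(Arrest=="true", Ward, mean))
toy.tapply
```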
We can use `barplot()` to compare arrest percentages by ward.
```{r comment="", fig.width=6.5}
barplot(a$`(Arrest == "true")`,
names.arg = a$Ward,
cex.names = 0.5,
ylab = "Fraction arrested",
xlab = "Ward")
```
Note that the column in `a` containing the arrest fraction has a complicated name with several special symbols like `==` and `"`. R will get very confused unless we "protect" this variable name with the backquotes (also called backticks). You can visit the help for `barplot()` with `?barplot` to learn what all the arguments do.
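If the backquotes become tiresome, one option is to rename the column once. The sketch below uses a made-up data frame and a column name, `arrest.frac`, chosen only for illustration.

```r
toy.crimes <- data.frame(Ward   = c(1, 1, 2, 2, 2),
                         Arrest = c("true", "false", "false", "false", "true"))
toy.agg <- aggregate((Arrest=="true")~Ward, data=toy.crimes, mean)

names(toy.agg)                  # the second name has spaces, quotes, and ==
toy.agg$`(Arrest == "true")`    # backquotes protect the awkward name

# renaming the column once avoids the backquotes afterwards
names(toy.agg)[2] <- "arrest.frac"
toy.agg$arrest.frac
```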
Frequently we will focus on just a subset of the data. For example, we might just want to study assaults rather than all crimes. The `subset()` function does this for us, as in `subset(data, Primary.Type=="ASSAULT")`. This is particularly useful in combination with `with()`. Let's create a table of the number of arrests by ward, but only for assaults.
```{r comment="", results='hold'}
with(subset(chicagoCrime,Primary.Type=="ASSAULT"),
table(Arrest,Ward))
```
Let's recreate our barplot, but now just using assaults.
```{r comment="", fig.width=6.5}
a <- aggregate((Arrest=="true")~Ward,
data=subset(chicagoCrime,Primary.Type=="ASSAULT"),
mean)
barplot(a$`(Arrest == "true")`,
names.arg = a$Ward,
cex.names = 0.5,
ylab = "Fraction arrested",
xlab = "Ward",
main = "Arrest fraction for assaults")
```
## Exercises
`r .exNum('How many assaults occurred in the street? (\x60Location.Description=="STREET"\x60)')`. Try using `subset()` even though there are other ways
`r .exNum("What percentage of assaults occurred in the street by Ward?")`
# Plotting Data
R enables us to plot points. Plotting the latitude and longitude of each crime produces points that form the shape of Chicago... which makes total sense because we're using Chicago crime data.
```{r comment="", fig.width=6.5}
plot(Latitude~Longitude, data=chicagoCrime)
```
The `plot()` function here uses the same R formula syntax as the `aggregate()` function. The variable on the left of `~` is the outcome, plotted on the y-axis, and the variable on the right appears on the x-axis. And, of course, we need to tell `plot()` that it can find these variables inside the `chicagoCrime` data frame.
Let's plot the district with the most crime. The first line here tabulates how many crimes occurred in each district, sorts those counts, reverses the sorted list so that the largest one comes first, extracts the first one in the collection using `[1]` and then uses `names()` to extract the name of the district (rather than how many crimes occurred in that district). You can see all of District 8's crimes (that's the district with the most crimes) appearing as red points in the plot.
```{r comment="", fig.width=6.5}
# selects district 8, with 713 crimes
max.district <- names(rev(sort(table(chicagoCrime$District)))[1])
plot(Latitude~Longitude,
data=subset(chicagoCrime, District!=max.district), # not in District 8
pch=".", # plot with tiny dot
xlab="Longitude",ylab="Latitude")
points(Latitude~Longitude,
data=subset(chicagoCrime, District==max.district), # in District 8
pch=".",
col="red")
```
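The line that finds the district with the most crimes packs several steps into one. Here is a sketch that unpacks those steps on a handful of made-up district numbers. The final line shows `which.max()` as one possible shortcut; when two districts tie for the most crimes, the two approaches may pick different ones.

```r
toy.district <- c(8, 8, 3, 8, 3, 11)   # made-up district numbers

tab.d <- table(toy.district)     # count the crimes in each district
tab.d
sorted.d <- sort(tab.d)          # smallest count first
rev(sorted.d)                    # largest count first
rev(sorted.d)[1]                 # the largest count, labeled with its district
names(rev(sorted.d)[1])          # just the label, stored as text

# which.max() points at the first largest count directly
names(which.max(tab.d))
```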
R tries to set up default graphics settings so that most plots look okay, but sometimes it takes a little more work to adjust them. The good thing is that R lets you adjust everything. So let's make a barplot of the number of crimes of each type.
```{r comment="", fig.width=6.5}
barplot(table(chicagoCrime$Primary.Type))
```
The labels on the bars are so long that only a few of them appear. So let's spend a little more time, write a few more lines of R code, and make this plot look right.
```{r comment="", fig.width=6.5}
tab <- table(chicagoCrime$Primary.Type) # tabulate crime counts
# give 2.5in on the left margin to give lots of space for the crime type labels
par(pin=c(6.5,6), # set plot dimensions (inches)
mai=c(1.02, 2.5, 0, 0.3)) # set plot margins
a <- barplot(tab,
col="salmon", # change the bars' color
horiz=TRUE, # make the bars horizontal
names.arg=rep("",nrow(tab)), # put no labels on the bars
xlab="Number of crimes")
# add the bar labels on the y-axis
axis(2, # set up the y-axis label (axis #2)
at=a[,1], # midpoints of bars stored in a[,1]
cex.axis=0.7, # shrink the axis text size by 30%
labels=names(tab), # the bar labels
las=1, # make labels horizontal (see ?par)
tick=FALSE) # no tick marks on the axis
# add the actual number on the bars
text(ifelse(tab<80, 180, tab-5), # x-coord of text,
# if bar too small, put text to right
a[,1], # y-coord of text, midpoint of bars
tab, # text to add to the plot
cex=0.7, # shrink text (cex=character expansion)
adj=1) # right justify text
```
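Two habits can help when adjusting plots like this one. The sketch below, on a made-up table of crime types, shows a shorter variant that lets `barplot()` draw horizontal labels itself with `las=1`, and saves the graphics settings returned by `par()` so they can be restored afterwards. It is not a replacement for the code above, which gives finer control over the labels and adds the counts to the bars.

```r
# made-up crime types
toy.tab <- table(c("THEFT", "THEFT", "BATTERY", "CRIMINAL DAMAGE", "THEFT", "BATTERY"))

# optional: a null graphics device so that nothing is drawn;
#   drop this line and dev.off() below to see the plot
pdf(NULL)

old.par <- par(mai=c(1.02, 2.5, 0.3, 0.3))  # par() returns the previous settings
mids <- barplot(toy.tab,
                horiz=TRUE,                 # make the bars horizontal
                las=1,                      # make the labels horizontal
                cex.names=0.7,              # shrink the bar labels
                xlab="Number of crimes")
mids                                        # midpoints of the bars
par(old.par)                                # restore the earlier margins
dev.off()
```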
## Exercises
`r .exNum("Make a barplot indicating how many states are in each region. Use \x60state.list\x60")`
`r .exNum("Identify the beat with the most crimes")`
`r .exNum("Identify the beat with the most domestic violence incidents")`
`r .exNum("Part 1 crimes are homicide, robbery, assault, arson, burglary, theft, sex offense, motor vehicle theft. Calculate the number of Part 1 crimes in Chicago")`
# Solutions to the exercises
1. `r .exerciseQuestions[1]`
```{r comment=""}
(1:49)*2
```
or
```{r comment=""}
seq(2,98,by=2)
```
2. `r .exerciseQuestions[2]`
```{r comment=""}
mean((1:49)*2)
```
3. `r .exerciseQuestions[3]`
```{r comment=""}
sort(c("WA","DC","CA","PA","MD","VA","OH"))
```
4. `r .exerciseQuestions[4]`
```{r comment=""}
state.names[51]
```
5. `r .exerciseQuestions[5]`
```{r comment=""}
state.names[c(7,8,21,24,28,32,35,46)]
```
or sort first so that all the M states are together
```{r comment=""}
sort(state.names)[20:27]
```
Here's another possible answer that uses `substring` (which we haven't covered yet):
```{r comment=""}
state.names[substring(state.names, 1, 1)=="M"]
```
6. `r .exerciseQuestions[6]`
Of course, these may vary depending on where you have lived.
```{r comment=""}
state.names[c(1, 4, 10, 26)]
```
7. `r .exerciseQuestions[7]`
```{r comment=""}
sort(state.names)[51]
```
or
```{r comment=""}
rev(sort(state.names))[1]
```
8. `r .exerciseQuestions[8]`
```{r comment=""}
rev(sort(state.names))[1:3]
```
9. `r .exerciseQuestions[9]`
```{r comment=""}
my.states <- c("PA", "NJ", "NY", "MD", "DE", "MA", "RI", "CT", "ME", "LA", "IN")
state.names %in% my.states
```
10. `r .exerciseQuestions[10]`
```{r comment=""}
a <- 1:100
a[a %% 2==1 & a>50 & a<75]
```
11. `r .exerciseQuestions[11]`
```{r comment=""}
state.names[state.names>"LZ" & state.names<"N"]
```
12. `r .exerciseQuestions[12]`
```{r comment=""}
a <- sample(1:6, size=100000, replace=TRUE)
table(a)[6]/length(a)
```
Or
```{r comment=""}
sum(a==6)/length(a)
```
Or
```{r comment=""}
mean(a==6)
```
13. `r .exerciseQuestions[13]`
```{r comment=""}
dice1 <- sample(1:6, size=1000, replace=TRUE)
dice2 <- sample(1:6, size=1000, replace=TRUE)
doubleroll <- dice1 + dice2
mean(doubleroll==7) # should be close to 1/6 or 0.1666...
```
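The simulated value can be compared with the exact probability. As a sketch, `outer()` builds the table of all 36 equally likely pairs of faces, and 6 of those pairs sum to 7.

```r
# all 36 equally likely pairs of faces, summed
all.sums <- outer(1:6, 1:6, "+")
table(all.sums)
mean(all.sums==7)    # 6 of the 36 pairs sum to 7
```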
14. `r .exerciseQuestions[14]` (Answers will vary)
```{r comment=""}
sample(state.names, size=5, replace=FALSE)
```
15. `r .exerciseQuestions[15]`
+ Tabulate how often each state was selected (Answers will vary)
```{r comment=""}
a <- sample(state.names, size=1000, replace=TRUE)
table(a)
```
+ Which state was selected the least? (Answers will vary)
```{r comment=""}
sort(table(a))[1]
```
16. `r .exerciseQuestions[16]`
```{r comment=""}
state.list$east <- state.list$east[state.list$east!="DC"]
state.list$other <- c(state.list$other, "DC")
state.list
```
Or
```{r comment=""}
state.list$east <- setdiff(state.list$east, "DC")
state.list$other <- c(state.list$other, "DC")
state.list
```
17. `r .exerciseQuestions[17]`
```{r comment=""}
sort(c(state.list$east, state.list$central))
```
Or
```{r comment=""}
with(state.list, sort(c(east, central)))
```
18. `r .exerciseQuestions[18]`
```{r comment=""}
is.island <- function(x)
{
return(x %in% c("HI", "FM", "MH", "PW", "AS", "GU", "MP", "PR", "VI", "UM"))
}
```
19. `r .exerciseQuestions[19]`
First, this `lapply()` checks whether each state is an island.
```{r comment=""}
lapply(state.list, is.island)
```
Now we want to count up how many `TRUE`s there are in each component, so wrap this `lapply()` with an `sapply()`
```{r comment=""}
sapply(lapply(state.list, is.island), sum)
```
20. `r .exerciseQuestions[20]`
```{r comment=""}
sapply(lapply(b, is.na), any)
```
Or