---
title: "Introduction to R"
author:
- affiliation: University of Pennsylvania
  email: gridge@upenn.edu
  name: Greg Ridgeway
- affiliation: University of Pennsylvania
  email: moyruth@upenn.edu
  name: Ruth Moyer
- affiliation: University of Pennsylvania
  email: gohl@sas.upenn.edu
  name: Li Sian Goh
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    css: htmlstyle.css
---
<!-- HTML YAML header Ctrl-Shift-C to comment/uncomment -->

<!-- --- -->
<!-- title: "Introduction to R" -->
<!-- author: -->
<!-- - Greg Ridgeway (gridge@upenn.edu) -->
<!-- - Ruth Moyer (moyruth@upenn.edu) -->
<!-- date: "`r format(Sys.time(), '%B %d, %Y')`" -->
<!-- output: -->
<!--   pdf_document: -->
<!--     latex_engine: pdflatex -->
<!--   html_document: default -->
<!-- fontsize: 11pt -->
<!-- fontfamily: mathpazo -->
<!-- --- -->
<!-- PDF YAML header Ctrl-Shift-C to comment/uncomment -->


<!-- A function for automating the numbering and wording of the exercise questions -->
```{r echo=FALSE}
.counterExercise <- 0
.exerciseQuestions <- NULL
.exNum <- function(.questionText="") 
{
   .counterExercise <<- .counterExercise+1
   .exerciseQuestions <<- c(.exerciseQuestions, .questionText)
   .questionText <- gsub("@@", "`", .questionText)
   return(paste0(.counterExercise,". ",.questionText))
}
```

# Introduction

This is the first set of notes for an introduction to R programming for criminology and criminal justice. These notes assume that you have the latest version of R and R Studio installed. We are also assuming that you know how to start a new script file and submit code to the R console. From that basic knowledge about using R, we are going to start with `2+2` and by the end of this set of notes you will load in a small Chicago crime dataset, create a few plots, count some crimes, and be able to subset the data. Our aim is to lay a firm foundation on which the rest of these notes will build.

R sometimes provides useful help as to how to do something, such as choosing the right function or figuring out what the syntax of a line of code should be. Let's say we're stumped as to what the `sqrt()` function does. Just type `?sqrt` at the R prompt to read documentation on `sqrt()`. Most help pages have examples at the bottom that can give you a better idea about how the function works. R has over 7,000 functions and an often seemingly inconsistent syntax. As you do more complex work with R (such as using new packages), the Help tab can be useful.

# Basic Math and Functions in R

R, on a very unsophisticated level, is like a calculator. 

```{r comment="", results='hold'}
2+2
1*2*3*4
(1+2+3-4)/(5*7)
sqrt(2)
(1+sqrt(5))/2 # golden ratio
2^3
log(2.718281828)
round(2.718281828,3)
12^2 
factorial(4)
abs(-4)
```

# Combining values together into a collection (or vector)

We will use the `c()` function a lot. `c()` *c*ombines elements, like numbers and text to form a vector or a collection of values. If we wanted to combine the numbers 1 to 5 we could do
```{r comment=""}
c(1,2,3,4,5)
```
With the `c()` function, it's important to separate all of the items with commas. 

Conveniently, if you want to add 1 to each item in this collection, there's no need to add 1 like `c(1+1,2+1,3+1,4+1,5+1)`... that's a lot of typing. Instead R offers the shortcut
```{r comment=""}
c(1,2,3,4,5)+1
```
In fact, you can apply any mathematical operation to each value in the same way.
```{r comment="", results='hold'}
c(1,2,3,4,5)*2
sqrt(c(1,2,3,4,5))
(c(1,2,3,4,5)-3)^2
abs(c(-1,1,-2,2,-3,3))
```

Note in the examples below that you can also have a collection of non-numerical items. When combining text items, remember to use quotes around each item.
```{r comment="", results='hold'}
c("CRIM600","CRIM601","CRIM602","CRIM603")
c("yes","no","no",NA,NA,"yes")
```
In R, `NA` means a missing value. We'll do more exercises later using data containing some `NA` values. In any dataset, you're virtually guaranteed to find some NAs. The function `is.na()` helps determine whether there are any missing values (any NAs). In some of the problems below, we'll use `is.na()`.
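
As a quick preview, here is a minimal sketch of `is.na()` applied to the collection above. The object name `answers` is made up for this illustration.

```r
# 'answers' is a hypothetical name used only for this sketch
answers <- c("yes","no","no",NA,NA,"yes")
is.na(answers)       # TRUE wherever a value is missing
any(is.na(answers))  # is at least one value missing?
sum(is.na(answers))  # how many values are missing? here 2
```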

You can use double quotes or single quotes in R as long as you are consistent. When you have quotes inside the text you need to be particularly careful.
```{r comment="", results='hold'}
"Lou Gehrig's disease"
'The officer shouted "halt!"'
```
The backslashes in the printed output above "protect" the double quotes, communicating to you and to R that the next double quote is not the end of the text, but a character that is actually part of the text you want to keep.
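
You can also type the backslashes yourself. The sketch below, using the made-up names `s1` and `s2`, suggests that both ways of writing the text store the same characters.

```r
# two ways to write the same text containing double quotes
s1 <- 'The officer shouted "halt!"'    # single quotes on the outside
s2 <- "The officer shouted \"halt!\""  # backslashes protect the inner quotes
identical(s1, s2)   # TRUE, R stores the same characters either way
cat(s1, "\n")       # cat() displays the text without the protective backslashes
```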

The `c()` function isn't the only way to make a collection of values in R. For example, placing a `:` between two numbers can return a collection of numbers in sequence. The functions `rep()` and `seq()` produce repeated values or sequences.
```{r comment="", results='hold'}
1:10
5:-5
c(1,1,1,1,1,1,1,1,1,1)
rep(1,10)
rep(c(1,2),each=5)
seq(1, 5)
seq(1, 5, 2)
```

R will also do arithmetic with two vectors, doing the calculation pairwise. The following will compute 1+11 and 2+12 up to 10+20.
```{r comment=""}
1:10 + 11:20
```
Yet, other functions operate on the whole collection of values in a vector. See the following examples:
```{r comment="", results='hold'}
sum(c(1,10,3,6,2,5,8,4,7,9)) # sum
length(c(1,10,3,6,2,5,8,4,7,9)) # how many?
cumsum(c(1,10,3,6,2,5,8,4,7,9)) # cumulative sum
mean(c(1,10,3,6,2,5,8,4,7,9)) # mean of collection of 10 numbers
median(c(1,10,3,6,2,5,8,4,7,9)) # median of same population
```
There are also some functions in R that help us find the biggest and smallest values. For example:
```{r comment="", results='hold'}
max(c(1,10,3,6,2,5,8,4,7,9)) # what is the biggest value in vector?
which.max(c(1,10,3,6,2,5,8,4,7,9)) # in which "spot" would we find it?
min(c(1,10,3,6,2,5,8,4,7,9)) # what is the smallest value in vector?
which.min(c(1,10,3,6,2,5,8,4,7,9)) # in which "spot" would we find it?
```
A lot of functions in R are to help you see and understand what's in a dataset. For example, we can rearrange a collection of values in ascending or descending order. Note the `order()` function.  How is it similar to the `which.max()` or `which.min()` function?  Note the `sort()` function.
```{r comment="", results='hold'}
sort(c(1,10,3,6,2,5,8,4,7,9))
rev(c(1,10,3,6,2,5,8,4,7,9))
rev(sort(c(1,10,3,6,2,5,8,4,7,9)))
sort(c(1,10,3,6,2,5,8,4,7,9),decreasing=TRUE)
order(c(1,10,3,6,2,5,8,4,7,9)) # where is the ith smallest number?
rank(c(1,100,3,20)) # how does each value rank compared to the others?
```

The above examples have involved mostly numerical values in a vector. Here are some examples involving non-numerical "character" values. Let's create an object called `my.states` (a name I made up) that will contain the postal codes of places in which I've lived or worked.
```{r comment="", results='hold'}
my.states <- c("WA","DC","CA","PA","MD","VA","OH")
```
Take a look at the arrow `<-` (pronounced 'gets'). This is how you tell R to take the result of what is on the right and store it in an object named on the left. We're going to talk more about this arrow soon. Now let's run some new functions on this collection of postal codes.
```{r comment="", results='hold'}
nchar(my.states)
paste(my.states, ", USA")
paste(my.states, ", USA", sep="")
paste0(my.states, ", USA")
paste(my.states, collapse=",")
```
The `nchar()` function counts how many characters are in each character string. The `paste()` function pastes character strings together. By default, `paste()` puts a space between the strings being pasted together. It looks strange with that space after WA in "WA , USA". We can set the separator to be nothing (the empty string) by setting `sep=""`. `paste0()` is a shortcut for pasting with `sep=""`. Setting `collapse=","` combines all the text together, collapsing them into one string with a comma as a separator.

## Exercises
`r .exNum("Print all even numbers less than 100")`
`r .exNum("What is the mean of even numbers less than 100")`
`r .exNum('Have R put in alphabetical order \x60c("WA","DC","CA","PA","MD","VA","OH")\x60')` 

# Assignment of values to variables

The left-facing arrow symbol is an extremely important tool in R. Try the following:
```{r comment="", results='hold'}
a <- 1
```
Now type:
```{r comment="", results='hold'}
a
```

R has assigned `a` the value 1. Here are more examples:
```{r comment="", results='hold'}
b <- 2+2
a <- a+b
a <- 1:10
b <- 2*a
a+b
sd(a)
state.names <- c("WV","OH","OK","NV","CA","IN","MA","MI","IL","IA","SC","NH",
                 "LA","GA","CT","WI","CO","NY","UT","AK","MS","AL","OR","MT",
                 "ND","WY","FL","ME","AZ","TN","PA","MN","NM","SD","MO","RI",
                 "HI","WA","DE","NJ","NE","KY","AR","TX","NC","MD","VA","VT",
                 "KS","ID","DC")
```
R programmers typically pronounce the `<-` as "gets". So we would read `a <- 1` as "a gets one".

# Indexing 

We can extract items from a vector, matrix, or data frame using indexing. In R, we use square brackets to index. 

```{r comment="", results='hold'}
state.names[1] # get the first state
state.names[1:3] # get the first three states
state.names[c(1,5,9)] # get states 1, 5, and 9
state.names[2*(1:25)] # get the even states
```

If you put a negative number inside the `[]`, this will communicate to R to remove that item from the collection. Let's remove DC from `state.names` since it is not one of the 50 states. Since it is the 51st item in `state.names` we can remove it like this 
```{r comment="", results='hold'}
state.names[-51]
```
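A negative index shows a shortened copy and leaves the original collection untouched. To keep the shortened version, store it with `<-`. Here is a small sketch with a made-up collection `codes`.

```r
codes <- c("PA","NJ","DC")   # hypothetical three-item collection
codes[-3]                    # shows the collection without the third item
length(codes)                # still 3, codes itself has not changed
codes.no.dc <- codes[-3]     # store the result to keep the shortened version
length(codes.no.dc)          # 2
```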

Let's combine the sort and order functions from above (along with variable assignment) with the concept of indexing. 

```{r comment="", results='hold'}
sort(state.names)[1] # sort, then give the first value
i <- order(state.names) # index the states in order
i[1:3]                  # which positions are the first three
state.names[i[1:3]]     # show me those three states
```
Note that in the last example we used square brackets within square brackets. First, we asked R to give us the indices of the first three states in alphabetical order and that was `r i[1:3]`. Then R took those three values and plugged them into the second set of square brackets to show you the state names in those positions in the collection.
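
The same idea connects `order()` to `sort()`, `which.min()`, and `which.max()`. The sketch below reuses the ten numbers from earlier under the made-up name `x`.

```r
x <- c(1,10,3,6,2,5,8,4,7,9)   # 'x' is a hypothetical name for this sketch
sort(x)               # the values in ascending order
x[order(x)]           # indexing x by order(x) gives the same result
order(x)[1]           # position of the smallest value, same as which.min(x)
order(x)[length(x)]   # position of the largest value, same as which.max(x)
```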

## Exercises
`r .exNum("What's the last state in \x60state.names\x60?")`
`r .exNum('Pick out states that begin with "M" using their indices')`
`r .exNum("Pick out states where you have lived")`
`r .exNum("What's the last state in alphabetical order?")`
`r .exNum("What are the last three states in alphabetical order?")`


# Logical values and operations
Logical values in R are the two values `TRUE` and `FALSE`, always written in all capital letters in R. You can also combine a bunch of `TRUE` and `FALSE` values into a collection.
```{r comment="", results='hold'}
TRUE
FALSE
c(TRUE,FALSE,TRUE,FALSE)
```
We use logical operators to create logical expressions and R can evaluate them as either `TRUE` or `FALSE`. For example, `&` represents the logical "and" and `|` represents the logical "or."
```{r comment="", results='hold'}
TRUE  & TRUE
FALSE & TRUE
FALSE | TRUE
FALSE | FALSE
```
We can use R to compare values using greater than or less than symbols. We can also express "greater than or equal to" or "less than or equal to." These will evaluate to `TRUE` or `FALSE` depending, of course, on whether the statement is true or false.
```{r comment="", results='hold'}
6>5
6<5
6>=5
5<=5
```
We can combine logical operators into more complicated expressions.
```{r comment="", results='hold'}
(6>5) | (100<3)
(6>5) & (100<3)
```

Here are some additional examples. We are going to make `a` be the values 1 to 10 and then use logical operators to ask a question (like "are you equal to?" or "are you smaller than?") of each of those values. Note that the double equal sign `==` asks the question whether the two values are the same. 
```{r comment="", results='hold'}
a <- 1:10
a==5
a!=5  # ! means "not"
a<5
a>=5
a>5 & a<8 
a<3 | a>=7
```

The `%%` operator computes the remainder after dividing the left side by the right side.
```{r comment="", results='hold'}
13 %% 5      # = 3, 13/5 = 2 with remainder 3
a %% 2 == 0  # here's a way to ask each number if it's even
```


There are special functions `any()` and `all()` that check whether any or all of the values are `TRUE`.
```{r comment="", results='hold'}
all(a<11)
all(a>5 & a<8)
any(a>5 & a<8)
```

Logical values may be used inside square brackets too. R will show you the values corresponding to `TRUE`s inside the square brackets and will eliminate any values corresponding to `FALSE`s. For example, let's store in `i` `TRUE` for even numbers and `FALSE` for odd numbers. So `i` will consist of ten logical values. Putting `i` inside the square brackets will extract just the values of `a` for which `i` has a `TRUE`.
```{r comment="", results='hold'}
i <- a%%2==0
i
a[i]
```
We can use `!`, which means "not," to reverse all the logical values and get the values of `a` that are not even.
```{r comment="", results='hold'}
a[!i]
```

Before, we removed DC from the list of states by noticing that it was in position #51. This time, let's have R do the work of locating DC in the collection of states. We'll have R ask each element in `state.names` whether or not it differs from "DC" and keep only those that do.
```{r comment="", results='hold'}
i <- state.names!="DC"
state.names[i]
state.names[state.names!="DC"] # can also put directly inside []
```

The R operator `%in%` asks each value on the left whether or not it is a member of the set on the right.
```{r comment="", results='hold'}
a %in% c(3,7,10)

my.states <- c("MD","OH","VA","CA","WA","DC")
# do the above states touch the Pacific Ocean? (Make a list of states that touch the Pacific Ocean and compare with my.states)
my.states %in% c("CA","OR","WA","AK","HI")
# how many of these states touch the Pacific Ocean?
sum(my.states %in% c("CA","OR","WA","AK","HI"))
```
Note that in the last line we used `sum()` to count how many elements of `my.states` made `%in%` evaluate to `TRUE`. This works because R treats `TRUE` as 1 and `FALSE` as 0 when doing arithmetic.
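
Here is a small sketch of why adding up logical values counts the `TRUE`s. The object `flags` is made up for this illustration.

```r
flags <- c(TRUE,FALSE,TRUE,TRUE)  # hypothetical logical values
as.numeric(flags)                 # TRUE becomes 1, FALSE becomes 0
sum(flags)                        # so the sum is the number of TRUEs, here 3
```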

## Exercises
`r .exNum("Report \x60TRUE\x60 or \x60FALSE\x60 for each state depending on if you have lived there")`
`r .exNum("With \x60a <- 1:100\x60, pick out odd numbers between 50 and 75")`
`r .exNum("Use greater-than and less-than signs to get all state names that begin with M")`

# Sampling
The function `sample()` randomly shuffles a collection of values.
```{r comment="", results='hold'}
sample(1:10) # each time different values will appear
sample(1:10)
sample(1:10)
a <- sample(1:1000,size=10) # pick 10 numbers between 1-1000
a <- sample(1:6,size=1000,replace=TRUE) # roll a die 1000 times
```
Notice that `sample()` has several options including `size=` to indicate how many to select and `replace=` to indicate whether to sample with or without replacement. You can access the help on the `sample()` function by typing `?sample` at the R prompt.
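
Because `sample()` is random, your results will differ from ours and from run to run. If you ever need the same random result twice, one common approach is to call `set.seed()` first. This is a minimal sketch. The seed value is arbitrary and the names `first` and `second` are made up.

```r
set.seed(20141124)       # arbitrary seed chosen for this sketch
first <- sample(1:10)
set.seed(20141124)       # resetting the seed restarts the random sequence
second <- sample(1:10)
identical(first, second) # TRUE, the two shuffles match
```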

# Tabulating

The `table()` function counts how many of each value appear in a collection. We just set `a` to be a random collection of numbers 1 to 6, simulating rolling a die. With `table()` we can see how often each number appeared. 
```{r comment="", results='hold'}
table(a)
max(table(a)) # how many times does the most frequent value appear?
```
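To find *which* value appears most often, rather than how often it appears, one option is `which.max()` on the table. The sketch below uses a short made-up collection `rolls` so that the results are repeatable.

```r
rolls <- c(3,6,6,2,6,1,3)   # hypothetical die rolls, fixed for this sketch
tab <- table(rolls)
max(tab)                    # the largest count, here 3
names(which.max(tab))       # the value with the largest count, here "6"
names(tab)[tab==max(tab)]   # same idea, but lists every value tied for the top
```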
## Exercises
`r .exNum("Use \x60sample()\x60 to estimate the probability of rolling a 6")`
`r .exNum("Use \x60sample()\x60 to estimate the probability that the sum of two dice equals 7")`
`r .exNum("Use \x60sample()\x60 to select randomly five states without replacement")`
`r .exNum("Use \x60sample()\x60 to select randomly 1000 states with replacement")`
    + Tabulate how often each state was selected
    + Which state was selected the least? Make R do this for you

# Lists 

So far we have worked with very simple collections of numbers or text or logical values. Eventually we will need to work with more complicated kinds of data, like datasets, maps, and other objects. R stores these more complex objects in a list. A list is essentially a collection of objects, potentially of different types. Let's start with a simple list.
```{r comment="", results='hold'}
a <- list(1:3,5:1,1:10)
a
```
The list `a` has three components, each of which is a collection of values and each of a different length. Here's another list consisting of three components, each a collection of a different type: numeric, text, and logical values.
```{r comment="", results='hold'}
b <- list(0:9, c("A","B","C"),c(TRUE,FALSE,NA))
b
```
We use a double set of square brackets to access the components of a list. Let's say we just want the first component of `a`, just the part with the numbers 1, 2, and 3.
```{r comment="", results='hold'}
a[[1]]
```
We can even grab the first element in the first component of the list `a`.
```{r comment="", results='hold'}
a[[1]][1]
```
Or we can select just the first and third components of the list `a`. Note the single set of square brackets here. This returns a new list, just without the second component.
```{r comment="", results='hold'}
a[c(1,3)]
```
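The difference between single and double square brackets is easy to miss. As a rough sketch, using a made-up list `z` with the same structure as `a`:

```r
z <- list(1:3, 5:1, 1:10)  # 'z' is a hypothetical name for this sketch
is.list(z[1])     # TRUE, single brackets return a (shorter) list
is.list(z[[1]])   # FALSE, double brackets return the component itself
sum(z[[1]])       # 6, arithmetic works on the extracted component
```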

`lapply()` means "list apply" and lets us apply a given function to every item in a list and obtain a list in return. Let's say we want to sort each of the components in `a`. It would take too much typing to run `sort(a[[1]])` and `sort(a[[2]])` and `sort(a[[3]])`. Instead, `lapply()` can apply the sort function to each of the three components in `a`.
```{r comment="", results='hold'}
lapply(a,sort)
```
There is also a function `sapply()` that works in a manner quite similar to `lapply()`. The only difference is that `sapply()` will try to simplify the results. Think about the "s" meaning "simplified". Let's compute the number of elements in each component and the average of the numbers in each component.
```{r comment="", results='hold'}
sapply(a,length)
sapply(a,mean)
```
Since `length()` and `mean()` will return a single number for each component, the result can be simplified into a collection of three values, one for each component of the list.

Let's find the component that has the most values in it.
```{r comment="", results='hold'}
i <- which.max(sapply(a,length))
a[[i]]
```
If `sapply()` is not able to simplify the result, then the result is just like `lapply()`.
```{r comment="", results='hold'}
sapply(a,sort)
```
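You can check which kind of result `sapply()` produced with `is.list()`. This sketch again uses a made-up list `z` with the same structure as `a`.

```r
z <- list(1:3, 5:1, 1:10)   # 'z' is a hypothetical name for this sketch
is.list(sapply(z, sort))    # TRUE, the sorted pieces differ in length so no simplification
is.list(sapply(z, length))  # FALSE, one number per component simplifies to a vector
```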

Let's return to our state example. Before we just had a collection of 51 postal codes. Instead, let's create a list that separates them into three components depending on whether they are in the west, east, or central United States.
```{r comment="", results='hold'}
state.list <- list(
   west=c("AK","HI","WA","NV","CA","CO","UT","OR","AZ","NM","ID"),
   east=c("KY","RI","PA","DE","DC","NJ","WV","MA","SC","NH","GA","CT","NY","IN",
          "MS","AL","OH","NC","MD","VA","VT","FL","ME","TN"),
   central=c("SD","MO","MN","ND","WY","OK","MI","IL","IA","LA","WI","MT","NE",
             "AR","TX","KS"))
```

We can now use `lapply()` to ask R to sort each region, sample three states from each region, and tell us how many states are in each region.
```{r comment="", results='markup'}
lapply(state.list,sort)
lapply(state.list,sample,size=3,replace=FALSE)
sapply(state.list,length)
```

Notice here that we have given names (west, east, and central) to each of the three components of `state.list`. We can ask R to tell us what the names of the `state.list` components are.
```{r comment="", results='hold'}
names(state.list)
```

We can use the double square brackets to extract the western states. Since they are first in the list we use `[[1]]`
```{r comment="", results='hold'}
state.list[[1]]
```
However, this can be dangerous. Are we sure the first component has the western states? A safer approach is to call it by name inside the square brackets.
```{r comment="", results='hold'}
state.list[["west"]]
```

We can also use the `$` to extract a named component from a list. 
```{r comment="", results='hold'}
state.list$west

```
The dollar sign in R is going to be extremely important. We will be using it a lot to extract variables, map components, and other values from lists.

You can use the `$` to add new components to a list. Let's add all the postal codes for all of the United States territories.
```{r comment="", results='hold'}
state.list$other <- c("AS","GU","MP","PR","VI","UM","FM","MH","PW")

```

What would happen if we ran just the following?
```
other <- c("AS","GU","MP","PR","VI","UM","FM","MH","PW")
```
This creates a separate object called `other`, unconnected to our `state.list`. By using the `$` we add our new collection of states (other) to `state.list`.

We have now created a lot of objects. At any time you can run `ls()` to list all the objects that R has in memory.
```{r comment="", results='hold'}
ls()

```
Assuming you are using R Studio, you can also see the objects stored in memory by clicking on the Environment tab.

## Exercises
`r .exNum('Fix \x60state.list\x60 so that "DC" is in "other" rather than "east"')`. Here are a few hints
     + access "other" using `$`
     + combine things using `c()`
     + assign values using `<-`
     + remove values using `[]` with a negative index or using a logical statement
`r .exNum("Print out east and central states together sorted")`


# Functions
So far you have seen several built-in functions in R, like `max()`, `sample()`, `is.na()`, and `table()`. These functions help us complete tasks that normally would take several lines of R code. They also make it easy to read R code... it's easy to know what `max(c(1,3,5,7,9))` means. In R you can also write your own functions. Let's say we want to just extract the first and last state from each component of `state.list`. Now this is not a particularly useful function, but we're going to use it just for demonstration. 
```{r comment="", results='hold'}
give.first.and.last <- function(x)
{
   i <- c(1,length(x))
   return(x[i])
}
```
As you can see, the basic template of an R function has a few parts. First, give it a new name (here `give.first.and.last()`), followed by the syntax `<- function`, which tells R that what comes next is a function. Next come parentheses containing the names of the arguments that will be sent to the function. You choose what to call them, and here we use the not very creative `x`. Last come curly braces containing R code to do calculations on `x`, with the last line being `return()` containing whatever final result the function calculates. Our function here creates `i` to contain the number 1 and the length of `x`, which are the positions of the first and last values. Then it simply returns `x[i]`, using the square brackets to pick out the values of `x` indexed by `i`, the first and last values in `x`. Let's try our new function out on the numbers 1 to 100.

```{r comment="", results='hold'}
give.first.and.last(1:100)
```
The primary benefit of writing a function is to simplify the reading of a script. It is much easier to comprehend what a script is doing if you have code that says something like `give.first.and.last()` rather than a bunch of square brackets picking out values. A secondary benefit is that you can use this function again and again to help solve other problems.

Let's combine `give.first.and.last()` with `lapply()` and `sapply()` to extract the first and last state in each component of our list.
```{r comment="", results='markup'}
lapply(state.list, give.first.and.last)
sapply(state.list, give.first.and.last)
```
Note how `sapply()` noticed that `give.first.and.last()` produces exactly two values for each component of the list and went ahead and simplified the result into a 2 by 4 table. Let's first sort the states within each region and then extract the first and last states. This will give us the first and last state in alphabetical order.
```{r comment="", results='markup'}
sapply(lapply(state.list,sort), give.first.and.last)
```

For many functions built into R you can see what they do by typing the name of the function. Here's how R computes the interquartile range of a collection of values.
```{r comment="", results='markup'}
IQR
```
You can see that it computes the 0.25 quantile and the 0.75 quantile and uses `diff()` to compute their difference.

## Exercises
`r .exNum('Make a function \x60is.island(x)\x60 that returns \x60TRUE\x60 if \x60x\x60 is an island')`. Islands are "HI", "FM", "MH", "PW", "AS", "GU", "MP", "PR", "VI", "UM". Borrow the template I used for `give.first.and.last()`. Then try using the `%in%` operator
`r .exNum("Count how many islands are within each region. Use an \x60sapply()\x60 (or two) and your new \x60is.island()\x60 function")`
`r .exNum("Which components of \x60b\x60 have missing values? Use \x60is.na()\x60")`. `b` was defined earlier

# Matrices and apply()

A matrix is a collection of values of the same type (all numbers or all text or all logical values) with one or more rows and one or more columns. Let's create a matrix with some random numbers.
```{r comment="", results='hold'}
a <- matrix(sample(1:5,size=12,replace=TRUE),nrow=4)
a
```
This matrix has two dimensions, 4 rows and 3 columns. You can use square brackets to select elements from the matrix.
```{r comment="", results='hold'}
a[1,2]     # element in first row, second column
a[1,]      # the entire first row
a[,2]      # the entire second column
a[-1,-1]   # dropping the first row and first column
a[3:4,2:3] # rows 3 & 4, columns 2 & 3
```
The numbers to the left of the comma index rows and the numbers to the right of the comma index columns. The `apply()` function, like the `lapply()` and `sapply()` functions, allows you to apply a function to all the rows or all the columns of a matrix. `apply()` needs the name of the matrix, whether you want to apply the function to the first dimension (rows) or the second dimension (columns), and the name of the function to apply.
```{r comment="", results='hold'}
apply(a, 1, sum)     # compute sum of each row
apply(a, 2, sum)     # compute sum of each column
apply(a, 1, mean)    # compute mean of each row
apply(a, 1, summary) # summarize each row
```
We can also create a new function right on the spot to compute something on each row or column. Let's find the minimum and maximum values in each row and find out if all the values are greater than 1.
```{r comment="", results='hold'}
apply(a, 1, function(x) {c(min(x),max(x))}) # there is also a function range()
apply(a, 1, function(x) {all(x>1)})
```
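When the function returns more than one value per row, `apply()` stores each row's result in a *column* of its output, which can be surprising at first. Here is a sketch with a made-up matrix `m` whose values are fixed rather than random.

```r
m <- matrix(1:12, nrow=4)   # hypothetical 4-by-3 matrix, filled column by column
apply(m, 1, range)          # 2-by-4 result: one column for each row of m
apply(m, 1, function(x) {max(x)-min(x)})  # spread of each row, 8 for every row here
```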


# Setting the working directory
Now that we have covered a lot of fundamental R features, it is time to load in a real dataset. However, before we do that, R needs to know where to find the data file. So we first need to talk about "the working directory". When you start R, it has a default folder or directory on your computer where it will retrieve or save any files. You can run `getwd()` to get the current working directory. Here's our current working directory, which will not be the same as yours.
```{r comment=""}
getwd()    
```
Almost certainly this default directory is *not* where you plan to have all of your datasets and files stored. Instead, you probably have an "analysis" or "project" or "R4crim" folder somewhere on your computer where you would like to store your data and work.

Use `setwd()` to tell R what folder you want it to use as the working directory. If you do not set the working directory, R will not know where to find the data you wish to import and will save your results in a location in which you would probably never look. Make it a habit to have `setwd()` as the first line of every script you write. If you know the working directory you want to use, then you can just put it inside the `setwd()` function.
```
setwd("C:/Users/gridge/Google Drive/R4crim")
```
Note that for all platforms, Windows, Macs, and Linux, the working directory only uses forward slashes. So Windows users be careful... most Windows applications use backslashes, but in an effort to make R scripts work across all platforms, R requires forward slashes. Backslashes have a different use in R that you will meet later.
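
If you would rather not type the slashes yourself, `file.path()` can assemble a path from folder names. The sketch below only builds and displays the path, using the folder names from the example above. It does not call `setwd()`, since that folder will not exist on your computer.

```r
# file.path() joins folder names with forward slashes
p <- file.path("C:", "Users", "gridge", "Google Drive", "R4crim")
p
# a doubled backslash inside quotes stands for one literal backslash
nchar("C:\\Users")   # 8 characters: C, a colon, one backslash, and Users
```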

If you do not know how to write your working directory, here comes R Studio to the rescue. In R Studio click Session -> Set Working Directory -> Choose Directory. Then click through to navigate to the working directory that you want to use. When you find it click "Select Folder". Then look over at the console. R Studio will construct the right `setwd()` syntax for you. Copy and paste that into your script for use later. No need to have to click through the Session menu again now that you have your `setwd()` set up.

Now you can use R functions to load in any datasets that are in your working folder. If you have done your `setwd()` correctly, you shouldn't get any errors because R will know exactly where to look for the data files. If the working directory that you've given in the `setwd()` isn't right, R will think the file doesn't even exist. For example, if you give the path for, say, your R4econ folder, R won't be able to load data because the file isn't stored in what R thinks is your working directory. With that out of the way, let's load a dataset.

# Data frames
A data frame is a special case of a list where all the components of the list have the same number of elements. Think about each component of the list being a "column" in your dataset. R can load in datasets from numerous sources (plain text, Excel files, databases, websites, etc.) including .RData format, R's own data format. There is an extensive guide to [importing and exporting datasets](https://cran.r-project.org/doc/manuals/r-release/R-data.pdf).

To import data in the .RData format use `load()`. A [sample of Chicago crime data](https://github.com/gregridgeway/R4criminology/blob/master/chicago%20crime%2020141124-20141209.RData) is available on the [R4Crim github site](https://github.com/gregridgeway/R4crim).
```{r comment="", results='hold'}
load("chicago crime 20141124-20141209.RData")
```
List the objects R now has in memory and you will see that there is a new object, `chicagoCrime`.
```{r comment="", results='hold'}
ls()
```
If you did not spell the name of the .RData file exactly correctly, then R will give you an error. A common occurrence when downloading the same file from the web multiple times is for your web browser to add numbers to the multiple versions you've downloaded. So check the file name carefully. Here's what happens when I request a file that doesn't exist.
```{r comment="", results=TRUE, warning=TRUE, error=TRUE}
load("chicago crime.RData")
```
If you get an error like this, then double check that the file name is spelled exactly right and that you have correctly set the working directory.
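
One way to do that double check from within R is sketched below. It assumes the file name used above. What it reports depends on your own working directory.

```r
file.exists("chicago crime 20141124-20141209.RData") # TRUE if R can see the file
list.files(pattern="RData")   # names of files in the working directory containing "RData"
```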

Once you successfully load in a dataset, we can begin to explore it. Let's check that this is indeed a dataset. You can use the `is()` function on any R object to ask it to identify itself.
```{r comment="", results='hold'}
is(chicagoCrime)
```
You can see that `chicagoCrime` is of type `data.frame`... and it is also of type `list`. That means that anything that you can do to lists, like `lapply()` and `sapply()`, you can use on `chicagoCrime` too. 
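
To see the connection between data frames and lists on a smaller scale, here is a sketch with a made-up three-row data frame called `toy`. It borrows two column names from the crime data, but the values are invented.

```r
# a made-up three-row data frame, not the Chicago data
toy <- data.frame(District=c(10,4,10),
                  Arrest=c(TRUE,FALSE,FALSE))
is.list(toy)         # TRUE, a data frame is a list of columns
sapply(toy, length)  # every column has the same length, 3
toy$District         # $ extracts a column just as it extracts a list component
```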

What are the names of the variables in the dataset?
```{r comment="", results='hold'}
names(chicagoCrime)
```
As expected, the data have information on the crime date, crime type, location (including latitude and longitude), whether an arrest occurred, and more.

Let's look at some parts of the dataset.
```{r comment="", results='markup'}
#   look at the first three rows
chicagoCrime[1:3,]
#   look at the first three rows and first three columns
chicagoCrime[1:3,1:3]
#   look up columns by name
chicagoCrime[1:3,c("Latitude","Longitude")]
```

Ask R what types of values each of the crime features contains.
```{r comment="", results='hold'}
#   look at the types of each variable
sapply(chicagoCrime, is)
```

That gives a lot of detailed information. Here's a trick to just get the first value for each one.
```{r comment="", results='hold'}
sapply(chicagoCrime, function(x) is(x)[1])
```

Use `table()` and `sort()` to see what kinds of crimes are in this dataset.
```{r comment="", results='hold'}
#   tabulate crimes
sort(table(chicagoCrime$Primary.Type))
sort(table(chicagoCrime$Description))
```
Note how we can use the `$` to extract just the `Primary.Type` and just the `Description` components of the dataset. 

Just using `chicagoCrime$District` will give us all the values in that column.
```{r comment="",results='hold'}
chicagoCrime$District
```

If we want only those rows for crimes which happen in Chicago's District 10:
```{r comment="",results='hold'}
chicagoCrime[chicagoCrime$District==10,]
```

If we want only those rows for crimes which happen in Chicago's District 10, but only look at the values in the column `Primary.Type`:
```{r comment="",results='hold'}
chicagoCrime$Primary.Type[chicagoCrime$District==10]
```
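It can help to see what the comparison inside the square brackets produces on its own. The sketch below uses two short made-up vectors (`district` and `crime` are hypothetical stand-ins for the real columns). It also shows what happens if the column being compared contains an `NA`; whether `District` has any missing values in this dataset is something you would need to check.

```r
# made-up values for illustration; the real columns may differ
district <- c(10, 8, NA, 10)
crime    <- c("THEFT", "ASSAULT", "BATTERY", "ROBBERY")

district == 10                 # a TRUE/FALSE value for each element, NA where unknown
crime[district == 10]          # the NA in the index produces an NA in the result
crime[which(district == 10)]   # which() keeps only the positions that are TRUE
```

R keeps the elements where the comparison is `TRUE`. Wrapping the comparison in `which()` is one common way to drop the elements where R cannot tell whether the condition holds.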

What kinds of crimes occur in Chicago's District 10?
```{r comment="", results='hold'}
sort(table(chicagoCrime$Primary.Type[chicagoCrime$District==10]))
```
All these `chicagoCrime$`s are making our code long and harder to read. But we need to tell R to look inside `chicagoCrime` to find `Primary.Type` and `District`. `with()` can greatly simplify R code. Tell R to sort the table as before, but tell R that it can find all of the variables it is looking for in the `chicagoCrime` data frame.
```{r comment="", results='hold'}
with(chicagoCrime, sort(table(Primary.Type[District==10])))
```
Much easier to read and understand!
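To check that `with()` really gives the same answer as the longer `$` version, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical).

```r
# a made-up data frame for illustration only
toy <- data.frame(Primary.Type = c("THEFT", "ASSAULT", "THEFT", "BATTERY"),
                  District     = c(10, 10, 10, 8))
# these two lines tabulate the same values
tab1 <- table(toy$Primary.Type[toy$District == 10])
tab2 <- with(toy, table(Primary.Type[District == 10]))
sort(tab2)
```

Inside `with()`, R looks for `Primary.Type` and `District` in `toy` first, so there is no need to repeat the name of the data frame.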

## Exercises
`r .exNum("Display three randomly selected rows")`
`r .exNum("Count \x60NA\x60s in each column")`
`r .exNum("Look up \x60Location.Description\x60, \x60Block\x60, \x60Beat\x60, and \x60Ward\x60 for those missing \x60Latitude\x60")`

# For loops
Sometimes we need to have R repeat certain tasks multiple times, such as marching through each row of a dataset and modifying values. For loops accomplish this. Later in this course we will be using Google Maps to extract information about addresses. So we might need to iterate through every row in the dataset, check whether the latitude and longitude are missing, and if missing try to retrieve the latitude and longitude from Google Maps. The last crime in the dataset missing coordinates is in row 9954.
```{r comment="", results='hold'}
chicagoCrime[9954,]
```
While the coordinates are missing, the street address, 081XX S THROOP ST, is (mostly) there. Chicago PD has masked the last two digits of the address so that we really only know the location down to the nearest block. Let's look up 8150 S Throop St, likely near the middle of the block, to see where this is. The Google Maps URL is [https://www.google.com/maps/place/8150+S+Throop+St,+Chicago,+IL](https://www.google.com/maps/place/8150+S+Throop+St,+Chicago,+IL). It would be a pain to have to type out each of these URLs for every address that we wanted to look up. So let's learn a little bit about for loops to see how this might work.

Here is a basic for loop that runs through the numbers 1 to 10 and prints them out one at a time.
```{r comment="", results='hold'}
for(i in 1:10)
{
   print(i)
}
```
Note the basic structure. There's the keyword `for`. Inside the parentheses is a variable `i` (but you can use any variable name you want), the keyword `in`, and finally a collection of values, in this case the numbers 1 to 10. The for loop will march through this collection of values, assigning `i` each value in turn, and running the code inside the squiggly braces. So first `i` will be set to 1 and the `print()` function will print the value 1 to the screen. When that is done, `i` will take the next value in the collection, a 2, and the for loop will run the `print()` function again, which will print the number 2. This continues until `i` takes the value 10 and `print()` prints that 10 to the screen.

Let's loop through all the states, printing out which number they are in the collection along with the state postal code.
```{r comment="", results='hold'}
for(i.state in 1:length(state.names))
{
   print(c(i.state,state.names[i.state]))
}
```
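A common alternative to `1:length()` for building the loop index is `seq_along()`. The sketch below uses a made-up collection called `regions` (hypothetical, for illustration) and then shows the one situation where the two approaches differ: an empty collection.

```r
# seq_along() generates the positions 1, 2, 3, ... of a collection
regions <- c("east", "central", "west")   # made-up collection for illustration
for(i.region in seq_along(regions))
{
   cat(i.region, regions[i.region], "\n")
}

# the two approaches differ when the collection is empty
nothing <- character(0)
1:length(nothing)      # the values 1 and 0, so a loop would run twice by mistake
seq_along(nothing)     # an empty collection, so a loop would not run at all
```

For the collections in these notes either approach gives the same result, so this is a matter of defensive habit rather than a correction.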
Let's loop through all the letters of the alphabet and see if that letter is in the word "CRIME". `cat()` is like `print()`, but just dumps to the screen exactly what you give it^[Why "cat" you ask? Programmers in the early 1970s created a program called "cat" to concatenate files together, but most uses of "cat" were to just dump file contents to the screen or to some other program.]. `print()` will do some formatting to try to present the results a little nicer.
```{r comment="", results='hold'}
for(letter in c("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O",
                "P","Q","R","S","T","U","V","W","X","Y","Z"))
{
   print(letter)
   if(letter %in% c("C","R","I","M","E"))
      cat("The letter",letter,"is in the word 'CRIME'\n")
}
```
Actually, R has a built in collection, `LETTERS`, that contains all of the capital letters. There really was no need to type them all out. This works too.
```{r comment="", results='hold'}
for(letter in LETTERS)
{
   print(letter)
   if(letter %in% c("C","R","I","M","E"))
      cat("The letter",letter,"is in the word 'CRIME'\n")
}
```
Let's loop through the states and check whether each one is an island or not.
```{r comment="", results='hide', echo=FALSE}
is.island <- function(x)
{
  islands <- c("HI","FM","MH","PW","AS","GU","MP","PR","VI","UM")
  return(x %in% islands)
}
```

```{r comment="", results='hold'}
for(nm.state in state.names)
{
   print(nm.state)
   if(is.island(nm.state))
      cat(nm.state,"is an island\n")
}

```
Let's get back to our original problem of having R construct all the Google Maps URLs that we need. First, we will create a new variable in the dataset called `google.maps.url` and fill it with empty text.
```{r comment="", results='hold'}
chicagoCrime$google.maps.url <- ""
```

Now let's loop through all 10,000 rows in the dataset. First, R will use `gsub()` to replace the XX in the house number with 50, so we get the location in the middle of the block. `gsub()` is like a Find-and-Replace function, but way more powerful and flexible. We will use it extensively when covering regular expressions. After fixing the house number, we use `paste()` to assemble a URL suitable for looking up addresses on Google Maps.
```{r comment="", results='hold'}
time4ForLoop <- system.time(  # system.time() is like a stop watch
for(i in 1:nrow(chicagoCrime))
{
   a <- gsub("XX", "50", chicagoCrime$Block[i])
   chicagoCrime$google.maps.url[i] <- paste("https://www.google.com/maps/place/",
                                            a,
                                            ",+Chicago,+IL",sep="")
}
)
```
Note that we've wrapped the for loop with a call to `system.time()`. This records how long the for loop takes to run. When creating these notes on a laptop it took `r time4ForLoop[3]` seconds. Not bad. Much faster than having to type out these 10,000 URLs. However, if we had one million addresses, then this code would take much more time.
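To see what `system.time()` returns, here is a minimal sketch that times a small made-up task (the loop and the names `tm` and `x` are only for illustration). The times you get will differ from machine to machine and from run to run.

```r
# time a small made-up task; the numbers will vary by machine
tm <- system.time(
for(i in 1:100000)
{
   x <- sqrt(i)
}
)
tm              # user, system, and elapsed times in seconds
tm["elapsed"]   # the same value as tm[3]
```

The third element, the elapsed time, is the "stop watch" time from start to finish, which is why the notes report element `[3]`.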

In fact, in R, for loops are often *very* slow compared with functions that operate on a whole collection of values at once. They are so slow that R programmers attempt to avoid them whenever possible. We can actually accomplish the same task without using a for loop. `gsub()` will accept a whole collection of addresses and modify them all at once. `paste()` also will accept a collection of text values and paste them together with the other parts.
```{r comment="", results='hold'}
timeWithoutForLoop <- system.time(
{
a <- gsub("XX","50",chicagoCrime$Block)
chicagoCrime$google.maps.url <- paste("https://www.google.com/maps/place/",
                                      a,
                                      ",+Chicago,+IL",sep="")
}
)
```
This took `r timeWithoutForLoop[3]` seconds. That's `r round(time4ForLoop[3]/timeWithoutForLoop[3],1)` times faster than the for loop.
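To make the "all at once" behavior concrete, the sketch below applies `gsub()` and `paste()` to three made-up block addresses written in the same masked style as the `Block` column (only the first is taken from these notes; the other two are hypothetical).

```r
# made-up block addresses in the masked style of the Block column
blocks <- c("081XX S THROOP ST", "012XX W MADISON ST", "005XX N STATE ST")
fixed  <- gsub("XX", "50", blocks)   # one call changes every element
fixed
# paste() recycles the fixed text pieces for every address
urls <- paste("https://www.google.com/maps/place/", fixed,
              ",+Chicago,+IL", sep="")
urls
```

Note that the spaces in the address are left as they are. If needed, `gsub(" ", "+", fixed)` would be one way to replace them with plus signs.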

## Exercises
`r .exNum('Use a for loop to create a variable \x60Coordinates\x60 that looks like "(X.Coordinate,Y.Coordinate)"')`
     + Use `paste()` with the `X.Coordinate` and `Y.Coordinate` variables
     + Remember the `sep=` option in `paste()`
     + You might find the `with()` function useful for simplifying your code and avoiding a lot of `chicagoCrime$`s
`r .exNum("Redo the previous exercise without using a for loop and compare computation time")`

# More tabulating, aggregating, and breaking statistics down by group
The variable `Arrest` indicates whether someone was arrested for the crime. Here are the first 10 values.
```{r comment="", results='hold'}
chicagoCrime$Arrest[1:10]
```
We can compute the fraction of crimes with an arrest (multiply by 100 to get a percentage) by calculating how often on average `Arrest=="true"`.
```{r comment="", results='hold'}
mean(chicagoCrime$Arrest=="true")
```
The `aggregate()` function will do this same calculation, but has options for breaking it down by some other crime feature. Let's use `aggregate()` to compute the percentage of crimes with an arrest by ward. We store the result in `a`.
```{r comment="", results='hold'}
a <- aggregate((Arrest=="true")~Ward, data=chicagoCrime, mean)
a
```
The first part of `aggregate()` gives an R formula for how we want the data broken up. On the left of the `~` is the outcome or feature that we want to study. Here it is whether or not `Arrest` has value true. To the right of the `~` is the feature by which we want to break down the arrests, ward in this case. Then we need to tell `aggregate()` in which data frame it can find `Arrest` and `Ward`. Lastly, we need to tell `aggregate()` what to do with the outcome we are studying. Here we are asking `aggregate()` to compute the mean so that we get an arrest percentage.

As a result, we have a data frame with two columns. In the left column, we have the ward number. In the right column, we have the fraction of crimes that result in an arrest: `Arrest=="true"`.
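A tiny made-up example can make the `aggregate()` arithmetic easy to verify by hand. In this sketch `toy` is a hypothetical data frame with two wards; ward 1 has one arrest out of two crimes and ward 2 has two arrests out of three.

```r
# a made-up data frame for illustration only
toy <- data.frame(Ward   = c(1, 1, 2, 2, 2),
                  Arrest = c("true", "false", "true", "true", "false"))
res <- aggregate((Arrest=="true")~Ward, data=toy, mean)
res          # ward 1: 1 of 2 arrested; ward 2: 2 of 3 arrested
# tapply() computes the same fractions, returned as a named vector
with(toy, tapply(Arrest=="true", Ward, mean))
```

Taking the mean of a `TRUE`/`FALSE` collection works because R treats `TRUE` as 1 and `FALSE` as 0, so the average is the fraction of `TRUE`s.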

We can use `barplot()` to compare arrest percentages by ward.
```{r comment="", fig.width=6.5}
barplot(a$`(Arrest == "true")`, 
        names.arg = a$Ward,
        cex.names = 0.5,
        ylab      = "Fraction arrested",
        xlab      = "Ward")
```

Note that the column in `a` containing the arrest fraction has a complicated name with several special symbols like `==` and `"`. R will get very confused unless we "protect" this variable name with the backquotes (also called backticks). You can visit the help for `barplot()` with `?barplot` to learn what all the arguments do.
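One way to sidestep the backquotes, sketched here on a made-up data frame, is to store the `TRUE`/`FALSE` outcome in its own column before calling `aggregate()`. The names `toy` and `arrested` are hypothetical and chosen only for this illustration.

```r
# a made-up data frame for illustration only
toy <- data.frame(Ward   = c(1, 1, 2, 2, 2),
                  Arrest = c("true", "false", "true", "true", "false"))
toy$arrested <- toy$Arrest=="true"     # hypothetical helper column
res <- aggregate(arrested~Ward, data=toy, mean)
names(res)       # the result now has plain column names
res$arrested     # no backquotes needed
```

This costs one extra line of code, but the resulting column name is easier to type in later calls such as `barplot()`.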

Frequently we will focus on just a subset of the data. For example, we might just want to study assaults rather than all crimes. The `subset()` function does this for us, as in `subset(data, Primary.Type=="ASSAULT")`. This is particularly useful in combination with `with()`. Let's create a table of the number of arrests by ward, but only for assaults.
```{r comment="", results='hold'}
with(subset(chicagoCrime,Primary.Type=="ASSAULT"), 
     table(Arrest,Ward))
```
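To see the two steps separately, the sketch below first subsets a made-up data frame (`toy` is hypothetical, with values chosen for illustration) and then tabulates the rows that remain.

```r
# a made-up data frame for illustration only
toy <- data.frame(Primary.Type = c("ASSAULT", "THEFT", "ASSAULT", "ASSAULT"),
                  Arrest       = c("true", "false", "false", "true"),
                  Ward         = c(1, 1, 2, 2))
assaults <- subset(toy, Primary.Type=="ASSAULT")   # keep only the assault rows
nrow(assaults)
tab <- with(assaults, table(Arrest, Ward))         # arrests by ward among assaults
tab
```

The theft in ward 1 never reaches `table()`, so the counts describe assaults only.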

Let's recreate our barplot, but now just using assaults.
```{r comment="", fig.width=6.5}
a <- aggregate((Arrest=="true")~Ward, 
               data=subset(chicagoCrime,Primary.Type=="ASSAULT"), 
               mean)
barplot(a$`(Arrest == "true")`, 
        names.arg = a$Ward,
        cex.names = 0.5,
        ylab      = "Fraction arrested",
        xlab      = "Ward",
        main      = "Arrest fraction for assaults")
```

## Exercises
`r .exNum('How many assaults occurred in the street? (\x60Location.Description=="STREET"\x60)')`. Try using `subset()` even though there are other ways.
`r .exNum("What percentage of assaults occurred in the street by Ward?")`

# Plotting Data

R enables us to plot points. Plotting the latitude and longitude of each crime produces points that form the shape of Chicago... which makes total sense because we're using Chicago crime data. 
```{r comment="", fig.width=6.5}
plot(Latitude~Longitude, data=chicagoCrime)
```

The `plot()` function here uses the same R formula syntax as the `aggregate()` function. The variable on the left of `~` is the outcome, plotted on the y-axis, and the variable on the right appears on the x-axis. And, of course, we need to tell `plot()` that it can find these variables inside the `chicagoCrime` data frame.

Let's plot the district with the most crime. The first line here tabulates how many crimes occurred in each district, sorts those counts, reverses the sorted list so that the largest one comes first, extracts the first one in the collection using `[1]` and then uses `names()` to extract the name of the district (rather than how many crimes occurred in that district). You can see all of District 8's crimes (that's the district with the most crimes) appearing as red points in the plot.
```{r comment="", fig.width=6.5}
# selects district 8, with 713 crimes
max.district <- names(rev(sort(table(chicagoCrime$District)))[1]) 
plot(Latitude~Longitude,
     data=subset(chicagoCrime, District!=max.district),   # not in District 8
     pch=".",                                             # plot with tiny dot
     xlab="Longitude",ylab="Latitude")
points(Latitude~Longitude,                              
       data=subset(chicagoCrime, District==max.district), # in District 8
       pch=".",
       col="red")
```
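The nested call that finds the busiest district can be easier to follow one step at a time. This sketch uses a short made-up collection of district numbers (`district` and `counts` are hypothetical names for illustration).

```r
district <- c(8, 8, 8, 4, 4, 11)     # made-up district numbers
counts <- table(district)            # how many times each value appears
sort(counts)                         # smallest count first
rev(sort(counts))                    # largest count first
names(rev(sort(counts))[1])          # the label attached to the largest count
names(which.max(counts))             # a shorter route to the same label
```

Note that `names()` returns the district as text, `"8"`, rather than as the number 8.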

R tries to set up default graphics settings so that most plots look okay, but sometimes it takes a little more work to adjust them. The good thing is that R lets you adjust everything. So let's make a barplot of the number of crimes of each type.
```{r comment="", fig.width=6.5}
barplot(table(chicagoCrime$Primary.Type))
```

The labels on the bars are so long that only a few of them appear. So let's spend a little more time, write a few more lines of R code, and make this plot look right.

```{r comment="", fig.width=6.5}
tab <- table(chicagoCrime$Primary.Type)   # tabulate crime counts
# give 2.5in on the left margin to give lots of space for the crime type labels
par(pin=c(6.5,6),                         # set plot dimensions (inches)
    mai=c(1.02, 2.5, 0, 0.3))             # set plot margins 
a <- barplot(tab,
             col="salmon",                # change the bars' color
             horiz=TRUE,                  # make the bars horizontal 
             names.arg=rep("",nrow(tab)), # put no labels on the bars
             xlab="Number of crimes")
# add the bar labels on the y-axis
axis(2,                                   # set up the y-axis label (axis #2)
     at=a[,1],                            # midpoints of bars stored in a[,1]
     cex.axis=0.7,                        # shrink the axis text size by 30%
     labels=names(tab),                   # the bar labels
     las=1,                               # make labels horizontal (see ?par)
     tick=FALSE)                          # no tick marks on the axis
# add the actual number on the bars
text(ifelse(tab<80, 180, tab-5),          # x-coord of text, 
                                          #   if bar too small, put text to right
     a[,1],                               # y-coord of text,  midpoint of bars 
     tab,                                 # text to add to the plot
     cex=0.7,                             # shrink text (cex=character expansion)
     adj=1)                               # right justify text
```

## Exercises
`r .exNum("Make a barplot indicating how many states are in each region. Use \x60state.list\x60")`
`r .exNum("Identify the beat with the most crimes")`
`r .exNum("Identify the beat with the most domestic violence incidents")`
`r .exNum("Part 1 crimes are homicide, robbery, assault, arson, burglary, theft, sex offense, motor vehicle theft. Calculate the number of Part 1 crimes in Chicago")`

# Solutions to the exercises 
1. `r .exerciseQuestions[1]`
```{r comment=""}
(1:49)*2
```
or
```{r comment=""}
seq(2,98,by=2)
```

2. `r .exerciseQuestions[2]`
```{r comment=""}
mean((1:49)*2)
```

3. `r .exerciseQuestions[3]`
```{r comment=""}
sort(c("WA","DC","CA","PA","MD","VA","OH"))
```

4. `r .exerciseQuestions[4]`
```{r comment=""}
state.names[51]
```

5. `r .exerciseQuestions[5]`
```{r comment=""}
state.names[c(7,8,21,24,28,32,35,46)]
```
or sort first so that all the M states are together
```{r comment=""}
sort(state.names)[20:27]
```
Here's another possible answer that uses `substring` (which we haven't covered yet):
```{r comment=""}
state.names[substring(state.names, 1, 1)=="M"]
```

6. `r .exerciseQuestions[6]`
Of course, these may vary depending on where you have lived.
```{r comment=""}
state.names[c(1, 4, 10, 26)]
```

7. `r .exerciseQuestions[7]`
```{r comment=""}
sort(state.names)[51]
```
or
```{r comment=""}
rev(sort(state.names))[1]
```

8. `r .exerciseQuestions[8]`
```{r comment=""}
rev(sort(state.names))[1:3]
```

9. `r .exerciseQuestions[9]`
```{r comment=""}
my.states <- c("PA", "NJ", "NY", "MD", "DE", "MA", "RI", "CT", "ME", "LA", "IN")
state.names %in% my.states
```

10. `r .exerciseQuestions[10]`
```{r comment=""}
a <- 1:100
a[a %% 2==1 & a>50 & a<75]
```

11. `r .exerciseQuestions[11]`
```{r comment=""}
state.names[state.names>"LZ" & state.names<"N"]
```

12. `r .exerciseQuestions[12]`
```{r comment=""}
a <- sample(1:6, size=100000, replace=TRUE)
table(a)[6]/length(a)
```
Or
```{r comment=""}
sum(a==6)/length(a)
```
Or
```{r comment=""}
mean(a==6)
```

13. `r .exerciseQuestions[13]`
```{r comment=""}
dice1 <- sample(1:6, size=1000, replace=TRUE)
dice2 <- sample(1:6, size=1000, replace=TRUE)
doubleroll <- dice1 + dice2
mean(doubleroll==7)   # should be close to 1/6 or 0.1666...
```

14. `r .exerciseQuestions[14]` (Answers will vary)
```{r comment=""}
sample(state.names, size=5, replace=FALSE)
```

15. `r .exerciseQuestions[15]`
   + Tabulate how often each state was selected (Answers will vary)
```{r comment=""}
a <- sample(state.names, size=1000, replace=TRUE)
table(a)
```

   + Which state was selected the least? (Answers will vary)
```{r comment=""}
sort(table(a))[1]
```

16. `r .exerciseQuestions[16]`
```{r comment=""}
state.list$east <- state.list$east[state.list$east!="DC"]
state.list$other <- c(state.list$other, "DC")
state.list
```
Or
```{r comment=""}
state.list$east <- setdiff(state.list$east, "DC")
state.list$other <- c(state.list$other, "DC")
state.list
```

17. `r .exerciseQuestions[17]`
```{r comment=""}
sort(c(state.list$east, state.list$central))
```
Or
```{r comment=""}
with(state.list, sort(c(east, central)))
```

18. `r .exerciseQuestions[18]`
```{r comment=""}
is.island <- function(x)
{
   return(x %in% c("HI", "FM", "MH", "PW", "AS", "GU", "MP", "PR", "VI", "UM"))
}
```

19. `r .exerciseQuestions[19]`

First, this `lapply()` asks each state if they are an island.
```{r comment=""}
lapply(state.list, is.island)
```
Now we want to count up how many `TRUE`s there are in each component, so wrap this `lapply()` with an `sapply()`
```{r comment=""}
sapply(lapply(state.list, is.island), sum)
```

20. `r .exerciseQuestions[20]`
```{r comment=""}
b <- list(0:9, c("A","B","C"), c(TRUE,FALSE,NA))
sapply(lapply(b, is.na), any)
```
Or
```{r comment=""}
sapply(b, function(x) any(is.na(x)))
```

21. `r .exerciseQuestions[21]`
```{r comment=""}
chicagoCrime[sample(1:nrow(chicagoCrime), size=3),]
```

22. `r .exerciseQuestions[22]`
```{r comment=""}
sapply(lapply(chicagoCrime, is.na), sum)
```
Or
```{r comment=""}
sapply(chicagoCrime, function(x) sum(is.na(x)))
```

23. `r .exerciseQuestions[23]`
```{r comment=""}
i <- is.na(chicagoCrime$Latitude)
# Let's just show the first 5 rows
i <- which(i)[1:5]
chicagoCrime[i,c("Location.Description","Block","Beat","Ward")]
```
Or
```{r comment=""}
subset(chicagoCrime, is.na(Latitude),
       select=c("Location.Description","Block","Beat","Ward"))[1:5,]
```

24. `r .exerciseQuestions[24]`
```{r comment=""}
system.time(
for (i in 1:nrow(chicagoCrime))
{
   chicagoCrime$coords[i] <- paste0("(", chicagoCrime$X.Coordinate[i], ",", 
                                    chicagoCrime$Y.Coordinate[i], ")")
}
)
```
Or
```{r comment=""}
system.time(
for (i in 1:nrow(chicagoCrime)) 
{
   chicagoCrime$coords2[i] <- with(chicagoCrime, 
                                   paste("(",X.Coordinate[i], ",",
                                             Y.Coordinate[i],")",sep=""))
}
)
```

25. `r .exerciseQuestions[25]`
```{r comment=""}
system.time(
chicagoCrime$coords3 <- with(chicagoCrime, 
                             paste0("(", X.Coordinate, ",",Y.Coordinate,")"))
)
```

26. `r .exerciseQuestions[26]`
```{r comment=""}
with(subset(chicagoCrime, Primary.Type=="ASSAULT"), 
     sum(Location.Description=="STREET"))
```

27. `r .exerciseQuestions[27]`
```{r comment=""}
aggregate((Location.Description=="STREET")~Ward,
          data=subset(chicagoCrime, Primary.Type=="ASSAULT"),
          mean)
```

28. `r .exerciseQuestions[28]`
```{r comment=""}
barplot(sapply(state.list, length))
```

29. `r .exerciseQuestions[29]`
```{r comment=""}
names(rev(sort(table(chicagoCrime$Beat)))[1])
```
Or
```{r comment=""}
names(which.max(table(chicagoCrime$Beat)))
```

30. `r .exerciseQuestions[30]`
```{r comment=""}
with(subset(chicagoCrime, Description=="DOMESTIC BATTERY SIMPLE"),
     names(which.max(table(Beat))))
```

31. `r .exerciseQuestions[31]`
```{r comment=""}
sum(chicagoCrime$Primary.Type %in% c("HOMICIDE", "ROBBERY", "ASSAULT", "ARSON", 
                                     "BURGLARY", "THEFT", "SEX OFFENSE", 
                                     "MOTOR VEHICLE THEFT"))
```