-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy path04 Working with NCVS data.Rmd
330 lines (263 loc) · 17 KB
/
04 Working with NCVS data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
---
title: "Working with National Crime Victimization Survey Data"
author:
- affiliation: University of Pennsylvania
email: [email protected]
name: Greg Ridgeway
- affiliation: University of Pennsylvania
email: [email protected]
name: Ruth Moyer
- affiliation: University of Pennsylvania
email: [email protected]
name: Li Sian Goh
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
css: htmlstyle.css
---
<!-- HTML YAML header Ctrl-Shift-C to comment/uncomment -->
<!-- --- -->
<!-- title: "Working with National Crime Victimization Survey Data" -->
<!-- author: -->
<!-- - Greg Ridgeway ([email protected]) -->
<!-- - Ruth Moyer ([email protected]) -->
<!-- date: "`r format(Sys.time(), '%B %d, %Y')`" -->
<!-- output: -->
<!-- pdf_document: -->
<!-- latex_engine: pdflatex -->
<!-- html_document: default -->
<!-- fontsize: 11pt -->
<!-- fontfamily: mathpazo -->
<!-- --- -->
<!-- PDF YAML header Ctrl-Shift-C to comment/uncomment -->
<!-- A function for automating the numbering and wording of the exercise questions -->
```{r echo=FALSE}
.counterExercise <- 0
.exerciseQuestions <- NULL
.exNum <- function(.questionText="")
{
.counterExercise <<- .counterExercise+1
.exerciseQuestions <<- c(.exerciseQuestions, .questionText)
.questionText <- gsub("@@", "`", .questionText)
return(paste0(.counterExercise,". ",.questionText))
}
```
# Introduction
Through our work with the UCR, we've already discussed reported crime. Nonetheless, not all crimes are reported to the police. Also, sometimes the UCR doesn't provide us with specific information about a victim-involved crime incident such as whether the victim knew the offenders or the location of the crime incident.
Each year, the U.S. Census Bureau conducts the National Crime Victimization Survey (NCVS), which is a valuable source of self-reported victimization data. The Census Bureau interviews a sample of people about the number and characteristics of crime victimizations they experienced during the prior 6 months. In 2015, for example, they collected data from 95,760 households and 163,880 persons.
The NCVS contains valuable information about nonfatal personal crimes such as rape or robbery as well as property crimes such as burglary. Additional information about the NCVS can be found at the [BJS website](https://www.bjs.gov/index.cfm?ty=dcdetail&iid=245). To give a sense of the type of data that the NCVS contains, refer to the [Official 2012-2013 BJS Crime Victimization report](http://www.bjs.gov/content/pub/pdf/cv12.pdf).
# Acquiring the NCVS data
The University of Michigan consolidates the NCVS data into a format that is easily accessible in R. We will be using 2012 and 2013 NCVS data.
First, we will download the NCVS 2012 data, [ICPSR 34650](http://www.icpsr.umich.edu/icpsrweb/NACJD/studies/34650). We will need to download the following files, DS1, DS2, DS3, DS4, and DS5 in R format. Also, download DS0, the Codebook (which is in PDF format). We will refer to the codebook frequently. As for the DS1, DS2, DS3, DS4, and DS5 files, we are interested in the .rda files.
Next, downoad the NCVS 2013 data, [ICPSR 35164](http://www.icpsr.umich.edu/icpsrweb/NACJD/studies/35164). Same drill as above - retrieve DS1, DS2, DS3, DS4, and DS5 in R format.
All told you should have ten .rda files, and one PDF codebook. The codebook is extremely important for understanding what the variable names stand for, and you should become familiar with it as soon as you can. For now, we won't be using the DS5 files that much. Also, the file names are admittedly a bit unwieldy with all the numbers so it might be a good idea to change the names to something that will help you quickly distinguish among all the files. We've created subfolders called NCVS2012 and NCVS2013 that contains the files extracted from the data download. Here are the files we have in our NCVS2012 and NCVS2013 subfolders.
```{r comment=""}
list.files("NCVS2012/",recursive = TRUE)
list.files("NCVS2013/",recursive = TRUE)
```
Let's see what's in these .rda files. The DS1s for both 2012 and 2013 are the address record-type files. First, 2012:
```{r comment="", results='hold'}
load("NCVS2012/DS0001/34650-0001-Data.rda")
ls()
head(da34650.0001)
```
As you can see, the DS1 for 2012 contains a unique identifer for each interviewed household. Let's load the address record-type file for 2013.
```{r comment="", results='hold'}
load("NCVS2013/DS0001/35164-0001-Data.rda")
```
Let's give these address record-type files for 2012 and 2013 more useful names.
```{r comment="", results='hold'}
dataAddr12 <- da34650.0001
dataAddr13 <- da35164.0001
```
By contrast, DS2 contains household information. Let's load the household data and give them more useful names.
```{r comment="", results='hold'}
load("NCVS2012/DS0002/34650-0002-Data.rda")
load("NCVS2013/DS0002/35164-0002-Data.rda")
dataHH12 <- da34650.0002
dataHH13 <- da35164.0002
```
The DS3 files contain person specific information whereas the DS4 files provide incident information. Let's load them and give them useful names.
```{r comment="", results='hold'}
load("NCVS2012/DS0003/34650-0003-Data.rda")
load("NCVS2013/DS0003/35164-0003-Data.rda")
dataPers12 <- da34650.0003
dataPers13 <- da35164.0003
load("NCVS2012/DS0004/34650-0004-Data.rda")
load("NCVS2013/DS0004/35164-0004-Data.rda")
dataInc12 <- da34650.0004
dataInc13 <- da35164.0004
```
Now that we've loaded and renamed all the files we'll need, we can remove objects from our working environment that we no longer need. We can use `rm()` to accomplish this:
```{r comment="", results='hold'}
rm(da34650.0001,da34650.0002,da34650.0003,da34650.0004,
da35164.0001,da35164.0002,da35164.0003,da35164.0004)
```
Let's examine in a bit more detail the first three rows of the person file. The dataset contains `r ncol(dataPers12)` columns so we will just show the first 40 columns here. Note IDHH (household ID), IDPER (person ID), and the relationship between the first two rows. Also, note that V3077 (Variable #3077) refers to who responded to the survey.
```{r comment="", results='hold'}
dataPers12[1:3, 1:40]
```
Let's examine the corresponding household information. This dataset also has a lot of features so we will just show here the first 53 of `r ncol(dataHH12)` columns.
```{r comment="", results='hold'}
subset(dataHH12, IDHH=="2501017260961929294229224")[,1:53]
```
And the corresponding incident file (just the first 43 of `r ncol(dataInc12)` columns):
```{r comment="", results='hold'}
dataInc12[1:3, 1:43]
```
Let's look at the month and year of crime incident variables
```{r comment="", results='hold'}
with(dataInc12, table(V4014,V4015))
with(dataInc13, table(V4014,V4015))
```
# Creating a dataframe and weights with NCVS incident data
Next, we can create a 2012 incident dataframe. Importantly, the 2012 data contain incidents that occurred in 2012 as well as 2011 but were all self-reported to the Census Bureau in 2012. Likewise, the 2013 data contain incidents that occurred in 2012 as well as 2013. If we wanted to analyze crime that occurred in only 2012, we'd subset the data to include only 2012. We will combine the 2012 and 2013 incident dataframes and then subset this new dataframe so that we exclude 2011 and 2013. As we can see in the Codebook PDF, the variable V4015 refers to the year of occurrence. (Helpful hint: the numbering of the variables correlate to the numbering of the dataframe. The incident-level file is DS4. Many of the variables in DS4 are V4XXX.)
`rbind` binds rows. This is good for when the columns in two datasets are exactly the same.
```{r comment="", results='hold'}
dataInc <- rbind(dataInc12,dataInc13)
table(dataInc$V4015) # year crime occured
dataInc <- subset(dataInc, V4015==2012)
```
We will also want to exclude crime that happens outside the United States or crimes for which we do not know the location (NA). According to the Codebook, V4022 refers to location.
```{r comment="", results='hold'}
dataInc <- subset(dataInc, (V4022!="(1) Outside U.S.") | is.na(V4022))
```
A lot of crimes happen in a series. The BJS convention is to include up to 10 occurrences in a series crime.
```{r comment="", results='hold'}
i <- with(dataInc, which((V4019=="(2) No (is series)") & (V4016>=11) & (V4016<=996)))
dataInc$V4016[i] <- 10
dataInc$V4016[dataInc$V4016>=997] <- NA
```
Also, BJS analyses of NCVS data generally use weights because NCVS is survey data. We want to weight the survey data so that they are representative of the wider U.S. population! There are three NCVS weight categories: household, personal, and incident.
For more information about NCVS weights, consult the section on Weighting Information found at this ICPSR resource guide to the NCVS: (https://www.icpsr.umich.edu/icpsrweb/NACJD/NCVS/accuracy.jsp).
To that extent, let's update the weight for series crimes and create a "date year" weight.
```{r comment="", results='hold'}
i <- which(dataInc$V4019=="(2) No (is series)")
dataInc$WGTVICDY <- dataInc$WGTVICCY
dataInc$WGTVICDY[i] <- with(dataInc, WGTVICDY[i] * V4016[i])
```
We can also tabulate total weight by crime type to estimate the count of a crime. As the Codebook instructs, `V4529` is the variable for crime type.
```{r comment="", results='hold'}
aggregate(WGTVICDY~V4529, data=dataInc, sum)
```
As you can see, there are some irregularities with the coding of crime types. Sometimes a type is coded as "(01)", but other times it is coded as "(1)". Let's standardize this coding using regular expressions.
```{r comment="", results='hold'}
dataInc$V4529 <- gsub("\\(([1-9])\\)", "(0\\1)", dataInc$V4529)
aggregate(WGTVICDY~V4529, data=dataInc, sum)
```
Now, we can use the NCVS incident data to find out how many car thefts occurred in 2012.
```{r comment="", results='hold'}
with(subset(dataInc, V4529=="(40) Motor veh theft"),
sum(WGTVICDY))
```
Also, note that the [definition of rape changed](http://www.fbi.gov/about-us/cjis/cjis-link/march-2012/ucr-program-changes-definition-of-rape) in 2013.
```{r comment="", results='hold'}
with(subset(dataInc,V4529=="(01) Completed rape"),
sum(WGTVICDY))
```
# Merging in data from the household and person data
So far, we've created a dataframe and worked with weights for the Incident data. However, the Household and Person Data have data that we might need. Let's first create a 2012 data year household data frame, much like we did with the incident data. Note that `YEARQ` refers to the year and quarter of the interview. The variable `V2130` is the month allocated from panel/rotation number. The panel/rotation number refer to [the process](https://www.bjs.gov/content/pub/pdf/ncvs_methodology.pdf) through which interviews are conducted.
```{r comment="", results='hold'}
dataHH <- rbind(dataHH12,dataHH13)
dataHH <- subset(dataHH, YEARQ>=2012.1 & YEARQ<=2013.2)
```
Let's make the "month allocated" uniform, and using regular expressions, delete "0s" following parentheses.
```{r comment="", results='hold'}
table(dataHH$V2130)
dataHH$V2130 <- gsub("\\(0", "\\(", dataHH$V2130)
table(dataHH$V2130)
```
When you view the table again, you can see that the original 21 months listed were condensed into 12.
Next, create a 2012 data year person data frame. We need to first fix incompatible factor/numeric in 2012/2013. The factor levels in 2012 look like "(1) Yes", but in 2013 are just "1."
```{r comment="", results='hold'}
i <- sapply(dataPers12,levels) #gives factor levels for each variable
i <- i[!sapply(i,is.null)] #gives factor levels for each factor variable
#for non-factor variables, i returns a null result
i <- sapply(i, function(x) all(substring(x,1,1)=="(")) #store in i those variables where the first #character of factor levels is "(""
var.fix <- names(i)[i] #this gives us the name of variables
#where factor levels begin with "(".
for(xj in var.fix) #create a for-loop to fix these variable names. for each value "xj" in var.fix
{
dataPers12[,xj] <- gsub("\\(([0-9]+)\\).*","\\1",dataPers12[,xj]) #remove the words that follow the #numbers in parentheses
dataPers12[,xj] <- as.numeric(dataPers12[,xj]) #convert the numbers in parentheses to just numeric #variables, which present as numbers w/o parentheses
}
```
Then, stack the 2012 and 2013 data frames using `rbind()`.
```{r comment="", results='hold'}
dataPers <- rbind(dataPers12, dataPers13)
dataPers <- subset(dataPers, YEARQ>=2012.1 & YEARQ<=2013.2)
```
Now that we've created a person dataframe and an incident dataframe, we can merge them together. We will use `merge()` to pull age, marital status, and sex into the incident data. The `merge()` function has several parameters that communicate to R which features should be used to match and which ones should be merged. Here we tell `merge()` to use use a pair of features from the incident data (`IDPER` and `YEARQ`) and look up a row in `dataPers` with the same values of `IDPER` and `YEARQ`. We've selected only the five columns `IDPER`, `YEARQ`, `V3014`, `V3015`, and `V3018` from `dataPers`. The first two `merge()` uses to identify matching rows and the last three will be attached as new columns to `dataInc`.
```{r comment=""}
a <- merge(dataInc, # incident data
dataPers[,c("IDPER","YEARQ", # IDPER & YEARQ unique IDs of person
"V3014", # age
"V3015", # marital status
"V3018")], # sex
by=c("IDPER","YEARQ"), # variables used to merge
all.x=TRUE) # keep all incidents, even if not matched
# a should have the same number of rows as dataInc, but 3 additional new columns
dim(dataInc)
dim(a)
# replace dataInc with a, now containing age, marital, and sex
dataInc <- a
# check merge for first incident
dataInc[1,c("IDPER","YEARQ","V3014","V3015","V3018")]
# check dataPers for this person's age, marital, and sex
subset(dataPers, IDPER=="250105121075958229372843501" & YEARQ==2012.3,
select = c("IDPER","YEARQ","V3014","V3015","V3018"))
```
We can see that the first row of `dataInc` now has three additional columns, and that they have the correct values merged from the `dataPers` data.
Let's give these new columns better names.
```{r comment="", results='hold'}
names(dataInc)[names(dataInc)=="V3014"] <- "age"
names(dataInc)[names(dataInc)=="V3015"] <- "marital"
names(dataInc)[names(dataInc)=="V3018"] <- "sex"
```
Let's also create a new variable that breaks age into age categories.
```{r comment="", results='hold'}
dataInc$ageGroup <- cut(dataInc$age, breaks=c(0,16,21,35,45,60,110))
```
Note that "8" is a missing value indicator for marital status. Always refer to the Codebook if you are not sure what a variable or a categorical variable value means.
```{r comment="", results='hold'}
dataInc$marital[dataInc$marital==8] <- NA
```
Factor variables in R put meaningful labels on categorical variables. Instead of working with the numbers 1-5 for marital status, let's assign the number values their actual corresponding names.
```{r comment="", results='hold'}
dataInc$marital <- factor(dataInc$marital, levels=1:5,
labels=c("married","widowed","divorced",
"separated","never married"))
dataInc$sex <- factor(dataInc$sex, levels=1:2,
labels=c("male","female"))
```
Let's get estimated counts by age group and sex.
```{r comment="", results='hold'}
aggregate(WGTVICDY~ageGroup+sex, data=dataInc, FUN=sum)
```
We can also find out common crime type by sex. As before, `aggregate()` will total up the weights, but as you see in the ageGroup/sex example above, `aggregate()` produces the results in a long form. Sometimes this is useful, but sometimes we want to have our results side-by-side. We will use `reshape()` to convert the "long format" results from `aggregate()` to a "wide format".
```{r comment="", results='hold'}
a <- aggregate(WGTVICDY~V4529+sex, data=dataInc, FUN=sum)
a <- reshape(a, timevar="sex", idvar="V4529", direction="wide")
a[is.na(a)] <- 0
names(a) <- c("crimeType","male","female")
a
```
We can then convert this result to column percentages. To obtain a column percentage, we divide counts for an individual cell by the total number of counts for the column. So, the sum of all the values in the male column should equal 100:
```{r comment=""}
temp <- a
temp$male <- with(temp, 100*male/ sum(male))
temp$female <- with(temp, 100*female/sum(female))
colSums(temp[,-1]) # check that the columns sum to 100
temp$ratio <- temp$female/temp$male
temp[order(-temp$ratio),]
```
Or we can compute row percentages to determine what percentage of each crime is male and female.
```{r comment="", results='hold'}
temp <- a
row.total <- with(temp, male+female)
temp$male <- with(temp, 100*male/ row.total)
temp$female <- with(temp, 100*female/row.total)
rowSums(temp[,-1]) # check that the rows sum to 100
temp$ratio <- temp$female/temp$male
temp[order(-temp$ratio),]
```