-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathHomework 8.Rmd
230 lines (146 loc) · 10.3 KB
/
Homework 8.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
---
title: "Homework 8"
subtitle: <center> <h1>Poisson Regression</h1> </center>
author: <center> Joshua Carpenter <center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
library(devtools)
install_github("mtnbikerjoshua/jcreg@development")
library(tidyverse)
library(jcreg)
library(gridExtra)
library(car)
```
## Data and Description
**NOTE: For this assignment, you do not need to make your graphs/plots look "pretty", meaning you do not need to worry about changing the axis limits, relabeling axes, etc.**
Bike sharing systems are the new generation of traditional bike rentals where the process from membership, rental and return back has become automatic. Through these systems, users are able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bikesharing programs around the world which is composed of over 500,000 bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.
The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect the volume of rentals. This dataset is composed from the two-year historical data corresponding to years 2011 and 2012 from the Capital Bikeshare system in Washington D.C. The daily counts of the number of bikes used was extracted and then the corresponding weather and seasonal information was added.
The data set has information for 731 days and contains the following variables:
Variable | Description
---------- | -------------
season | Season (Fall, Spring, Summer, Winter)
yr | Year (2011, 2012)
holiday | Was the day a holiday (Yes/No)?
workingday | Was the day a working day (Yes/No)? (Yes if the day is neither a weekend nor a holiday)
weathersit | Weather (Clear, Light Precip, Misty)
temp | Normalized temperature in Celsius
hum | Normalized humidity
windspeed | Normalized windspeed
cnt | Number of bikes rented
The data can be found in the Bikes data set on Canvas. Download Bikes.csv, and put it in the same folder as this R Markdown file.
#### 0. Replace the text "< PUT YOUR NAME HERE >" (above next to "author:") with your full name.
#### 1. Read in the data set, and call the data frame "bikes". Make sure the yr and character variables are factors (if they are not, you'll need to make them factors). Print a summary of the data and make sure the data makes sense.
```{r}
bikes <- read.csv("Bikes.csv", stringsAsFactors = TRUE) %>%
as_tibble()
```
#### 2. Explore the data: create a histogram for the response. *Briefly describe the shape of the distribution - you should mention (1) symmetry or skewness, (2) the number of modes, and (3) potential outliers.*
```{r, fig.align='center'}
ggplot(data = bikes, mapping = aes(x = cnt)) +
geom_histogram()
```
The data is symmetric with one clear mode and maybe two others. There are no clear outliers.
#### 3. Briefly explain why traditional multiple linear regression methods are not suitable for *this* data set. You should mention the four reasons we discussed in class (*your reasons should each refer to this data set*).
Since traditional regression can have negative or continuous predictions, that will cause a problem as you can't have negative or fractional bike rentals. Also, since there are more data points gathered on the low end of the spectrum (few bike rentals) the linearity assumption might be violated. Since bike rentals are not normally distributed, the residuals probably aren't either and you could make a smilar argument about homoscedasticity.
#### 4. Use a variable selection procedure to help you decide which, if any, variables to omit from the Poisson regression model you will soon fit. You may choose which selection method to use (best subsets, forward, backward, sequential replacement, LASSO, or elastic net) and which metric/criteria to use (AIC, BIC, or CV/PMSE).
```{r, fig.align='center', message=FALSE}
var_selection(bikes, method = c("backward", "seqrep"), family = poisson(link = "log"))
```
#### 5. Write out the Poisson regression model for this data set using the covariates that you see fit. You should use parameters/Greek letters (NOT the "fitted" model using numbers...since you have not fit a model yet;) ). Be sure to use indicator variables, if necessary. (You will need to split the equation on multiple lines to have it render properly as an HTML file.)
$$\text{rental_cnt}\stackrel{iid}{\sim}\text{Poisson}(\mu_i)$$
$$\log(\mu_i)=\beta_0+\beta_1\cdot I(\text{yr}_i=2012)+\beta_2\cdot\text{temp}_i+\beta_3\cdot\text{hum}_i+\beta_4\cdot\text{windspeed}_i+
\beta_5\cdot I(\text{holiday}_i=\text{Yes})$$
where $\mu_i$ is the average number of bicycle rentals per day.
#### 6. Fit a Poisson regression model using the covariates that you used in the previous question (use the `glm` function - do not just call the result from the variable selection procedure). Print a summary of the results.
```{r, fig.align='center'}
bikes_model <- glm(cnt ~ yr + temp + hum + windspeed + holiday,
data = bikes, family = poisson)
summary(bikes_model)
```
### The next several questions involve using diagnostics to check the Poisson regression model assumptions. For each assumption, (1) code the diagnostic(s) that I indicate (next to the assumption in parentheses) to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that.
#### 7. The X's vs log(y) are linear (use scatterplots and partial regression (added-variable) plots)
```{r, fig.align='center'}
scatterplot_bikes <- function(x, y, xlab, ylab) {
ggplot(mapping = aes(x, y)) +
geom_point() +
theme_bw() +
theme(aspect.ratio = 1) +
xlab(xlab) +
ylab(ylab)
}
bikes <- bikes %>%
mutate(logcnt = log(cnt))
bikes %>%
select(temp, hum, windspeed) %>%
mapply(FUN = scatterplot_bikes, x = ., xlab = names(.),
MoreArgs = list(y = bikes$logcnt, ylab = "log(cnt)"), SIMPLIFY = FALSE)
jcreg_av(bikes_model)
```
The scatterplot for `temp` looks like it has a non-linear correlation. I would want to try to transform before doing regression. this assumption is not met.
#### 8. The residuals are independent (no diagnostic tools - just think about how the data was collected and briefly write your thoughts)
There is potential for dependency across days, but I thing our predictors take care of that.
#### 9. The model describes all observations (i.e., there are no influential points) (use DFFITS)
```{r, fig.align='center'}
jcreg_dffits(bikes_model)
```
There are at least two clear outliers that could be influential (observations 69 and 668). This assumption is not met.
#### 10. Additional predictor variables are not required (no diagnostic tools - just think about the variables you have and if there are other variables you think would help predict the response)
This seems like a pretty comprehensive set of predictors. I can't think of any others that would help predict bike rentals.
#### 11. No multicollinearity (use variance inflation factors)
```{r, fig.align='center'}
vif(bikes_model)
```
The VIF's are all low (close to one). This assumption is met.
#### 12. Mean = Variance (no overdispersion/underdispersion) (use the three methods discussed in class)
```{r, fig.align='center'}
# Compare mean and variance
mean(bikes$cnt)
var(bikes$cnt)
# Conduct Chi-sq test
pchisq(bikes_model$deviance, bikes_model$df.residual, lower.tail = FALSE)
bikes_quasipoisson <- glm(cnt ~ yr + temp + hum + windspeed,
data = bikes, family = quasipoisson)
summary(bikes_quasipoisson)
```
This assumption is not met. There is very great overdispersion.
### Regardless of your assessment of the assumptions, proceed as if all assumptions were met.
#### 13. For the coefficient for holiday, compute (and output) $\beta_{holiday}$ (pull this value from the model output), $\exp\{\beta_{holiday}\}$, and $100 \times (\exp\{\beta_{holiday}\} - 1)%$.
```{r, fig.align='center'}
bikes_model$coefficients["holidayYes"]
```
#### 14. Interpret the coefficient for holiday based on the last TWO different ways we discussed in class (for negative coefficients).
*Interpretation 1:* < your response here >
*Interpretation 2:* < your response here >
#### 15. Create (and output) 95% confidence intervals for $\beta_k$, $\exp\{\beta_k\}$, and $100 \times (\exp\{\beta_k\} - 1)%$ for all predictors using the `confint` function.
```{r, fig.align='center'}
# your code here
```
#### 16. Interpret the 95% confidence intervals for temp for $\beta_{temp}$, $\exp\{\beta_{temp}\}$, and $100 \times (\exp\{\beta_{temp}\} - 1)%$ (three interpretations total).
*Interpretation using $\beta_{temp}$:* < your response here >
*Interpretation using $\exp\{\beta_{temp}\}$:* < your response here >
*Interpretation using $100 \times (\exp\{\beta_{temp}\} - 1)%$:* < your response here >
#### 17. Calculate (and output) a 95% confidence interval (and point estimate) for the predicted average number of bike rentals for a day where season = "Spring", yr = "2012", holiday = "No", workingday = "Yes", weathersit = "Misty", temp = 0.34, hum = 0.80, and windspeed = 0.18. Note that you may not need to use all of these values depending on the variables you chose to include in your model. *Interpret the interval.*
```{r, fig.align='center'}
# your code here
```
< your response here >
#### 18. Compute (and output) the likelihood ratio test statistic for the model, and compute (and output) the associated $p$-value. Based on the results, what do you conclude?
```{r, fig.align='center'}
# your code here
```
< your response here >
#### 19. Compute (and output) the pseudo $R^2$ value for the model.
```{r, fig.align='center'}
# your code here
```
#### 20. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.
< your response here >
#### 21. Briefly summarize what you learned from this analysis *to a non-statistician*. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
< your response here >