-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathHomework 6.Rmd
224 lines (145 loc) · 13.8 KB
/
Homework 6.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
---
title: "Homework 6"
subtitle: <center> <h1>Multiple Linear Regression Additional Variable Types</h1> </center>
author: <center> Joshua Carpenter <center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
library(tidyverse)
```
## Data and Description
**Note that for the sake of length for this homework assignment, I am not having you check the model assumptions. You certainly can, if you would like, and in "real life" you would definitely need to do this prior to any statistical inference.**
**Note that you do not need to worry about making your plots look professional for this assignment. They should still, obviously, be readable, but you do not need to change the x- and y-axis limits. I trust you know how to do this by now, and this will save you some time.**
Macroeconomists often speculate that life expectancy is linked with the economic well-being of a country. Macroeconomists also hypothesize that Organisation for Economic Co-operation and Development (OECD) (an international think tank charged with promoting policies that will improve global social and economic well-being) members will have longer life expectancy. To test these hypotheses, the LifeExpectancy.txt data set (found on Canvas) contains the following information:
Variable | Description
-------- | -------------
LifeExp | Average life expectancy in years
Country | Country name
Group | Is the country a member of OECD, Africa, or other?
PPGDP | Per person GDP (on the log scale)
The Group variable indicates if the country is a member of the OECD, a member of the African continent, or belonging to neither group (other). Note that the Country variable is just for your reference - you will not use this variable in your model.
Download LifeExpectancy.txt, and put it in the same folder as this R Markdown file.
#### 0. Replace the text "< PUT YOUR NAME HERE >" (above next to "author:") with your full name.
#### 1. Read in the data set, and call the data frame "life". Print a summary of the data and make sure the data makes sense.
#### 2. Make the Group categorical variable a factor class (instead of a character class).
```{r}
life <- read.csv("LifeExpectancy.txt", sep = " ", row.names = 1)
life <- tibble(life) %>%
mutate(Group = factor(Group))
life
summary(life)
```
#### 3. Create and print a scatterplot with the response on the $y$-axis and the other continuous variable on the $x$-axis. Comment on the the relationship between these two variables.
```{r, fig.align='center'}
(scatterplot <- ggplot(data = life,
mapping = aes(x = PPGDP, y = LifeExp, color = Group)) +
geom_point() +
theme_minimal() +
theme(aspect.ratio = 1) +
ylab("Life Expectancy") +
xlab("Per Person GDP"))
```
There is clearly a strong linear relationship between these two variables.
#### 4. Create and print a boxplot with the response on the $y$-axis and the categorical variable on the $x$-axis. Comment on the the relationship between these two variables.
```{r, fig.align='center'}
ggplot(data = life, mapping = aes(x = Group, y = LifeExp)) +
geom_boxplot() +
theme_classic() +
ylab("Life Expectancy")
```
Countries in Africa have mmuch lower life expectancies than other countries and countries in the OECD are closely clustered with a higher average life expectancy.
#### 5. Create and print a color-coded scatterplot using all of the variables that will be in your model. Hint: plot the response on the $y$-axis, the other continuous variable on the $x$-axis, and color the points by the categorical variable.
```{r, fig.align='center'}
scatterplot
```
#### 6. Write out the general/theoretical model (using Greek letters/parameters) that you are thinking about applying to this data set (you will not write out the fitted model using coefficients, because you have not fit a model yet;)). DO NOT include interactions at this step. Remember, you will need to use dummy variables for Group. **USE "other" AS THE BASELINE CATEGORY**. Use variable names that are descriptive (not $y$, $x_1$, etc.).
$$\text{LifeExp}_i\sim N\left(\mu=\beta_0+\beta_1\cdot\text{PPGDP}_i+\beta_2\cdot I(\text{Group}_i=\text{Africa})+\beta_3\cdot I(\text{Group}_i=\text{OECD})$$
#### 7. Create dummy variables for the "africa" and "oecd" levels of Group.
```{r}
life <- life %>%
mutate(GroupAfrica = ifelse(Group == "africa", 1, 0),
GroupOECD = ifelse(Group == "oecd", 1, 0)) %>%
select(-Group)
```
#### 8. Fit a multiple linear regression model to the data (no transformations, interactions, etc.) **using the dummy variables you created**. *USE "other" AS THE BASELINE CATEGORY FOR GROUP*. Print a summary of the results.
```{r}
life_lm <- lm(LifeExp ~ PPGDP + GroupAfrica + GroupOECD, data = life)
summary(life_lm)
```
#### 9. Briefly interpret the intercept (like we did in class). **Note that you will need to use the word "average" (or similar) twice since you are predicting an average already.** You will need to do this here and with the questions following, when interpreting.
Countries with 0 GDP Per Person that are not in Africa and are not members of the OECD would theoretically have a mean life expectancy of 51.0 years on average.
#### 10. Briefly interpret the coefficient for PPGDP (log scale) (like we did in class). You do not need to un-transform anything - you can just write something like "per person GDP (log scale)" in your response.
For countries in the same group, we would expect a one unit increase in the log of per person GDP to result in a 2.9 year increase in average life expectancy.
#### 11. Briefly interpret the coefficient for I(Group=OECD).
Average life expectancy for countries in the OECD is 1.5 years higher than for countries with the same per person GDP not in the OECD and not in Africa.
#### 12. For equal per person GDP (log scale), how does life expectancy change for countries that are members of the OECD compared to countries that are on the African continent? Show how you obtained this number, and briefly interpret this number (like we did in class).
$$1.52983-(-12.29427)=13.8241$$
Average life expectancy for countries in the OECD is 13.8 years higher than for countries with the same per person GDP in Africa not in the OECD.
#### 13. Create 95% confidence intervals for all coefficients (use the `confint` function).
```{r}
confint(life_lm)
```
#### 14. Briefly interpret the 95% confidence interval for I(Group=Africa).
We are 95% confident that the the average life expectancy for countries in Africa is between 11.7 and 12.8 years lower than average life expectancy in other countries with the same per person GDP.
#### 15. Use the `anova` function to conduct a hypothesis test that tests some coefficients simultaneously. Specifically, test if Group has a significant effect on LifeExp. What do you conclude from the result of the test? Hint: you will need to create another linear model and compare it with the one you made previously.
```{r}
noGroup_lm <- lm(LifeExp ~ PPGDP, data = life)
anova(life_lm, noGroup_lm)
```
With a very small p-value, we conclude that Group has a significant effect on the model and should be included.
#### 16. Create a 95% confidence interval for the expected/average average life expectancy for a country in the OECD with an average per person GDP (log scale) of 9.5. Print the result, and briefly interpret this interval (like we did in class). (Use the `predict` function.)
```{r}
predict(life_lm, interval = "confidence",
newdata = data.frame(PPGDP = 9.5, GroupAfrica = 0, GroupOECD = 1))
```
We are 95% confident that the mean of average life expectancy for countries in the OECD with a per person GDP of 9.5 is between 79.4 and 80.2.
#### 17. Create a 95% prediction interval for the average life expectancy of a country in the OECD with an average per person GDP (log scale) of 9.5. Print the result, and briefly interpret this interval (like we did in class). (Use the `predict` function.)
```{r}
predict(life_lm, interval = "prediction",
newdata = data.frame(PPGDP = 9.5, GroupAfrica = 0, GroupOECD = 1))
```
We are 95% confident that if we observe a new country in the OECD with a per person GDP of 9.5, the average life expectancy for that country will be between 77.7 and 82.0.
#### 18. Plot the fitted model on the scatterplot with the two continuous variables on the axes, colored by the categorical variable. Hint: you should have 3 different lines on your plot, and you will *not* need to have different line types or point shapes (you *will* need to have different colors).
```{r, fig.align='center', message = FALSE}
scatterplot +
geom_line(mapping = aes(y = predict(life_lm, newdata = life)), size = 1)
```
#### 19. Write out the general/theoretical model (using Greek letters/parameters) for a model with PPGDP, Group, and an interaction between PPGDG and Group. Remember, you will need to use dummy variables for Group. **USE "other" AS THE BASELINE CATEGORY**. Use variable names that are descriptive (not $y$, $x_1$, etc.).
$$\text{LifeExp}_i\sim N\left(\mu=\beta_0+\beta_1\cdot\text{PPGDP}_i+\beta_2\cdot I(\text{Group}_i=\text{Africa})+\beta_3\cdot I(\text{Group}_i=\text{OECD})+\beta_4\cdot\text{PPGDP}_i\cdot I(\text{Group}_i=\text{Africa})+\beta_5\cdot\text{PPGDP}_i\cdot I(\text{Group}_i=\text{OECD}), \sigma\vphantom{a\atop b}\right)$$
#### 20. Fit a multiple linear regression model to the data **using the dummy variables you created**, and include an interaction term between PPGDP and Group. *USE "other" AS THE BASELINE CATEGORY FOR GROUP*. Print a summary of the results.
```{r, fig.align='center'}
life_lm_inter <- lm(LifeExp ~ PPGDP + GroupAfrica + GroupOECD + PPGDP:GroupAfrica + PPGDP:GroupOECD, data = life)
summary(life_lm_inter)
```
#### 21. Use the `anova` function to test if the overall interaction between PPGDP and Group is significant. Print the result. What do you conclude?
```{r}
anova(life_lm, life_lm_inter)
```
Based on the very small p-value, we conclude that the overall interaction is significant and should be included in the model.
#### 22. Plot the fitted model (with the interaction included) on the scatterplot with the two continuous variables on the axes, colored by the categorical variable. Hint: you should have 3 different lines on your plot, and you will *not* need to have different line types or point shapes (you *will* need to have different colors).
```{r, fig.align='center'}
scatterplot +
geom_smooth(method = "lm", se = FALSE)
```
#### 23. How did the fitted lines change when you included an interaction term compared with when you did not include an interaction term?
The slope of the line for countries in the OECD changed.
#### 24. What is the effect of PPGDP on LifeExp for countries in a country other than those in the OECD or Africa (i.e. in the "other" category)? You should report a number in a complete sentence (as done in class toward the end of the notes). Since this is a continuous-categorical interaction, and since we are focusing on the effect of the continuous variable, you should use the "one unit increase" terminology in your response.
For countries not in the OECD and not in Africa, we expect a one unit increase in PPGDP to result in a 2.94 year increase in life expectancy.
#### 25. What is the effect of PPGDP on LifeExp for countries in the OECD? You should report a number in a complete sentence (as done in class toward the end of the notes). Since this is a continuous-categorical interaction, and since we are focusing on the effect of the continuous variable, you should use the "one unit increase" terminology in your response.
For countries in the OECD, we expect a one unit increase PPGDP to result in a 1.99 year average increase in average life expectancy.
#### 26. What is the effect of PPGDP on LifeExp for countries in Africa? You should report a number in a complete sentence (as done in class toward the end of the notes). Since this is a continuous-categorical interaction, and since we are focusing on the effect of the continuous variable, you should use the "one unit increase" terminology in your response.
For countries in the OECD, we expect a one unit increase PPGDP to result in a 2.90 year average increase in average life expectancy.
#### 27. What is the effect of belonging to the OECD on LifeExp for countries with a PPGDP of 9? You should report a number in a complete sentence (as done in class toward the end of the notes).
For countries with a per person GDP of 9, we expect belonging to the OECD to result in an average increase of 2.72 years in average life expectancy.
#### 28. What is the effect of belonging to the OECD on LifeExp for countries with a PPGDP of 11? You should report a number in a complete sentence (as done in class toward the end of the notes).
For countries with a per person GDP of 9, we expect belonging to the OECD to result in an average increase of 0.81 years in average life expectancy.
#### 29. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.
I learned how to perform ANOVA between two linear models. As it happens, that is super easy!
#### 30. Briefly summarize what you learned from this analysis *to a non-statistician*. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
The purpose of this analysis was to come up with a method of predicting average life expectancy of a country based off of per person GDP and whether they are part of the OECD or are in Africa. We determined that we can make such a prediction with good accuracy. We determined the both per person GDP and which group the country is in have an effect on the country's average life expectancy. We also determined that what group a country is in changes how per person GDP affects average life expectancy.