-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathHomework 1.Rmd
135 lines (87 loc) · 7.86 KB
/
Homework 1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
---
title: "Homework 1 | STAT 330"
subtitle: <center> <h1>Simple Linear Regression</h1> </center>
author: <center> Joshua Carpenter <center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
# load any necessary packages here
library(tidyverse)
```
## Data and Description
Energy can be produced from wind using windmills. Choosing a site for a wind farm (i.e. the location of the windmills), however, can be a multi-million dollar gamble. If wind is inadequate at the site, then the energy produced over the lifetime of the wind farm can be much less than the cost of building the operation. Hence, accurate prediction of wind speed at a candidate site can be an important component in the decision to build or not to build. Since energy produced varies as the square of the wind speed, even small errors in prediction can have serious consequences.
One possible solution to help predict wind speed at a candidate site is to use wind speed at a nearby reference site. A reference site is a nearby location where the wind speed is already being monitored and should, theoretically, be similar to the candidate site. Using information from the reference site will allow windmill companies to know the wind speed at the candidate site without going through a costly data collection period, if the reference site is a good predictor.
The Windmill data set contains measurements of wind speed (in meters per second m/s) at a **candidate site (CSpd) (column 1)** and at an accompanying **reference site (RSpd) (column 2)** for 1,116 areas. Download the Windmill.txt file from Canvas, and put it in the same folder as this R Markdown file.
#### 0. Replace the text "< PUT YOUR NAME HERE >" (above next to "author:") with your full name.
#### 1. Briefly explain why simple linear regression is an appropriate tool to use in this situation.
We would like to know how a change in wind speed at a reference site affects the wind speed at a candidate site. That is a perfect scenario for linear regression. Since we only have one explanatory variable, we will use simple linear regression.
#### 2. Read in the data set, and call the data frame "wind". Print a summary of the data and make sure the data makes sense.
```{r, message=FALSE}
(wind <- read_table2("Windmill.txt"))
summary(wind)
```
#### 3. What is the outcome variable in this situation? (Think about which variable makes the most sense to be the response.)
The outcome variable is the wind speed at the candidate site.
#### 4. What is the explanatory variable in this situation?
The explanatory variable is the wind speed at the reference site.
#### 5. Create (and output) a scatterplot of the data with variables on the appropriate axes. Make you plot look professional and you follow the Homework Rules (under Module 0). (Make sure the axes have appropriate limits to capture the data nicely, make sure the axes labels are descriptive with units, etc.).
```{r, fig.align='center'}
(wind.plot <- ggplot(data = wind, mapping = aes(x = RSpd, y = CSpd)) +
geom_point() +
xlab("Wind Speed at Reference Site (m/s)") +
ylab("Wind Speed at Canidate Site (m/s)") +
xlim(0, 23) +
ylim(0, 25) +
coord_fixed() +
theme_minimal())
```
#### 6. Briefly describe the relationship between RSpd and CSpd. (Hint: you should use 2 or 3 key words.)
The two variables of interest appear to have a fairly strong, positive, linear correlation.
#### 7. Calculate the correlation coefficient for the two variables. Print the result.
```{r}
cor(wind$RSpd, wind$CSpd)
```
#### 8. Briefly interpret the number you calculated for the correlation coefficient (what is the direction and strength of the correlation?).
There is a moderate/strong positive correlation. As wind speed at one location increases, so does wind speed at the corresponding location.
#### 9. Mathematically write out the theoretical/general simple linear regression model for this data set (using parameters ($\beta$s), not estimates). Clearly explain which part of the model is deterministic and which part is random. Do not use "x" and "y" in your model - use variable names that are descriptive.
$\text{CanidateSpeed}_i=\beta_0+\beta_1\cdot\text{RefSpeed}_i+\epsilon_i$
The right-hand side of the equation constitutes the model. The $\epsilon_i$ is the error term or the random part of the model. The rest is the deterministic part.
#### 10. Add the OLS regression line to the scatterplot you created in 4. Print the result.
```{r, fig.align='center', message=FALSE}
wind.plot + geom_smooth(method = "lm", se = FALSE)
```
#### 11. Apply linear regression to the data, and save the residuals and fitted values to the `wind` data frame. Print out a summary of the results from the `lm` function.
```{r}
model <- lm(CSpd ~ RSpd, data = wind)
summary(model)
wind <- wind %>%
mutate(resid = resid(model), fitted_val = fitted(model))
```
#### 12. Briefly explain the rational behind the ordinary least-squares model fit.
The method of OLS regression finds a line that minimizes the squared difference between the predicted values and the actual values. Essentially, we minimize the variance of the data from the fitted line.
#### 13. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above (do not use parameters/$\beta$s). Do not use "x" and "y" in your model - use variable names that are fairly descriptive.
$\widehat{\text{CanidateSpeed}}_i=3.1412+0.7557\cdot\text{RefSpeed}_i$
#### 14. Interpret the coefficient for the slope.
For every increase of 1 meter per second in the wind speed at the reference site, the wind speed at the corresponding candidate site will increase, on average, 0.7557 meters per second.
#### 15. Interpret the coefficient for the intercept.
If there were a reference site with a wind speed of 0 meters per second, we would predict the average wind speed of the corresponding candidate sites to be 3.1412 meter per second.
#### 16. What is the average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 12 m/s? Show your code, and print the result.
```{r}
predict(model, tibble(RSpd = 12))
```
#### 17. Briefly explain why it would be wrong to answer this question: What is the average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 25 m/s?
Our regression model did not go that high and we have no idea if the model is accurate beyond the data that we used.
#### 18. Calculate the MSE, or the average squared variability of the residuals around the line. Show your code, and print the result.
```{r}
sum(wind$resid^2) / model$df.residual
```
#### 19. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.
I knew nearly everything in this analysis already, however, I did not know about the assumptions of linear regression, namely that the data is independent and normally distributed with a uniform standard deviation.
#### 20. Briefly summarize what you learned from this analysis *to a non-statistician*. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
The purpose of this analysis was to determine an effective way to predict wind speed at a candidate site to receive a windmill given wind speed data from a nearby reference site. A brief perusal of the data allowed us to determine that wind speed at a candidate site could be easily predicted by a simple formula. We determined that as average wind speed of a reference site increases, the average wind seed of a candidate site increases by about 75% of the increase of the reference site. We have yet to do an in-depth analysis of the precision of this prediction.