Instructions
- Fork this repository to your GitHub account.
- Write your solutions in R Markdown in a file named `index.Rmd`.
- Compile your solutions to an HTML file, `index.html`. You can view it at https://{username}.github.io/Assignment_04.
- When you are ready to submit your assignment, initiate a pull request. Title your pull request "Submission".
To update your fork from the upstream repository:
- On your fork, e.g. https://github.com/jrnold/Assignment_04, click on "New Pull request".
- Set your fork, jrnold/Assignment_04, as the base fork on the left, and UW-POLS503/Assignment_04 as the head fork on the right. In both cases the branch will be master. This means: compare any changes in the head fork that are not in the base fork. You will see the differences between the UW-POLS503 repository and your fork.
- Click on "Create Pull Request", and if there are no issues, click "Merge".
- A quick way is to use this link, but change the jrnold to your own username: https://github.com/jrnold/Assignment_04/compare/gh-pages...UW-POLS504:gh-pages.
We'll use these packages:

```r
library("MASS")
library("pols503")
library("foreign")
library("dplyr")
library("broom")
library("ggplot2")
```
These are some utility functions used in this assignment.

```r
mvnormal_data <- function(n,
                          mu = 0,
                          sigma = rep(1, length(mu)),
                          cor = diag(length(mu)),
                          empirical = TRUE,
                          colnames = paste0("x", seq_along(mu))) {
  # Convert the standard deviations and correlation matrix to a covariance
  # matrix (sdcor2cov() is a helper from the pols503 package loaded above),
  # draw n observations from the multivariate normal, and return them as a
  # named data frame.
  setNames(as.data.frame(MASS::mvrnorm(n, mu = mu,
                                       Sigma = sdcor2cov(sigma, cov2cor(cor)),
                                       empirical = empirical)),
           colnames)
}
```
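For example, a minimal check (with hypothetical values, not part of the assignment) that `mvnormal_data()` returns variables with the requested correlation:

```r
# Hypothetical example: two standard-normal variables with correlation 0.5.
# Because empirical = TRUE by default, the sample correlation is exact.
example_dat <- mvnormal_data(500, mu = c(0, 0),
                             cor = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
cor(example_dat$x1, example_dat$x2)
```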
Given fitted values $\hat{y}$ and a target population $R^2$, `r2_to_sigma()` returns the standard deviation of the errors needed to produce that $R^2$:

```r
r2_to_sigma <- function(yhat, r2) {
  ybar <- mean(yhat)
  # Model (explained) sum of squares
  ssm <- sum((yhat - ybar) ^ 2)
  # Choose the error sum of squares so that R^2 = SSM / (SSM + SSE)
  sse <- (1 - r2) / r2 * ssm
  # Convert to the standard deviation of the errors
  sqrt(sse / length(yhat))
}
```
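The calculation follows directly from the definition of $R^2$: holding the model sum of squares $\mathrm{SSM} = \sum_i (\hat{y}_i - \bar{y})^2$ fixed, solve for the error variance that yields the desired $R^2$,
$$
R^2 = \frac{\mathrm{SSM}}{\mathrm{SSM} + \mathrm{SSE}}
\quad \Longrightarrow \quad
\mathrm{SSE} = \frac{1 - R^2}{R^2} \, \mathrm{SSM},
\qquad
\sigma = \sqrt{\mathrm{SSE} / n} .
$$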
Summarize the results of the simulations:

```r
summarize_params <- function(.data) {
  # .data is a data frame of broom::tidy() results, one row per coefficient
  # per simulation iteration. Summarize the sampling distribution of each
  # coefficient and of its estimated standard error.
  .data %>%
    group_by(term) %>%
    summarize(estimate_mean = mean(estimate),
              estimate_sd = sd(estimate),
              std_error_mean = mean(std.error),
              std_error_sd = sd(std.error),
              # Monte Carlo standard errors of the summaries above
              estimate_mean_se = sd(estimate) / sqrt(n()),
              estimate_sd_se = sd(estimate) / sqrt(2 * (n() - 1)),
              std_error_mean_se = sd(std.error) / sqrt(n()),
              std_error_sd_se = sd(std.error) / sqrt(2 * (n() - 1)),
              iter = length(estimate))
}
```
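As a quick illustration of the input `summarize_params()` expects (a stacked data frame of `broom::tidy()` results), here is a hypothetical example using the built-in `mtcars` data; it is not part of the assignment:

```r
# Hypothetical illustration: fit the same model to bootstrap resamples of
# mtcars, stack the tidy() output, and summarize it by coefficient.
boot_fits <- bind_rows(lapply(seq_len(200), function(i) {
  resampled <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  tidy(lm(mpg ~ wt + hp, data = resampled))
}))
summarize_params(boot_fits)
```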
In each question, we will perform a Monte Carlo simulation in which we

1. Draw a sample from a population (one that often violates a CLR assumption).
2. Estimate OLS on that sample.
3. Repeat steps 1-2 many times to generate a sampling distribution of the coefficients.
4. Repeat steps 1-3 for several different populations to understand how the sampling distribution varies with features such as the number of observations in the sample, the correlation between covariates, etc.
Simulate data from
$$
Y = X_1 + \beta_2 X_2 + \varepsilon ,
$$
where $X_1$ and $X_2$ are drawn from a multivariate normal distribution with means of zero, standard deviations of one, and correlation $\rho$, and the standard deviation of $\varepsilon$ is set so that the population $R^2$ is 0.5.

In these simulations we will vary:

- $n$: The sample size
- $\rho$: The correlation between $x_1$ and $x_2$
- $\beta_2$: The coefficient on $x_2$
```r
simulate_ovb <- function(iter, n, rho, beta2) {
  # Desired population R^2 (used to generate sigma)
  r2 <- 0.5
  # Regression coefficients: intercept, beta_1, beta_2
  beta <- c(0, 1, beta2)
  # Correlation between x1 and x2
  cormat <- matrix(c(1, rho, rho, 1), nrow = 2, ncol = 2)
  # X drawn from a multivariate normal distribution
  dat <- mvnormal_data(n, mu = c(0, 0), cor = cormat)
  # yhat = X b
  yhat <- model.matrix(~ x1 + x2, data = dat) %*% beta
  # Regression standard deviation
  sigma <- r2_to_sigma(yhat, r2)
  results <- vector(mode = "list", length = iter)
  for (i in seq_len(iter)) {
    # Simulate y
    dat[["y"]] <- yhat + rnorm(n, sd = sigma)
    # Estimate OLS, omitting x2
    results[[i]] <- tidy(lm(y ~ x1, data = dat))
  }
  summarize_params(bind_rows(results))
}
```
```r
sims_ovb <-
  expand.grid(rho = c(0, 0.3, 0.7),
              beta2 = c(1, 0.5, 0),
              obs = c(30, 100, 1000)) %>%
  group_by(rho, beta2, obs) %>%
  do({
    simulate_ovb(2048, n = .$obs, rho = .$rho, beta2 = .$beta2)
  })
save(sims_ovb, file = "data/sims_ovb.RData")
```
Use the simulation results to answer the following questions. Use plots in your answers.
How do the values of $n$, $\rho$, and $\beta_2$ affect the following? (One way to start plotting these is sketched after the list.)

- The bias of $\hat{\beta}_1$, $\E(\hat{\beta}_1) - \beta_1$
- The standard deviation of $\hat{\beta}_1$, i.e. $\sd(\hat{\beta}_1)$.
- The bias of the standard error of $\hat{\beta}_1$, i.e. $\E(\se(\hat{\beta}_1)) - \sd(\hat{\beta}_1)$
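For example, a minimal sketch of one possible plot, assuming `sims_ovb` has been computed or loaded from `data/sims_ovb.RData`, and using the fact that the true $\beta_1 = 1$ in `simulate_ovb()`:

```r
# Sketch: bias of the coefficient on x1 by rho, faceted by beta2,
# with one line per sample size.
sims_ovb %>%
  filter(term == "x1") %>%
  mutate(bias = estimate_mean - 1) %>%
  ggplot(aes(x = rho, y = bias, colour = factor(obs))) +
  geom_point() +
  geom_line() +
  facet_wrap(~ beta2) +
  labs(x = expression(rho),
       y = expression(paste("Bias of ", hat(beta)[1])),
       colour = "n")
```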
The population regression function is
$$
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon ,
$$
where $\beta_0 = 0$, $\beta_1 = \beta_2 = 1$, $x_1$ and $x_2$ are drawn from a multivariate normal distribution with means of zero, standard deviations of one, and correlation $\rho$, and the standard deviation of $\varepsilon$ is set so that the population $R^2$ is 0.5.
The sample regression function is
$$
y_i = \hat\beta_0 + \hat\beta_1 z_{1,i} + \hat\beta_2 z_{2,i} + \hat\varepsilon_i .
$$
Instead of observing the variables $x_1$ and $x_2$ directly, we observe noisy measurements $z_1 = x_1 + \delta_1$ and $z_2 = x_2 + \delta_2$, where each $\delta_j$ is mean-zero normal noise independent of $x_1$, $x_2$, and $\varepsilon$.

Rather than vary the variances of the measurement errors directly, we parameterize them by the reliability of each measure, $r_j = \mathrm{Var}(x_j) / \mathrm{Var}(z_j)$. Since $\mathrm{Var}(x_j) = 1$, a reliability of $r_j$ corresponds to measurement error with standard deviation $\sqrt{(1 - r_j) / r_j}$ (the `meas_sd` calculation in the code below); $r_j = 1$ means $x_j$ is measured without error. A short check of this relationship follows the list below.

In these simulations we will vary:
- $n$: The sample size
- $r_1$: The reliability of $x_1$
- $r_2$: The reliability of $x_2$
- $\rho$: The correlation between $x_1$ and $x_2$
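As a quick sanity check of this parameterization (hypothetical values, not part of the assignment), a variable measured with reliability 0.7 should have a squared correlation of roughly 0.7 with its error-laden version:

```r
# Hypothetical check: reliability 0.7 implies measurement-error sd
# of sqrt((1 - 0.7) / 0.7); cor(x, z)^2 should then be close to 0.7.
r <- 0.7
x <- rnorm(10000)
z <- x + rnorm(10000, sd = sqrt((1 - r) / r))
cor(x, z) ^ 2
```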
```r
simulate_measurement_error <- function(iter, n, rho, reliability) {
  # Regression coefficients
  beta <- c(0, 1, 1)
  # Correlation between x1 and x2
  cormat <- matrix(c(1, rho, rho, 1), nrow = 2, ncol = 2)
  # Desired population R^2 (used to generate sigma)
  r2 <- 0.5
  # X drawn from a multivariate normal distribution
  dat <- mvnormal_data(n, mu = c(0, 0), cor = cormat)
  # yhat = X b
  yhat <- model.matrix(~ x1 + x2, data = dat) %*% beta
  # Regression standard deviation
  sigma <- r2_to_sigma(yhat, r2)
  # Measurement-error standard deviation implied by each reliability
  meas_sd <- sqrt((1 - reliability) / reliability)
  results <- vector(mode = "list", length = iter)
  dat2 <- dat
  for (i in seq_len(iter)) {
    # Simulate y
    dat2[["y"]] <- yhat + rnorm(n, sd = sigma)
    # Add measurement error to x1 and x2
    for (j in 1:2) {
      dat2[[paste0("x", j)]] <- dat[[paste0("x", j)]] + rnorm(n, sd = meas_sd[j])
    }
    # Estimate OLS on the error-laden covariates
    results[[i]] <- tidy(lm(y ~ x1 + x2, data = dat2))
  }
  summarize_params(bind_rows(results))
}
```
```r
sims_measurement_error <-
  expand.grid(reliability1 = c(0.01, 0.7, 1),
              reliability2 = c(0.01, 0.7, 1),
              rho = c(0, 0.3, 0.7),
              obs = c(30, 100, 1000)) %>%
  group_by(rho, obs, reliability1, reliability2) %>%
  do({
    simulate_measurement_error(2048,
                               rho = .$rho,
                               n = .$obs,
                               reliability = c(.$reliability1,
                                               .$reliability2))
  })
save(sims_measurement_error, file = "data/sims_measurement_error.RData")
```
Use the simulation results to answer the following questions. Use plots in your answers.
How do the values of $n$, $\rho$, $r_1$, and $r_2$ affect the following?

- The bias of $\hat{\beta}_1$, $\E(\hat{\beta}_1) - \beta_1$. In particular, focus on the following cases:
    - How the bias of $\hat\beta_{1}$ changes with $r_1$ when $r_2 = 1$
    - How the bias of $\hat\beta_{2}$ changes with $r_2$ when $r_1 = 1$
    - How the bias of $\hat\beta_{1}$ changes with $r_2$ when $x_1$ has some measurement error, $r_1 < 1$.
- The standard deviation of $\hat{\beta}_1$, i.e. $\sd(\hat{\beta}_1)$.
- The bias of the standard error of $\hat{\beta}_1$, i.e. $\E(\se(\hat{\beta}_1)) - \sd(\hat{\beta}_1)$
In these simulations our population is
$$
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon ,
$$
where $\beta_0 = 0$, $\beta_1 = \beta_2 = 1$, $x_1$ and $x_2$ are drawn from a multivariate normal distribution with means of zero, standard deviations of one, and correlation $\rho$, and the standard deviation of $\varepsilon$ is set so that the population $R^2$ is 0.5.
In these simulations we will vary:

- $n$: The sample size
- $\rho$: The correlation between $x_1$ and $x_2$
```r
simulate_multicollinearity <- function(iter, n, rho) {
  # Regression coefficients
  beta <- c(0, 1, 1)
  # Desired population R^2 (used to generate sigma)
  r2 <- 0.5
  # Correlation between x1 and x2
  cormat <- matrix(c(1, rho, rho, 1), nrow = 2, ncol = 2)
  # X drawn from a multivariate normal distribution
  dat <- mvnormal_data(n, mu = c(0, 0), cor = cormat)
  # yhat = X b
  yhat <- model.matrix(~ x1 + x2, data = dat) %*% beta
  # Regression standard deviation
  sigma <- r2_to_sigma(yhat, r2)
  results <- vector(mode = "list", length = iter)
  for (i in seq_len(iter)) {
    # Simulate y
    dat[["y"]] <- yhat + rnorm(n, sd = sigma)
    # Estimate OLS
    results[[i]] <- tidy(lm(y ~ x1 + x2, data = dat))
  }
  summarize_params(bind_rows(results))
}
```
```r
sims_multicollinearity <-
  expand.grid(rho = c(0, 0.7, 0.95),
              obs = c(30, 100, 1000)) %>%
  group_by(rho, obs) %>%
  do({
    simulate_multicollinearity(2048, rho = .$rho, n = .$obs)
  })
save(sims_multicollinearity, file = "data/sims_multicollinearity.RData")
```
Use the simulation results to answer the following questions. Use plots in your answers.
How do the values of $n$ and $\rho$ affect the following? (One way to compare the last two quantities is sketched after the list.)

- The bias of $\hat{\beta}_1$, $\E(\hat{\beta}_1) - \beta_1$
- The standard deviation of $\hat{\beta}_1$, i.e. $\sd(\hat{\beta}_1)$.
- The bias of the standard error of $\hat{\beta}_1$, i.e. $\E(\se(\hat{\beta}_1)) - \sd(\hat{\beta}_1)$
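For example, a minimal sketch comparing the simulated standard deviation of $\hat{\beta}_1$ with the average estimated standard error, assuming `sims_multicollinearity` has been computed or loaded from `data/sims_multicollinearity.RData`:

```r
# Sketch: sampling variability of the x1 coefficient (estimate_sd)
# versus the average estimated standard error (std_error_mean).
sims_multicollinearity %>%
  filter(term == "x1") %>%
  ggplot(aes(x = factor(rho))) +
  geom_point(aes(y = estimate_sd, colour = "sd of estimates")) +
  geom_point(aes(y = std_error_mean, colour = "mean std. error")) +
  facet_wrap(~ obs, scales = "free_y") +
  labs(x = expression(rho), y = "Standard deviation", colour = "")
```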
- How do Nunn and Wantchekon handle omitted variable bias? Replicate their calculations for at least one regression.
- How would measurement error in the measure of trust used by Nunn and Wantchekon affect their estimate of the effect of slave exports on trust? In their measures of slave exports? In the control variables?