Clean up purling
ismayc committed Oct 20, 2024
1 parent ecbf5f8 commit 43d9586
Showing 1 changed file with 15 additions and 26 deletions.
41 changes: 15 additions & 26 deletions 07-sampling.Rmd
@@ -177,7 +177,7 @@ Observe that for each student `group` the data frame provides their names, the n

Using again the R data visualization techniques introduced in Chapter \@ref(viz), we construct the histogram for all 33 sample proportions as shown in Figure \@ref(fig:samplingdistribution-tactile). Recall that each student has a sample of 50 balls using the same procedure and has calculated the proportion of red balls in each sample. The histogram is built using only those sample proportions. We do not need the individual information of each student or the number of red balls found. We constructed the histogram using `ggplot()` with `geom_histogram()`. To align the bins in the computerized histogram version so it matches the hand-drawn histogram shown in Figure \@ref(fig:sampling-exercise-5), the arguments `boundary = 0.4` and `binwidth = 0.05` were used. The former indicates that we want a binning scheme, such that, one of the bins' boundaries is at 0.4; the latter fixes the width of the bin to 0.05 units.

-```{r eval=FALSE}
+```{r echo=TRUE, fig.show='hide'}
ggplot(tactile_prop_red, aes(x = prop_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
labs(x = "Proportion of red balls in each sample",
@@ -200,11 +200,6 @@ We can also use this activity to introduce some statistical terminology. The pro
As shown in Figure \@ref(fig:samplingdistribution-tactile), different random samples produce different sample proportions. This phenomenon is called *sampling variation*\index{sampling!variation}. Furthermore, the histogram is a graphical representation of the *distribution* of sample proportions; it describes the sample proportions determined and how often they appear. The distribution of all possible sample proportions that can be found from random samples is called, appropriately, the *sampling distribution* of the sample proportion. The sampling distribution is central to the ideas we develop in this chapter.


-<!--
-Need to review Learning checks AV 9-13-23
--->
-
-
```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
@@ -248,7 +243,7 @@ We compute the proportion of red balls in our virtual sample. The code we use is


```{r echo=-c(1, 2)}
-# Neat way to remove from output particular code pieces!
+# Neat way to remove from output of particular code pieces with echo=-c(1, 2)!
prop_red_sample1 <- virtual_shovel |>
summarize(prop_red = mean(color == "red")) |>
pull(prop_red)
@@ -297,7 +292,7 @@ virtual_prop_red

As was the case in the tactile activity, there is sampling variation in the resulting 33 proportions from the virtual samples. As we did manually in Subsection \@ref(sampling-simulation), we construct a histogram with these sample proportions as shown in Figure \@ref(fig:samplingdistribution-virtual). The histogram helps us visualize the sampling distribution of the sample proportion. Observe again the histogram was constructed using `ggplot()`, `geom_histogram()`, and including the arguments `binwidth = 0.05` and `boundary = 0.4`.

-```{r eval=FALSE}
+```{r echo=TRUE, fig.show='hide'}
ggplot(virtual_prop_red, aes(x = prop_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
labs(x = "Sample proportion",
@@ -376,7 +371,7 @@ virtual_prop_red

As done previously, a histogram for these 1000 sample proportions is given in Figure \@ref(fig:samplingdistribution-virtual-1000).

-```{r eval=FALSE}
+```{r echo=TRUE, fig.show='hide'}
ggplot(virtual_prop_red, aes(x = prop_red)) +
geom_histogram(binwidth = 0.04, boundary = 0.4, color = "white") +
labs(x = "Sample proportion", title = "Histogram of 1000 sample proportions")
@@ -392,7 +387,7 @@ virtual_histogram +
```

The sample proportions represented by the histogram could be as low as 15% or as high as 60%, but those extreme proportions are rare. The most frequent proportions determined are those between 35% and 40%. Furthermore, the histogram now shows a symmetric and bell-shaped distribution that can be approximated well by a normal distribution.
-```{r echo=FALSE, results="asis"}
+```{r echo=FALSE, results="asis", purl=FALSE}
if(!is_latex_output())
cat('Please read the "Normal distribution" section of ([Appendix A online](https://moderndive.com/v2/appendixa)) for a brief discussion of this distribution and its properties.')
```
@@ -738,14 +733,14 @@ It is worth spending a moment understanding this result. If we take one random s

We present the equivalent results with samples of size 50 and 100:

-```{r eval=FALSE}
+```{r echo=TRUE, results='hide'}
virtual_prop_red_50 |>
summarize(E_Xbar_50 = mean(prop_red))
virtual_prop_red_100 |>
summarize(E_Xbar_100 = mean(prop_red))
```

-```{r echo=FALSE}
+```{r echo=FALSE, purl=FALSE}
e_xbar_50 <- virtual_prop_red_50 |>
summarize(E_Xbar_50 = mean(prop_red))
e_xbar_100 <- virtual_prop_red_100 |>
@@ -857,7 +852,7 @@ $$SE(\overline X) = \sqrt{\frac{0.375\cdot(1-0.375)}{100}} = 0.0484$$

```{r}
p <- 0.375
-sqrt(p*(1-p)/100)
+sqrt(p * (1 - p) / 100)
```

This value is nearly identical to the result found on the simulation above. We repeat this exercise, this time finding the estimated standard error of $\overline X$ from the simulations done earlier. These simulations are stored in data frames `virtual_prop_red_25` and `virtual_prop_red_50`, when the sample sizes used are $n=25$ and $n=50$, respectively:
@@ -1083,10 +1078,7 @@ include_graphics("images/sampling/almonds/twenty-five-almonds.png")

Since the total weight is 88.6 grams, as shown in the Figure \@ref(fig:twenty-five-almonds), the sample mean weight will be $88.6/25 = 3.544$. The `moderndive` \index{R packages!moderndive!almonds\_sample} package contains the information of this sample in the `almonds_sample` data frame. Here, we present the weight of the first 10 almonds in the sample:

-```{r echo=2:4, eval=FALSE}
-set.seed(2024)
-almonds_sample <- almonds_bowl |>
-rep_slice_sample(n = 25, reps = 1)
+```{r}
almonds_sample
```

@@ -1095,11 +1087,9 @@ num_almonds <- length(almonds_sample$weight)
```

The `almonds_sample` data frame in the `moderndive` package has $n=$ `r num_almonds` rows corresponding to each almond in the sample shown in Figure \@ref(fig:twenty-five-almonds).
-The first variable `replicate` indicates this is the first and only replicate since it is a single sample. The second variable `ID` gives an identification to the particular almond. The third column `weight` gives the corresponding weight for each almond in grams as a numeric variable, also known as a double (`dbl`).
-
-The distribution of the weights of these `r num_almonds` are shown in the histogram in Figure \@ref(fig:almonds-sample-histogram).
+The first variable `replicate` indicates this is the first and only replicate since it is a single sample. The second variable `ID` gives an identification to the particular almond. The third column `weight` gives the corresponding weight for each almond in grams as a numeric variable, also known as a double (`dbl`). The distribution of the weights of these `r num_almonds` are shown in the histogram in Figure \@ref(fig:almonds-sample-histogram).

-```{r almonds-sample-histogram, fig.cap="Distribution of weight for a sample of 25 almonds."}
+```{r almonds-sample-histogram, fig.cap="Distribution of weight for a sample of 25 almonds.", fig.height=ifelse(knitr::is_latex_output(), 1.5, 4)}
ggplot(almonds_sample, aes(x = weight)) +
geom_histogram(binwidth = 0.1, color = "white")
```
@@ -1132,7 +1122,7 @@ virtual_mean_weight

Figure \@ref(fig:sampling-mean-virtual-1000) presents the histogram for these sample means:

-```{r eval=FALSE}
+```{r echo=TRUE, fig.show='hide'}
ggplot(virtual_mean_weight, aes(x = mean_weight)) +
geom_histogram(binwidth = 0.04, boundary = 3.5, color = "white") +
labs(x = "Sample mean", title = "Histogram of 1000 sample means")
@@ -1192,7 +1182,6 @@ almonds_sample |>
So, once this sample was observed, the random variable $\overline X$ was realized as $\overline X = 3.67$, the **sample mean** was 3.67 grams.
Note that the possible values that $\overline X$ can take are all the possible sample means from samples of 25 almonds from the bowl. The chances of getting these sample means are determined by the configuration of almond weights in the bowl.

-
When $\overline X$ is constructed as the sample mean of a given random sample, the sampling distribution of the sample mean is precisely the distribution of $\overline X$. In this context, recall what we are interested in determining:

1. The center of the *distribution* of $\overline X$
@@ -1318,7 +1307,7 @@ almonds_bowl |>

And we do the same for our simulations next. Recall that the expected value of $\overline X$ is the value we would expect to observe, on average, when we take many sample means from random samples of a given size. It is located at the center of the distribution of $\overline X$. Similarly, the standard error of $\overline X$ is the measure of dispersion or magnitude of sampling variation. It is the standard deviation of the sample means calculated from all possible random samples of a given size. Using the data wrangling code `mean()` and `sd()` functions inside `summarize()` and applied to our simulation values, we can estimate the expected value and standard error of $\overline X$. Three sets of values are found, one for each of the corresponding sample sizes and presented in Table \@ref(tab:comparing-n1).

-```{r, eval=FALSE}
+```{r, results='hide'}
# n = 25
virtual_mean_weight_25 |>
summarize(E_Xbar_25 = mean(mean_weight), sd = sd(mean_weight))
@@ -1546,7 +1535,7 @@ prop_joined

As we did before, we build a histogram for these 1000 differences in Figure \@ref(fig:samplingdistribution-virtual-diff-1000).

-```{r eval=FALSE}
+```{r echo=TRUE, fig.show='hide'}
ggplot(prop_joined, aes(x = prop_diff)) +
geom_histogram(binwidth = 0.04, boundary = 0, color = "white") +
labs(x = "Difference in sample proportions",
@@ -1650,7 +1639,7 @@ if(!is_latex_output()) {
}
```

-```{r echo=FALSE, results='asis'}
+```{r echo=FALSE, results='asis', purl=FALSE}
if (is_latex_output())
cat("Check the online version of the book for a table that also includes the sampling distribution of each of these statistics using the Central Limit Theorem.")
```
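
Note on the chunk options changed in this commit: the following is a minimal sketch, assuming the `knitr` package is installed, of how `knitr::purl()` handles the three kinds of chunk header that appear in the diff. The toy document and its chunk labels (`old-style`, `new-style`, `helper`) are hypothetical and are not taken from `07-sampling.Rmd`. The commit states its intent only through its title, "Clean up purling", so reading the changes this way is an assumption rather than the author's documented reasoning.

```r
# Hypothetical toy document; none of these chunks come from 07-sampling.Rmd.
fence <- strrep("`", 3)  # build the chunk fence programmatically
rmd <- c(
  paste0(fence, "{r old-style, eval=FALSE}"),
  "hist(rnorm(100))",
  fence,
  "",
  paste0(fence, "{r new-style, echo=TRUE, fig.show='hide'}"),
  "plot(1:10)",
  fence,
  "",
  paste0(fence, "{r helper, echo=FALSE, purl=FALSE}"),
  "helper_value <- 42",
  fence
)

# purl() extracts the R code from the document; when the input is passed
# through `text =`, the extracted script is returned as a character string.
script <- knitr::purl(text = rmd, quiet = TRUE)
script_lines <- unlist(strsplit(script, "\n", fixed = TRUE))
cat(script_lines, sep = "\n")
```

In the extracted script, the `eval=FALSE` chunk should appear commented out, the `fig.show='hide'` chunk as live code, and the `purl=FALSE` chunk not at all. That would be consistent with replacing `eval=FALSE` display chunks by evaluated chunks whose figure or printed output is hidden, and with marking book-only helper chunks `purl=FALSE`.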