Skip to content

Commit

Permalink
Merge pull request #10 from chrbknudsen/main
Browse files Browse the repository at this point in the history
Descriptiv statistik klar til korrekturlæsning
  • Loading branch information
chrbknudsen authored Apr 16, 2024
2 parents 28e9533 + 5410a70 commit 8c25c88
Show file tree
Hide file tree
Showing 3 changed files with 309 additions and 5 deletions.
1 change: 1 addition & 0 deletions config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@ episodes:
- regression2.Rmd
- regression3.Rmd
- regression4.Rmd
- table1.Rmd

# Information for Learners
learners:
Expand Down
275 changes: 270 additions & 5 deletions episodes/descript-stat.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -29,19 +29,284 @@ associated with the lessons. They appear in the "Instructor View"

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Descriptive statistic involves summarising or describing a set of data.
It usually presents quantitative descriptions in a short form, and helps to
simplify large datasaets.

Hvad er deskriptive/summary-stats?
Most descriptive statistical parameters applies to just one variable in our
data, and includes:

lokalitet - median, gennemsnit
spredning - standardafvigelse, varians. IQR.
* Mean
* Median
* Mode
* Standard deviation
* Range
* Variance
* Quartiles - especially the Interquartile Range, IQR
* Skewness
* Kurtosis
* Percentiles

Men også forekomst af NA-værdier.
The easiest way to get summary statistics on data is to use the `summarise` function
from the `tidyverse` package.
```{r}
library(tidyverse)
```


In the following we are working with the `palmerpenguins` dataset:
```{r}
library(palmerpenguins)
```


## Mean
The average of all datapoints:

```{r}
penguins %>%
summarise(avg_mass = mean(body_mass_g, na.rm = T))
```
The variable contains missing data, encoded as NA, which we remove from the
data before calculating the mean.

Barring significant outliers, `mean` is an expression of position of the data.
This is the weight we would expect a random penguin in our dataset to have. We
also have three different species of penguins in the dataset, and they have
quite different average weights. There is also a significant difference in the
average weight for the two sexes. We will get to that at the end of this
segment.


## Median
Similarly to the average/mean, the `median` is an expression of the location
of the data. If we order our data by size, from the smallest to the largest
value, and locate the middle observation, we get the median. This is the
value that half of the observations is smaller than. And half the observations
is larger.

```{r}
penguins %>%
summarise(median = median(body_mass_g, na.rm = T))
```
We can note that the mean is larger that the median. This indicates that the data
is skewed, in this case toward the larger penguins.

## Mode

Mode is the most common, or frequently occurring, observation. R does not have
a build-in function for this, but we can easily find the mode by counting the
different observations,and locating the most common one. We typically do not use
this for continous variables. The mode of the `sex` variable in this dataset
can be found like this:
```{r}
penguins %>%
count(sex) %>%
arrange(desc(n))
```

We count the different values in the `sex` variable, and arrange the counts
in descending order (`desc`). The mode of the `sex` variable is male.

In this specific case, we note that the dataset is pretty balanced regarding the
two sexes.

## Standard deviation
The standard deviation is a measurement of the variation or dispersion in the
values. We have a mean, but how much variation do we actually have?

A histogram is typically a good way to visualise the data:
```{r}
penguins %>%
ggplot(aes(body_mass_g)) +
geom_histogram()
```

We see a large peak around 3500 gram, but there is a lot of variation in the data,
that is, it is spread out.

NOTE: We are still looking at three different species of penguins, with two sexes,
and it might be more reasonable to look only at one species, and one sex. We'll
get to that.

The standard deviation is easily found similarly to the previous examples:
```{r}
levels(palmerpenguins::penguins$sex)
penguins %>%
summarise(stddev = sd(body_mass_g, na.rm = T))
```
The standard deviation is found by subtracting the mean from every observation,
squaring the results, and calculating the mean of these squared deviations from
the mean. Finally we take the square-root of the sum, and get the standard deviation.

## Variance
The variance is similar to the standard deviation. Actually we have already
calculated it above; the standard deviation is the square-root of the variance.

Since the standard deviation occurs in several statistical tests, it is more
frequently used that the variance. It is also more intuitively relateable to
the mean.

NOTE - måske vi skal have variansen nævnt før standardafvigelsen?

The variance is calculated similarly to the previous examples:

```{r}
penguins %>%
summarise(var = var(body_mass_g, na.rm = T))
```

## Range
The range of the data is simply the smallest and larges observations.
Since this returns more than one value, we use the function reframe
instead of summarise:
```{r}
penguins %>%
reframe(range = range(body_mass_g, na.rm = T))
```

However it is typically more usefull to extract the two values to
separate columns in the output:
```{r}
penguins %>%
summarise(min = min(body_mass_g, na.rm = T),
max = max(body_mass_g, na.rm = T))
```
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor
This also illustrates for the learners that we can calculate more than one
summary statistics in one summarise function.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The range, like the variance and the standard deviation informs us of the
spread of the observations.

## Quartiles - especially the Interquartile Range, IQR

We saw the median. That was the value where 50% of the observations in the
dataset were smaller. That is also called the 50% quantile.

Similarly we have other quantiles. The 25% quantile is the value where 25% of
the observations are smaller. And the 75% quantile is where 75% of the values
are smaller. We are typically interested in the interquartile range, that is
the range, in which we find 50% of the observations, centered around the median.

Like previous summary statistics, we get it using the summarise function:

```{r}
penguins %>%
summarise(iqr = IQR(body_mass_g, na.rm = T))
```

It is often interesting to also know the values for the 25% quartile (also
called the first quartile), and the 75% quartile (aka third quartile).
The function `quantile` allows us to specify exactly which quantile we
want:

```{r}
quantile(penguins$body_mass_g, probs = .25, na.rm = T)
```
```{r}
quantile(penguins$body_mass_g, probs = .75, na.rm = T)
```
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor
probs because if we select a random penguin, we have a 25% chance of selecting
a penguin that weighs less than 3550 gram.
This ties in to percentiles and qq-plots.
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

## Skewness

We previously saw a histogram of the data, and noted that the observations
were skewed to the left, and that the "tail" on the right was longer than
on the left. That skewness can be quantised.

There is no function for skewness build into R, but we can get it from the
librar `e1071`

```{r}
library(e1071)
skewness(penguins$body_mass_g, na.rm = T)
```
The skewness is positive, indicating that the data are skewed to the left, just
as we saw. A negative skewness would indicate that the data skew to the right.

## Kurtosis

Og her skal vi nok lige have styr på den intuitive forståelse af det.

## Percentiles

Just as with the interquartile range, where we calculated the 25% and 75%
quantiles, we can calculate any quantile.

Here eg. the 1% quantile:

```{r}
quantile(penguins$body_mass_g, probs = .01, na.rm = T)
```
or, the value where 1% of the observations are smaller.

## Everythin Everywhere all at once

A lot of these descriptive values can be gotten for every variable in the
dataset using the `summary` function:
```{r}
summary(penguins)
```
Here we get the range, the 1st and 3rd quantiles (and from those the IQR),
the median and the mean and, rather useful, the number of missing values in
each variable.

We can also get all the descriptive values in one table, by adding more than
one summarising function to the summarise function:

```{r}
penguins %>%
summarise(min = min(body_mass_g, na.rm = T),
max = max(body_mass_g, na.rm = T),
mean = mean(body_mass_g, na.rm = T),
median = median(body_mass_g, na.rm = T),
stddev = sd(body_mass_g, na.rm = T),
var = var(body_mass_g, na.rm = T),
Q1 = quantile(body_mass_g, probs = .25, na.rm = T),
Q3 = quantile(body_mass_g, probs = .75, na.rm = T),
iqr = IQR(body_mass_g, na.rm = T),
skew = skewness(body_mass_g, na.rm = T),
kurtosis = kurtosis(body_mass_g, na.rm = T)
)
```
Which leads us to a table that is a bit too wide.

As noted, we have three different species of penguins in the dataset. Their
weight varies a lot. If we want to do the summarising on each for the species,
we can group the data by species, before summarising:

```{r}
penguins %>%
group_by(species) %>%
summarise(min = min(body_mass_g, na.rm = T),
max = max(body_mass_g, na.rm = T),
mean = mean(body_mass_g, na.rm = T),
median = median(body_mass_g, na.rm = T),
stddev = sd(body_mass_g, na.rm = T)
)
```
We have removed some summary statistics in order to get a smaller table.

Finally boxplots offers a way of visualising some of the summary statistics:
```{r}
penguins %>%
ggplot(aes(x=body_mass_g, y = sex)) +
geom_boxplot()
```
The boxplot shows us the median (the fat line in the middel of each box)m, the
1st and 3rd quartiles (the ends of the boxes), and the range, with the whiskers
at each end of the boxes, illustrating the minimum and maximum. Any observations,
more than 1.5 times the IQR from either the 1st or 3rd quartiles, are deemed as
outliers and would be plotted as individual points in the plot.

::::::::::::::::::::::::::::::::::::: keypoints

Expand Down
38 changes: 38 additions & 0 deletions episodes/table1.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
---
title: 'table1'
teaching: 10
exercises: 2
---

:::::::::::::::::::::::::::::::::::::: questions

- How do you write a lesson using R Markdown and `{sandpaper}`?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Explain how to use markdown with the new lesson template
- Demonstrate how to include pieces of code, figures, and nested challenge blocks

::::::::::::::::::::::::::::::::::::::::::::::::

## Introduction


:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor

denne har vi kun med hvis der er medicinere

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


::::::::::::::::::::::::::::::::::::: keypoints

- Use `.md` files for episodes when you want static content
- Use `.Rmd` files for episodes when you need to generate output
- Run `sandpaper::check_lesson()` to identify any issues with your lesson
- Run `sandpaper::build_lesson()` to preview your lesson locally

::::::::::::::::::::::::::::::::::::::::::::::::

0 comments on commit 8c25c88

Please sign in to comment.