Merge pull request #10 from chrbknudsen/main

Descriptiv statistik klar til korrekturlæsning
KUBDatalab · Apr 16, 2024 · 8c25c88 · 8c25c88
2 parents 28e9533 + 5410a70
commit 8c25c88
Show file tree

Hide file tree

Showing 3 changed files with 309 additions and 5 deletions.
diff --git a/config.yaml b/config.yaml
@@ -69,6 +69,7 @@ episodes:
 - regression2.Rmd
 - regression3.Rmd
 - regression4.Rmd
+- table1.Rmd
 
 # Information for Learners
 learners: 

diff --git a/episodes/descript-stat.Rmd b/episodes/descript-stat.Rmd
@@ -29,19 +29,284 @@ associated with the lessons. They appear in the "Instructor View"
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 
+Descriptive statistic involves summarising or describing a set of data. 
+It usually presents quantitative descriptions in a short form, and helps to
+simplify large datasaets.
 
-Hvad er deskriptive/summary-stats?
+Most descriptive statistical parameters applies to just one variable in our
+data, and includes:
 
-lokalitet - median, gennemsnit
-spredning - standardafvigelse, varians. IQR.
+* Mean
+* Median
+* Mode
+* Standard deviation
+* Range
+* Variance
+* Quartiles - especially the Interquartile Range, IQR
+* Skewness
+* Kurtosis
+* Percentiles
 
-Men også forekomst af NA-værdier.
+The easiest way to get summary statistics on data is to use the `summarise` function
+from the `tidyverse` package.
+```{r}
+library(tidyverse)
+```
+
+
+In the following we are working with the `palmerpenguins` dataset:
+```{r}
+library(palmerpenguins)
+```
+
+
+## Mean
+The average of all datapoints:
+
+```{r}
+penguins %>% 
+  summarise(avg_mass = mean(body_mass_g, na.rm = T))
+```
+The variable contains missing data, encoded as NA, which we remove from the
+data before calculating the mean.
+
+Barring significant outliers, `mean` is an expression of position of the data.
+This is the weight we would expect a random penguin in our dataset to have. We
+also have three different species of penguins in the dataset, and they have 
+quite different average weights. There is also a significant difference in the
+average weight for the two sexes. We will get to that at the end of this 
+segment.
+
+
+## Median
+Similarly to the average/mean, the `median` is an expression of the location 
+of the data. If we order our data by size, from the smallest to the largest
+value, and locate the middle observation, we get the median. This is the 
+value that half of the observations is smaller than. And half the observations
+is larger.
+
+```{r}
+penguins %>% 
+  summarise(median = median(body_mass_g, na.rm = T))
+```
+We can note that the mean is larger that the median. This indicates that the data
+is skewed, in this case toward the larger penguins. 
+
+## Mode
+
+Mode is the most common, or frequently occurring, observation. R does not have 
+a build-in function for this, but we can easily find the mode by counting the
+different observations,and locating the most common one. We typically do not use
+this for continous variables. The mode of the `sex` variable in this dataset
+can be found like this:
+```{r}
+penguins %>% 
+  count(sex) %>% 
+  arrange(desc(n))
+```
+
+We count the different values in the `sex` variable, and arrange the counts
+in descending order (`desc`). The mode of the `sex` variable is male.
+
+In this specific case, we note that the dataset is pretty balanced regarding the
+two sexes.
 
+## Standard deviation
+The standard deviation is a measurement of the variation or dispersion in the
+values. We have a mean, but how much variation do we actually have?
+
+A histogram is typically a good way to visualise the data:
+```{r}
+penguins %>% 
+  ggplot(aes(body_mass_g)) +
+  geom_histogram()
+```
+
+We see a large peak around 3500 gram, but there is a lot of variation in the data,
+that is, it is spread out.
+
+NOTE: We are still looking at three different species of penguins, with two sexes,
+and it might be more reasonable to look only at one species, and one sex. We'll
+get to that.
+
+The standard deviation is easily found similarly to the previous examples:
 ```{r}
-levels(palmerpenguins::penguins$sex)
+penguins %>% 
+  summarise(stddev = sd(body_mass_g, na.rm = T))
 ```
+The standard deviation is found by subtracting the mean from every observation,
+squaring the results, and calculating the mean of these squared deviations from
+the mean. Finally we take the square-root of the sum, and get the standard deviation.
+
+## Variance
+The variance is similar to the standard deviation. Actually we have already 
+calculated it above; the standard deviation is the square-root of the variance.
+
+Since the standard deviation occurs in several statistical tests, it is more
+frequently used that the variance. It is also more intuitively relateable to 
+the mean.
+
+NOTE - måske vi skal have variansen nævnt før standardafvigelsen?
+
+The variance is calculated similarly to the previous examples:
+
+```{r}
+penguins %>% 
+  summarise(var = var(body_mass_g, na.rm = T))
+```
+
+## Range
+The range of the data is simply the smallest and larges observations.
+Since this returns more than one value, we use the function reframe
+instead of summarise:
+```{r}
+penguins %>% 
+  reframe(range = range(body_mass_g, na.rm = T))
+```
+
+However it is typically more usefull to extract the two values to 
+separate columns in the output:
+```{r}
+penguins %>% 
+  summarise(min = min(body_mass_g, na.rm = T),
+            max = max(body_mass_g, na.rm = T))
+```
+:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor
+This also illustrates for the learners that we can calculate more than one
+summary statistics in one summarise function.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+
+The range, like the variance and the standard deviation informs us of the
+spread of the observations.
 
+## Quartiles - especially the Interquartile Range, IQR
 
+We saw the median. That was the value where 50% of the observations in the 
+dataset were smaller. That is also called the 50% quantile. 
+
+Similarly we have other quantiles. The 25% quantile is the value where 25% of
+the observations are smaller. And the 75% quantile is where 75% of the values
+are smaller. We are typically interested in the interquartile range, that is
+the range, in which we find 50% of the observations, centered around the median.
+
+Like previous summary statistics, we get it using the summarise function:
+
+```{r}
+penguins %>% 
+  summarise(iqr = IQR(body_mass_g, na.rm = T))
+
+
+```
+
+It is often interesting to also know the values for the 25% quartile (also 
+called the first quartile), and the 75% quartile (aka third quartile).
+The function `quantile` allows us to specify exactly which quantile we 
+want:
+
+```{r}
+quantile(penguins$body_mass_g, probs = .25, na.rm = T)
+```
+```{r}
+quantile(penguins$body_mass_g, probs = .75, na.rm = T)
+```
+:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor
+probs because if we select a random penguin, we have a 25% chance of selecting
+a penguin that weighs less than 3550 gram.
+This ties in to percentiles and qq-plots.
+::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Skewness
+
+We previously saw a histogram of the data, and noted that the observations 
+were skewed to the left, and that the "tail" on the right was longer than
+on the left. That skewness can be quantised.
+
+There is no function for skewness build into R, but we can get it from the
+librar `e1071`
+
+```{r}
+library(e1071)
+skewness(penguins$body_mass_g, na.rm = T)
+```
+The skewness is positive, indicating that the data are skewed to the left, just
+as we saw. A negative skewness would indicate that the data skew to the right.
+
+## Kurtosis
+
+Og her skal vi nok lige have styr på den intuitive forståelse af det.
+
+## Percentiles
+
+Just as with the interquartile range, where we calculated the 25% and 75% 
+quantiles, we can calculate any quantile. 
+
+Here eg. the 1% quantile:
+
+```{r}
+quantile(penguins$body_mass_g, probs = .01, na.rm = T)
+
+```
+or, the value where 1% of the observations are smaller.
+
+## Everythin Everywhere all at once
+
+A lot of these descriptive values can be gotten for every variable in the 
+dataset using the `summary` function:
+```{r}
+summary(penguins)
+```
+Here we get the range, the 1st and 3rd quantiles (and from those the IQR), 
+the median and the mean and, rather useful, the number of missing values in 
+each variable.
+
+We can also get all the descriptive values in one table, by adding more than
+one summarising function to the summarise function:
+
+```{r}
+penguins %>% 
+  summarise(min = min(body_mass_g, na.rm = T),
+            max = max(body_mass_g, na.rm = T),
+            mean = mean(body_mass_g, na.rm = T),
+            median = median(body_mass_g, na.rm = T),
+            stddev = sd(body_mass_g, na.rm = T),
+            var = var(body_mass_g, na.rm = T),
+            Q1 = quantile(body_mass_g, probs = .25, na.rm = T),
+            Q3 = quantile(body_mass_g, probs = .75, na.rm = T),
+            iqr = IQR(body_mass_g, na.rm = T),
+            skew = skewness(body_mass_g, na.rm = T),
+            kurtosis = kurtosis(body_mass_g, na.rm = T)
+  )
+```
+Which leads us to a table that is a bit too wide.
+
+As noted, we have three different species of penguins in the dataset. Their 
+weight varies a lot. If we want to do the summarising on each for the species,
+we can group the data by species, before summarising:
+
+```{r}
+penguins %>% 
+  group_by(species) %>% 
+  summarise(min = min(body_mass_g, na.rm = T),
+            max = max(body_mass_g, na.rm = T),
+            mean = mean(body_mass_g, na.rm = T),
+            median = median(body_mass_g, na.rm = T),
+            stddev = sd(body_mass_g, na.rm = T)
+  )
+```
+We have removed some summary statistics in order to get a smaller table.
+
+Finally boxplots offers a way of visualising some of the summary statistics:
+```{r}
+penguins %>% 
+  ggplot(aes(x=body_mass_g, y = sex)) +
+  geom_boxplot()
+```
+The boxplot shows us the median (the fat line in the middel of each box)m, the 
+1st and 3rd quartiles (the ends of the boxes), and the range, with the whiskers
+at each end of the boxes, illustrating the minimum and maximum. Any observations,
+more than 1.5 times the IQR from either the 1st or 3rd quartiles, are deemed as 
+outliers and would be plotted as individual points in the plot.
 
 ::::::::::::::::::::::::::::::::::::: keypoints 
 

diff --git a/episodes/table1.Rmd b/episodes/table1.Rmd
@@ -0,0 +1,38 @@
+---
+title: 'table1'
+teaching: 10
+exercises: 2
+---
+
+:::::::::::::::::::::::::::::::::::::: questions 
+
+- How do you write a lesson using R Markdown and `{sandpaper}`?
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::: objectives
+
+- Explain how to use markdown with the new lesson template
+- Demonstrate how to include pieces of code, figures, and nested challenge blocks
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+
+:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor
+
+denne har vi kun med hvis der er medicinere
+
+::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+::::::::::::::::::::::::::::::::::::: keypoints 
+
+- Use `.md` files for episodes when you want static content
+- Use `.Rmd` files for episodes when you need to generate output
+- Run `sandpaper::check_lesson()` to identify any issues with your lesson
+- Run `sandpaper::build_lesson()` to preview your lesson locally
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+