From 263624ba9e11fa1e9013332433b240690d2e5883 Mon Sep 17 00:00:00 2001 From: "Christian B. Knudsen" Date: Tue, 16 Apr 2024 12:59:58 +0200 Subject: [PATCH 1/2] initial commit --- config.yaml | 1 + episodes/table1.Rmd | 38 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 39 insertions(+) create mode 100644 episodes/table1.Rmd diff --git a/config.yaml b/config.yaml index ee7914c3..f09ecdf7 100644 --- a/config.yaml +++ b/config.yaml @@ -69,6 +69,7 @@ episodes: - regression2.Rmd - regression3.Rmd - regression4.Rmd +- table1.Rmd # Information for Learners learners: diff --git a/episodes/table1.Rmd b/episodes/table1.Rmd new file mode 100644 index 00000000..462181f6 --- /dev/null +++ b/episodes/table1.Rmd @@ -0,0 +1,38 @@ +--- +title: 'table1' +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- How do you write a lesson using R Markdown and `{sandpaper}`? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Explain how to use markdown with the new lesson template +- Demonstrate how to include pieces of code, figures, and nested challenge blocks + +:::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + + +:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor + +denne har vi kun med hvis der er medicinere + +:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: + + +::::::::::::::::::::::::::::::::::::: keypoints + +- Use `.md` files for episodes when you want static content +- Use `.Rmd` files for episodes when you need to generate output +- Run `sandpaper::check_lesson()` to identify any issues with your lesson +- Run `sandpaper::build_lesson()` to preview your lesson locally + +:::::::::::::::::::::::::::::::::::::::::::::::: + From 5410a708b469aed686fc5161ece60ca2f75ddc36 Mon Sep 17 00:00:00 2001 From: "Christian B. Knudsen" Date: Tue, 16 Apr 2024 14:46:21 +0200 Subject: [PATCH 2/2] ca. done. --- episodes/descript-stat.Rmd | 275 ++++++++++++++++++++++++++++++++++++- 1 file changed, 270 insertions(+), 5 deletions(-) diff --git a/episodes/descript-stat.Rmd b/episodes/descript-stat.Rmd index de165e19..1e3434d1 100644 --- a/episodes/descript-stat.Rmd +++ b/episodes/descript-stat.Rmd @@ -29,19 +29,284 @@ associated with the lessons. They appear in the "Instructor View" :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: +Descriptive statistic involves summarising or describing a set of data. +It usually presents quantitative descriptions in a short form, and helps to +simplify large datasaets. -Hvad er deskriptive/summary-stats? +Most descriptive statistical parameters applies to just one variable in our +data, and includes: -lokalitet - median, gennemsnit -spredning - standardafvigelse, varians. IQR. +* Mean +* Median +* Mode +* Standard deviation +* Range +* Variance +* Quartiles - especially the Interquartile Range, IQR +* Skewness +* Kurtosis +* Percentiles -Men også forekomst af NA-værdier. +The easiest way to get summary statistics on data is to use the `summarise` function +from the `tidyverse` package. +```{r} +library(tidyverse) +``` + + +In the following we are working with the `palmerpenguins` dataset: +```{r} +library(palmerpenguins) +``` + + +## Mean +The average of all datapoints: + +```{r} +penguins %>% + summarise(avg_mass = mean(body_mass_g, na.rm = T)) +``` +The variable contains missing data, encoded as NA, which we remove from the +data before calculating the mean. + +Barring significant outliers, `mean` is an expression of position of the data. +This is the weight we would expect a random penguin in our dataset to have. We +also have three different species of penguins in the dataset, and they have +quite different average weights. There is also a significant difference in the +average weight for the two sexes. We will get to that at the end of this +segment. + + +## Median +Similarly to the average/mean, the `median` is an expression of the location +of the data. If we order our data by size, from the smallest to the largest +value, and locate the middle observation, we get the median. This is the +value that half of the observations is smaller than. And half the observations +is larger. + +```{r} +penguins %>% + summarise(median = median(body_mass_g, na.rm = T)) +``` +We can note that the mean is larger that the median. This indicates that the data +is skewed, in this case toward the larger penguins. + +## Mode + +Mode is the most common, or frequently occurring, observation. R does not have +a build-in function for this, but we can easily find the mode by counting the +different observations,and locating the most common one. We typically do not use +this for continous variables. The mode of the `sex` variable in this dataset +can be found like this: +```{r} +penguins %>% + count(sex) %>% + arrange(desc(n)) +``` + +We count the different values in the `sex` variable, and arrange the counts +in descending order (`desc`). The mode of the `sex` variable is male. + +In this specific case, we note that the dataset is pretty balanced regarding the +two sexes. +## Standard deviation +The standard deviation is a measurement of the variation or dispersion in the +values. We have a mean, but how much variation do we actually have? + +A histogram is typically a good way to visualise the data: +```{r} +penguins %>% + ggplot(aes(body_mass_g)) + + geom_histogram() +``` + +We see a large peak around 3500 gram, but there is a lot of variation in the data, +that is, it is spread out. + +NOTE: We are still looking at three different species of penguins, with two sexes, +and it might be more reasonable to look only at one species, and one sex. We'll +get to that. + +The standard deviation is easily found similarly to the previous examples: ```{r} -levels(palmerpenguins::penguins$sex) +penguins %>% + summarise(stddev = sd(body_mass_g, na.rm = T)) ``` +The standard deviation is found by subtracting the mean from every observation, +squaring the results, and calculating the mean of these squared deviations from +the mean. Finally we take the square-root of the sum, and get the standard deviation. + +## Variance +The variance is similar to the standard deviation. Actually we have already +calculated it above; the standard deviation is the square-root of the variance. + +Since the standard deviation occurs in several statistical tests, it is more +frequently used that the variance. It is also more intuitively relateable to +the mean. + +NOTE - måske vi skal have variansen nævnt før standardafvigelsen? + +The variance is calculated similarly to the previous examples: + +```{r} +penguins %>% + summarise(var = var(body_mass_g, na.rm = T)) +``` + +## Range +The range of the data is simply the smallest and larges observations. +Since this returns more than one value, we use the function reframe +instead of summarise: +```{r} +penguins %>% + reframe(range = range(body_mass_g, na.rm = T)) +``` + +However it is typically more usefull to extract the two values to +separate columns in the output: +```{r} +penguins %>% + summarise(min = min(body_mass_g, na.rm = T), + max = max(body_mass_g, na.rm = T)) +``` +:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor +This also illustrates for the learners that we can calculate more than one +summary statistics in one summarise function. + +:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: + +The range, like the variance and the standard deviation informs us of the +spread of the observations. +## Quartiles - especially the Interquartile Range, IQR +We saw the median. That was the value where 50% of the observations in the +dataset were smaller. That is also called the 50% quantile. + +Similarly we have other quantiles. The 25% quantile is the value where 25% of +the observations are smaller. And the 75% quantile is where 75% of the values +are smaller. We are typically interested in the interquartile range, that is +the range, in which we find 50% of the observations, centered around the median. + +Like previous summary statistics, we get it using the summarise function: + +```{r} +penguins %>% + summarise(iqr = IQR(body_mass_g, na.rm = T)) + + +``` + +It is often interesting to also know the values for the 25% quartile (also +called the first quartile), and the 75% quartile (aka third quartile). +The function `quantile` allows us to specify exactly which quantile we +want: + +```{r} +quantile(penguins$body_mass_g, probs = .25, na.rm = T) +``` +```{r} +quantile(penguins$body_mass_g, probs = .75, na.rm = T) +``` +:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor +probs because if we select a random penguin, we have a 25% chance of selecting +a penguin that weighs less than 3550 gram. +This ties in to percentiles and qq-plots. +:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: + +## Skewness + +We previously saw a histogram of the data, and noted that the observations +were skewed to the left, and that the "tail" on the right was longer than +on the left. That skewness can be quantised. + +There is no function for skewness build into R, but we can get it from the +librar `e1071` + +```{r} +library(e1071) +skewness(penguins$body_mass_g, na.rm = T) +``` +The skewness is positive, indicating that the data are skewed to the left, just +as we saw. A negative skewness would indicate that the data skew to the right. + +## Kurtosis + +Og her skal vi nok lige have styr på den intuitive forståelse af det. + +## Percentiles + +Just as with the interquartile range, where we calculated the 25% and 75% +quantiles, we can calculate any quantile. + +Here eg. the 1% quantile: + +```{r} +quantile(penguins$body_mass_g, probs = .01, na.rm = T) + +``` +or, the value where 1% of the observations are smaller. + +## Everythin Everywhere all at once + +A lot of these descriptive values can be gotten for every variable in the +dataset using the `summary` function: +```{r} +summary(penguins) +``` +Here we get the range, the 1st and 3rd quantiles (and from those the IQR), +the median and the mean and, rather useful, the number of missing values in +each variable. + +We can also get all the descriptive values in one table, by adding more than +one summarising function to the summarise function: + +```{r} +penguins %>% + summarise(min = min(body_mass_g, na.rm = T), + max = max(body_mass_g, na.rm = T), + mean = mean(body_mass_g, na.rm = T), + median = median(body_mass_g, na.rm = T), + stddev = sd(body_mass_g, na.rm = T), + var = var(body_mass_g, na.rm = T), + Q1 = quantile(body_mass_g, probs = .25, na.rm = T), + Q3 = quantile(body_mass_g, probs = .75, na.rm = T), + iqr = IQR(body_mass_g, na.rm = T), + skew = skewness(body_mass_g, na.rm = T), + kurtosis = kurtosis(body_mass_g, na.rm = T) + ) +``` +Which leads us to a table that is a bit too wide. + +As noted, we have three different species of penguins in the dataset. Their +weight varies a lot. If we want to do the summarising on each for the species, +we can group the data by species, before summarising: + +```{r} +penguins %>% + group_by(species) %>% + summarise(min = min(body_mass_g, na.rm = T), + max = max(body_mass_g, na.rm = T), + mean = mean(body_mass_g, na.rm = T), + median = median(body_mass_g, na.rm = T), + stddev = sd(body_mass_g, na.rm = T) + ) +``` +We have removed some summary statistics in order to get a smaller table. + +Finally boxplots offers a way of visualising some of the summary statistics: +```{r} +penguins %>% + ggplot(aes(x=body_mass_g, y = sex)) + + geom_boxplot() +``` +The boxplot shows us the median (the fat line in the middel of each box)m, the +1st and 3rd quartiles (the ends of the boxes), and the range, with the whiskers +at each end of the boxes, illustrating the minimum and maximum. Any observations, +more than 1.5 times the IQR from either the 1st or 3rd quartiles, are deemed as +outliers and would be plotted as individual points in the plot. ::::::::::::::::::::::::::::::::::::: keypoints