diff --git a/index.Rmd b/index.Rmd
index 3349b4d..475016a 100644
--- a/index.Rmd
+++ b/index.Rmd
@@ -4,7 +4,7 @@ output:
   learnr::tutorial:
     progressive: true
     allow_skip: true
-    df_print: default
+    df_print: paged
 runtime: shiny_prerendered
 description: >
   Learn about the basics of summarising your data and merging different data sets together
@@ -107,25 +107,25 @@ For this workbook we will be using four relational data tables, each storing dat
 At the top level we have some data about the Villages in our study

 ```{r, echo = FALSE}
-Village_data %>%knitr::kable()
+Village_data
 ```

 Below this we have information on a number of farmers who took part in our study

 ```{r, echo = FALSE}
-Farmer_data %>%knitr::kable()
+Farmer_data
 ```

 Further down still, we have our plot level data

 ```{r, echo = FALSE}
-Plot_data %>%knitr::kable()
+Plot_data
 ```

 Lastly, we have some further information on fertilizers.

 ```{r, echo = FALSE}
-Fertiliser_data %>%knitr::kable()
+Fertiliser_data
 ```


@@ -212,10 +212,8 @@ Sum2 <- Plot_data %>%
             avg_area = mean(size, na.rm = TRUE),
             sd = sd(size, na.rm = TRUE),
             nplots = n())
-```

-```{r, echo = FALSE}
-Sum2 %>%knitr::kable()
+Sum2
 ```

 There are a number of other summarise functions that can be used to apply the same function to multiple columns rather than go through one by one. For more on this, please follow this [link](https://dplyr.tidyverse.org/reference/summarise_all.html)
@@ -238,10 +236,8 @@ Sum3 <- Plot_data %>%
          sd = sd(size, na.rm = TRUE),
          nplots = n()) %>%
   arrange(farmer_id)
-```

-```{r, echo = FALSE}
-Sum3 %>%head(10) %>%knitr::kable()
+Sum3
 ```


@@ -250,10 +246,8 @@ This could be useful for creating a new variable that is at one level, but is al
 ```{r}
 Sum3 <- Sum3 %>%
   mutate(plot_area_prop = size/total_area)
-```

-```{r,echo = FALSE}
-Sum3 %>%head(10) %>%knitr::kable()
+Sum3
 ```

 ## Binding data
@@ -531,10 +525,8 @@ So in this case, we need an inner join. Because we only want to keep the rows th
 ```{r}
 Plot_data <- Plot_data %>%
   inner_join(Farmer_data, by = "farmer_id")
-```

-```{r, echo = FALSE}
-Plot_data %>%head(10) %>%knitr::kable()
+Plot_data
 ```

 So we have successfully brought down the farmer data, including both the village_id and the farmers names.
@@ -544,10 +536,8 @@ We could extend this further and bring down the village level information as wel
 ```{r}
 Plot_data <- Plot_data %>%
   inner_join(Village_data, by = "village_id")
-```

-```{r, echo = FALSE}
-Plot_data %>%head(10) %>%knitr::kable()
+Plot_data
 ```

 Now we can use `group_by` and `summarise` to calculate that village level plot average
diff --git a/index.html b/index.html
index ca61a3a..1f3dec4 100644
--- a/index.html
+++ b/index.html
@@ -147,560 +147,33 @@
For this workbook we will be using four relational data tables, each storing data regarding different units of analysis. However, each of them can be linked to the others across this hierarchy.
At the top level we have some data about the Villages in our study
-| village_id | village_name | region_name | population |
-|---|---|---|---|
-| 1 | Springfield | A | 34387 |
-| 2 | Langley Falls | A | 462736 |
-| 3 | New New York | A | 134412 |
-| 4 | Pawnee | B | 446522 |
-| 5 | Balamory | B | 341729 |
Below this we have information on a number of farmers who took part in our study
-| farmer_id | village_id | name |
-|---|---|---|
-| 1 | 3 | Saadiq |
-| 2 | 3 | Jelicha |
-| 3 | 1 | Briana |
-| 4 | 5 | Daeun |
-| 5 | 5 | Ruwaida |
-| 6 | 2 | Fawzi |
-| 7 | 2 | Yashna |
-| 8 | 1 | Tara |
-| 9 | 5 | Melina |
-| 10 | 5 | Jeelaan |
-| 11 | 1 | Razeena |
-| 12 | 1 | Sidqi |
-| 13 | 5 | Eric |
-| 14 | 5 | Ryann |
-| 15 | 2 | Kingsley |
-| 16 | 2 | Rifqa |
-| 17 | 1 | Denise |
-| 18 | 4 | Suhaib |
-| 19 | 1 | Daniel |
-| 20 | 4 | Sara |
Further down still, we have our plot level data
-| plot_id | farmer_id | size | fertiliser_1 | fertiliser_2 |
-|---|---|---|---|---|
-| 1 | 10 | 3.1 | 4 | 2 |
-| 2 | 17 | 4.1 | 2 | 1 |
-| 3 | 12 | 2.9 | 2 | 3 |
-| 4 | 7 | 2.5 | 4 | 3 |
-| 5 | 8 | 3.5 | 2 | 1 |
-| 6 | 1 | 1.9 | 4 | 1 |
-| 7 | 20 | 2.1 | 1 | 1 |
-| 8 | 3 | 3.9 | 1 | 1 |
-| 9 | 11 | 2.3 | 4 | 2 |
-| 10 | 1 | 3.3 | 4 | 3 |
-| 11 | 15 | 4.3 | 3 | 1 |
-| 12 | 6 | 1.3 | 4 | 2 |
-| 13 | 9 | 1.3 | 4 | 3 |
-| 14 | 5 | 1.7 | 1 | 1 |
-| 15 | 15 | 1.3 | 2 | 1 |
-| 16 | 18 | 2.7 | 4 | 1 |
-| 17 | 3 | 3.1 | 1 | 1 |
-| 18 | 3 | 1.5 | 1 | 1 |
-| 19 | 7 | 4.3 | 3 | 1 |
-| 20 | 15 | 3.5 | 1 | 1 |
-| 21 | 3 | 3.3 | 3 | 1 |
-| 22 | 20 | 0.7 | 1 | 1 |
-| 23 | 12 | 0.5 | 2 | 3 |
-| 24 | 18 | 0.9 | 2 | 1 |
-| 25 | 9 | 2.1 | 2 | 3 |
-| 26 | 19 | 1.5 | 4 | 2 |
-| 27 | 18 | 3.7 | 3 | 1 |
-| 28 | 19 | 4.3 | 1 | 1 |
-| 29 | 16 | 1.3 | 3 | 1 |
-| 30 | 7 | 2.1 | 1 | 1 |
-| 31 | 16 | 2.7 | 1 | 1 |
-| 32 | 20 | 4.5 | 1 | 1 |
-| 33 | 17 | 1.9 | 3 | 4 |
-| 34 | 20 | 1.1 | 1 | 1 |
-| 35 | 4 | 3.7 | 3 | 2 |
-| 36 | 2 | 2.1 | 2 | 1 |
-| 37 | 17 | 4.5 | 1 | 1 |
-| 38 | 1 | 0.5 | 4 | 2 |
-| 39 | 12 | 4.5 | 4 | 1 |
-| 40 | 7 | 4.3 | 3 | 4 |
-| 41 | 9 | 2.3 | 3 | 4 |
-| 42 | 14 | 3.1 | 4 | 2 |
-| 43 | 2 | 2.3 | 4 | 1 |
-| 44 | 2 | 4.5 | 2 | 4 |
-| 45 | 20 | 4.5 | 4 | 1 |
-| 46 | 18 | 4.7 | 1 | 1 |
-| 47 | 2 | 1.5 | 3 | 2 |
-| 48 | 11 | 3.9 | 2 | 4 |
-| 49 | 3 | 3.5 | 4 | 3 |
-| 50 | 3 | 0.7 | 4 | 3 |
Below this we have information on a number of farmers who took part in our study
+Further down still, we have our plot level data
+Lastly, we have some further information on fertilizers.
-| fertiliser_id | name | price_per_ha |
-|---|---|---|
-| 1 | None | 0 |
-| 2 | Compost | 12 |
-| 3 | Manure | 8 |
-| 4 | Chemical | 25 |
With any data analysis, at some point you are going to have to summarise your data in some way. Sometimes you may need to even do this prior to analysis as part of your data cleaning process. such as for the generation of new variables.This is certainly to be true if you are handling relational data.
+With any data analysis, at some point you are going to have to summarise your data in some way. Sometimes you may even need to do this prior to analysis as part of your data cleaning process, such as for the generation of new variables. This is certain to be true if you are handling relational data.
More often than not, you will need to do this summary by some sort of dis/aggregation variable.
In order to do that in R, you first need to know how to group your data.
## # A tibble: 50 x 5
-## # Groups: farmer_id [19]
-## plot_id farmer_id size fertiliser_1 fertiliser_2
-## <int> <int> <dbl> <int> <dbl>
-## 1 1 10 3.1 4 2
-## 2 2 17 4.1 2 1
-## 3 3 12 2.9 2 3
-## 4 4 7 2.5 4 3
-## 5 5 8 3.5 2 1
-## 6 6 1 1.9 4 1
-## 7 7 20 2.1 1 1
-## 8 8 3 3.9 1 1
-## 9 9 11 2.3 4 2
-## 10 10 1 3.3 4 3
-## # ... with 40 more rows
+Looking at the data, it will look like nothing has changed. This is because the grouping is implicit: it is in essence now part of the metadata of the data. But R will recognise that this grouping exists within the dataset and perform operations accordingly.
-Indeed if we take a look at the structure of our data using str
, we will see that rather than our data being a data.frame (the standard data format in R), it is now a grouped_df. From this we can see that we have created 19 groupings in our data and we can see for each group which rows belong to it.
Indeed if we take a look at the structure of our data using str
, which displays the structure of an R object, we will see that rather than our data being a data.frame (the standard data format in R), it is now a grouped_df (grouped data frame). From this we can see that we have created 19 groupings in our data and we can see for each group which rows belong to it.
So our first group (farmer 1) has 3 rows in the data, and they are rows 6, 10 and 38.
str(Plot_data)
## grouped_df [50 x 5] (S3: grouped_df/tbl_df/tbl/data.frame)
@@ -741,222 +204,68 @@ Group_by
## - attr(*, "groups")= tibble [19 x 2] (S3: tbl_df/tbl/data.frame)
## ..$ farmer_id: int [1:19] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ .rows : list<int> [1:19]
-## .. ..$ : int [1:3] 6 10 38
-## .. ..$ : int [1:4] 36 43 44 47
-## .. ..$ : int [1:6] 8 17 18 21 49 50
-## .. ..$ : int 35
-## .. ..$ : int 14
-## .. ..$ : int 12
-## .. ..$ : int [1:4] 4 19 30 40
-## .. ..$ : int 5
-## .. ..$ : int [1:3] 13 25 41
-## .. ..$ : int 1
-## .. ..$ : int [1:2] 9 48
-## .. ..$ : int [1:3] 3 23 39
-## .. ..$ : int 42
-## .. ..$ : int [1:3] 11 15 20
-## .. ..$ : int [1:2] 29 31
-## .. ..$ : int [1:3] 2 33 37
-## .. ..$ : int [1:4] 16 24 27 46
-## .. ..$ : int [1:2] 26 28
-## .. ..$ : int [1:5] 7 22 32 34 45
-## .. ..@ ptype: int(0)
-## ..- attr(*, ".drop")= logi TRUE
-If for any reason you want to then get rid of this grouping, perhaps at a later point you want to summarise across your whole data and not groups, or by a different grouping variable entirely.
-Then we can just add in ungroup
and our data will be ungrouped.
Later in this workbook we will also see how we can group by more than one variable.
-Now that we know how to group our variables, let’s look at how we can start aggregating our data.
-The simplest way to start creating data summaries is to use the summarise
function from the dplyr package.
This will generate summary statistics that you define for each of the groups in your data, or for your whole dataset if your data is not grouped.
-Grammatically, it works in a similar way to the mutate
function, we first provide a new name for our summary statistic. In this case we have decided to calculate the total area of all of the farmers plots.
On the other side of the equals sign, we right the calculation/function that we want to make.
-As we want a total, we use the function sum
to sum up all the area sizes. Note that we have added the argument na.rm = TRUE
this makes sure that any missing values are removed from the calculation. If this is not included and there is missing data, then our result would just read NA
and not give an actual number.
Sum1 <- Plot_data %>%
- group_by(farmer_id) %>%
- summarise(total_area = sum(size, na.rm = TRUE))
-
-Sum1
-## # A tibble: 19 x 2
-## farmer_id total_area
-## <int> <dbl>
-## 1 1 5.7
-## 2 2 10.4
-## 3 3 16
-## 4 4 3.7
-## 5 5 1.7
-## 6 6 1.3
-## 7 7 13.2
-## 8 8 3.5
-## 9 9 5.7
-## 10 10 3.1
-## 11 11 6.2
-## 12 12 7.9
-## 13 14 3.1
-## 14 15 9.1
-## 15 16 4
-## 16 17 10.5
-## 17 18 12
-## 18 19 5.8
-## 19 20 12.9
-When using summarise, the number of rows returned will be equal to the number of groups in your data. So we had 19 groups, so 19 rows. If our data was not grouped then we would have had only 1 row returned. Also when summarising data, the resulting data will only contain the variables used to group your data and the summary variables you have created. We drop all other variables.
-Our resulting data has moved data from the plot level and summarised it up to the farmer level.
-Just like mutate
we can start creating many different summary variables by sedating the calculations with a comma.
In this example, we have additionally created summaries for the average size of a farmer’s plots, the standard deviation of plot size and finally we have used n()
to generate a variable counting how many rows are in each group. Or in other words, how many plots each farmer has. Note that we have NA values for sd as a this has happened where the farmer has only 1 plot, therefore a standard deviation cannot be calculated.
Sum2 <- Plot_data %>%
- group_by(farmer_id) %>%
- summarise(total_area = sum(size, na.rm = TRUE),
- avg_area = mean(size, na.rm = TRUE),
- sd = sd(size, na.rm = TRUE),
- nplots = n())
-| farmer_id | total_area | avg_area | sd | nplots |
-|---|---|---|---|---|
-| 1 | 5.7 | 1.900000 | 1.4000000 | 3 |
-| 2 | 10.4 | 2.600000 | 1.3114877 | 4 |
-| 3 | 16.0 | 2.666667 | 1.2675436 | 6 |
-| 4 | 3.7 | 3.700000 | NA | 1 |
-| 5 | 1.7 | 1.700000 | NA | 1 |
-| 6 | 1.3 | 1.300000 | NA | 1 |
-| 7 | 13.2 | 3.300000 | 1.1661904 | 4 |
-| 8 | 3.5 | 3.500000 | NA | 1 |
-| 9 | 5.7 | 1.900000 | 0.5291503 | 3 |
-| 10 | 3.1 | 3.100000 | NA | 1 |
-| 11 | 6.2 | 3.100000 | 1.1313708 | 2 |
-| 12 | 7.9 | 2.633333 | 2.0132892 | 3 |
-| 14 | 3.1 | 3.100000 | NA | 1 |
-| 15 | 9.1 | 3.033333 | 1.5534907 | 3 |
-| 16 | 4.0 | 2.000000 | 0.9899495 | 2 |
-| 17 | 10.5 | 3.500000 | 1.4000000 | 3 |
-| 18 | 12.0 | 3.000000 | 1.6206994 | 4 |
-| 19 | 5.8 | 2.900000 | 1.9798990 | 2 |
-| 20 | 12.9 | 2.580000 | 1.8253767 | 5 |
+You may later want to get rid of this grouping, perhaps because you want to summarise across your whole data rather than by group, or by a different grouping variable entirely.
+In that case we can just add in ungroup
and our data will be ungrouped.
Plot_data <- Plot_data %>%
+ ungroup()
+Later in this workbook we will also see how we can group by more than one variable.
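The grouping and ungrouping steps above can be sketched on a small made-up table. This is only an illustration: `plots` is a hypothetical stand-in for the workbook's Plot_data, and the helper functions `is_grouped_df`, `n_groups` and `group_vars` are dplyr functions used here simply to inspect the grouping.

```r
library(dplyr)

# Hypothetical toy data, not the workbook's Plot_data
plots <- tibble(
  plot_id   = 1:5,
  farmer_id = c(1, 1, 2, 3, 3),
  size      = c(1.9, 3.3, 2.1, 3.9, 3.1)
)

# Grouping does not change the rows, only the metadata attached to the data
grouped <- plots %>% group_by(farmer_id)

is_grouped_df(grouped)  # is the data currently grouped?
n_groups(grouped)       # how many groups were created
group_vars(grouped)     # which variable(s) define the groups

# ungroup() removes the grouping metadata again
ungrouped <- grouped %>% ungroup()
is_grouped_df(ungrouped)
```

Checking `is_grouped_df` before a summary is a cheap way to confirm whether a leftover grouping is still attached to the data.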
+Now that we know how to group our variables, let’s look at how we can start aggregating our data.
+The simplest way to start creating data summaries is to use the summarise
function from the dplyr package.
This will generate summary statistics that you define for each of the groups in your data, or for your whole dataset if your data is not grouped.
+Grammatically, it works in a similar way to the mutate
function: we first provide a new name for our summary statistic. In this case we have decided to calculate the total area of each farmer’s plots.
On the other side of the equals sign, we write the calculation/function that we want to make.
+As we want a total, we use the function sum
to sum up all the area sizes. Note that we have added the argument na.rm = TRUE
, which makes sure that any missing values are removed from the calculation. If this is not included and there is missing data, then our result would just read NA
and not give an actual number.
Sum1 <- Plot_data %>%
+ group_by(farmer_id) %>%
+ summarise(total_area = sum(size, na.rm = TRUE))
+
+Sum1
+When using summarise, the number of rows returned will be equal to the number of groups in your data. We had 19 groups, so we get 19 rows. If our data was not grouped then we would have had only 1 row returned. Also when summarising data, the resulting data will only contain the variables used to group your data and the summary variables you have created. All other variables will be dropped.
+Our resulting data has moved data from the plot level and summarised it up to the farmer level.
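To make the effect of na.rm = TRUE and the "one row per group" behaviour concrete, here is a minimal sketch. The table `plots` and its values are invented for illustration and are not the workbook's data.

```r
library(dplyr)

# Hypothetical toy data with one missing plot size for farmer 2
plots <- tibble(
  farmer_id = c(1, 1, 2, 2),
  size      = c(1.5, 2.5, 3.0, NA)
)

totals <- plots %>%
  group_by(farmer_id) %>%
  summarise(
    total_default = sum(size),                # NA if any size in the group is missing
    total_na_rm   = sum(size, na.rm = TRUE)   # missing sizes are dropped before summing
  )

# Without grouping, summarise() collapses the whole table to a single row
overall <- plots %>%
  summarise(total_area = sum(size, na.rm = TRUE))

totals
overall
```

Here `totals` has one row per farmer, while `overall` has a single row for the whole table.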
+Just like mutate
we can start creating many different summary variables by separating the calculations with a comma.
In this example, we have additionally created summaries for the average size of a farmer’s plots, the standard deviation of plot size and finally we have used n()
to generate a variable counting how many rows are in each group, or in other words, how many plots each farmer has. Note that sd is NA wherever a farmer has only 1 plot, because a standard deviation cannot be calculated from a single value.
Sum2 <- Plot_data %>%
+ group_by(farmer_id) %>%
+ summarise(total_area = sum(size, na.rm = TRUE),
+ avg_area = mean(size, na.rm = TRUE),
+ sd = sd(size, na.rm = TRUE),
+ nplots = n())
+
+Sum2
+There are a number of other summarise functions that can be used to apply the same function to multiple columns rather than go through one by one. For more on this, please follow this link
You can also use mutate to generate these same variables, while keeping your data set at the plot level.
We keep everything within the function exactly the same.
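As a rough sketch of the difference, the same calculation can be run through both functions on a made-up table. The name `plots` and its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data: two farmers with two plots each
plots <- tibble(
  plot_id   = 1:4,
  farmer_id = c(1, 1, 2, 2),
  size      = c(1, 3, 2, 2)
)

# summarise() returns one row per farmer
farmer_level <- plots %>%
  group_by(farmer_id) %>%
  summarise(total_area = sum(size, na.rm = TRUE),
            nplots = n())

# mutate() keeps one row per plot and repeats the farmer total on each row
plot_level <- plots %>%
  group_by(farmer_id) %>%
  mutate(total_area = sum(size, na.rm = TRUE),
         nplots = n()) %>%
  ungroup()

nrow(farmer_level)
nrow(plot_level)
```

The trailing `ungroup()` is a design choice in this sketch, so that later steps do not silently inherit the grouping.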
@@ -969,287 +278,24 @@
-| plot_id | farmer_id | size | fertiliser_1 | fertiliser_2 | total_area | avg_area | sd | nplots |
-|---|---|---|---|---|---|---|---|---|
-| 6 | 1 | 1.9 | 4 | 1 | 5.7 | 1.900000 | 1.400000 | 3 |
-| 10 | 1 | 3.3 | 4 | 3 | 5.7 | 1.900000 | 1.400000 | 3 |
-| 38 | 1 | 0.5 | 4 | 2 | 5.7 | 1.900000 | 1.400000 | 3 |
-| 36 | 2 | 2.1 | 2 | 1 | 10.4 | 2.600000 | 1.311488 | 4 |
-| 43 | 2 | 2.3 | 4 | 1 | 10.4 | 2.600000 | 1.311488 | 4 |
-| 44 | 2 | 4.5 | 2 | 4 | 10.4 | 2.600000 | 1.311488 | 4 |
-| 47 | 2 | 1.5 | 3 | 2 | 10.4 | 2.600000 | 1.311488 | 4 |
-| 8 | 3 | 3.9 | 1 | 1 | 16.0 | 2.666667 | 1.267544 | 6 |
-| 17 | 3 | 3.1 | 1 | 1 | 16.0 | 2.666667 | 1.267544 | 6 |
-| 18 | 3 | 1.5 | 1 | 1 | 16.0 | 2.666667 | 1.267544 | 6 |
This could be useful for creating a new variable that is at one level, but is also dependent on a variable from a level higher up. For example, by keeping our data at plot level we could generate a new variable that is the proportion of the total area that each plot represents across each farmer.
Sum3 <- Sum3 %>%
- mutate(plot_area_prop = size/total_area)
-| plot_id | farmer_id | size | fertiliser_1 | fertiliser_2 | total_area | avg_area | sd | nplots | plot_area_prop |
-|---|---|---|---|---|---|---|---|---|---|
-| 6 | 1 | 1.9 | 4 | 1 | 5.7 | 1.900000 | 1.400000 | 3 | 0.3333333 |
-| 10 | 1 | 3.3 | 4 | 3 | 5.7 | 1.900000 | 1.400000 | 3 | 0.5789474 |
-| 38 | 1 | 0.5 | 4 | 2 | 5.7 | 1.900000 | 1.400000 | 3 | 0.0877193 |
-| 36 | 2 | 2.1 | 2 | 1 | 10.4 | 2.600000 | 1.311488 | 4 | 0.2019231 |
-| 43 | 2 | 2.3 | 4 | 1 | 10.4 | 2.600000 | 1.311488 | 4 | 0.2211538 |
-| 44 | 2 | 4.5 | 2 | 4 | 10.4 | 2.600000 | 1.311488 | 4 | 0.4326923 |
-| 47 | 2 | 1.5 | 3 | 2 | 10.4 | 2.600000 | 1.311488 | 4 | 0.1442308 |
-| 8 | 3 | 3.9 | 1 | 1 | 16.0 | 2.666667 | 1.267544 | 6 | 0.2437500 |
-| 17 | 3 | 3.1 | 1 | 1 | 16.0 | 2.666667 | 1.267544 | 6 | 0.1937500 |
-| 18 | 3 | 1.5 | 1 | 1 | 16.0 | 2.666667 | 1.267544 | 6 | 0.0937500 |
Row binding would be used when you have two or more datasets containing the same variables, the difference is that they contain separate sets of observations. For example, you could have data collected in one location and data collected in another but this data has been stored apart. They contain the same variables as the data collection tool was identical but different observations. You can bind these datasets together by their rows.
Let’s pretend that our farmer data was originally stored in two different data sets and now we want to combine them together
Farmer_dataA
-## farmer_id village_id name
-## 1 1 3 Saadiq
-## 2 2 3 Jelicha
-## 3 3 1 Briana
-## 4 4 5 Daeun
-## 5 5 5 Ruwaida
-## 6 6 2 Fawzi
-## 7 7 2 Yashna
-## 8 8 1 Tara
-## 9 9 5 Melina
-## 10 10 5 Jeelaan
-## 11 11 1 Razeena
-## 12 12 1 Sidqi
+First we have farmers 1 through 12
Farmer_dataB
-## farmer_id village_id name
-## 13 13 5 Eric
-## 14 14 5 Ryann
-## 15 15 2 Kingsley
-## 16 16 2 Rifqa
-## 17 17 1 Denise
-## 18 18 4 Suhaib
-## 19 19 1 Daniel
-## 20 20 4 Sara
+Then farmers 13 through 20.
It is thankfully very simple to bind these two datasets as we have the same number of columns in our data, and they have the same names.
We can use the base R function rbind
to simply achieve this binding. All we need to do is add the names of the data sets we want to bind.
rbind(Farmer_dataA, Farmer_dataB)
-## farmer_id village_id name
-## 1 1 3 Saadiq
-## 2 2 3 Jelicha
-## 3 3 1 Briana
-## 4 4 5 Daeun
-## 5 5 5 Ruwaida
-## 6 6 2 Fawzi
-## 7 7 2 Yashna
-## 8 8 1 Tara
-## 9 9 5 Melina
-## 10 10 5 Jeelaan
-## 11 11 1 Razeena
-## 12 12 1 Sidqi
-## 13 13 5 Eric
-## 14 14 5 Ryann
-## 15 15 2 Kingsley
-## 16 16 2 Rifqa
-## 17 17 1 Denise
-## 18 18 4 Suhaib
-## 19 19 1 Daniel
-## 20 20 4 Sara
+Now with rbind there is a little issue that occurs if there are variables in one dataset that aren’t in the other. Let’s add an age variable to our first dataset but not the second and see what happens when we try to merge them
-## farmer_id village_id name Age
-## 1 1 3 Saadiq 45
-## 2 2 3 Jelicha 63
-## 3 3 1 Briana 55
-## 4 4 5 Daeun 28
-## 5 5 5 Ruwaida 25
-## 6 6 2 Fawzi 40
-## 7 7 2 Yashna 37
-## 8 8 1 Tara 22
-## 9 9 5 Melina 19
-## 10 10 5 Jeelaan 58
-## 11 11 1 Razeena 29
-## 12 12 1 Sidqi 20
+rbind(Farmer_dataA, Farmer_dataB)
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
We get an error instead. This is because rbind
requires that there are exactly the same number of columns in both data sets.
To get around this issue, there is the bind_rows
function from dplyr. This matches the columns by their names (so make sure those are still the same in both datasets), and if any column is not present in both datasets, it will just be filled with NA in the data set where it is missing.
bind_rows(Farmer_dataA, Farmer_dataB)
-## farmer_id village_id name Age
-## 1 1 3 Saadiq 45
-## 2 2 3 Jelicha 63
-## 3 3 1 Briana 55
-## 4 4 5 Daeun 28
-## 5 5 5 Ruwaida 25
-## 6 6 2 Fawzi 40
-## 7 7 2 Yashna 37
-## 8 8 1 Tara 22
-## 9 9 5 Melina 19
-## 10 10 5 Jeelaan 58
-## 11 11 1 Razeena 29
-## 12 12 1 Sidqi 20
-## 13 13 5 Eric NA
-## 14 14 5 Ryann NA
-## 15 15 2 Kingsley NA
-## 16 16 2 Rifqa NA
-## 17 17 1 Denise NA
-## 18 18 4 Suhaib NA
-## 19 19 1 Daniel NA
-## 20 20 4 Sara NA
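The contrast between rbind and bind_rows described above can be sketched as follows. The data frames `farmers_a` and `farmers_b` are hypothetical stand-ins for Farmer_dataA and Farmer_dataB, cut down to two rows each, and the `tryCatch` wrapper is only there so the example keeps running after the error.

```r
library(dplyr)

# Hypothetical stand-ins: only the first data set has an Age column
farmers_a <- data.frame(farmer_id = 1:2,
                        name = c("Saadiq", "Jelicha"),
                        Age = c(45, 63))
farmers_b <- data.frame(farmer_id = 13:14,
                        name = c("Eric", "Ryann"))

# rbind() stops with an error because the numbers of columns differ;
# the error is caught here and replaced with a short label
rbind_result <- tryCatch(
  rbind(farmers_a, farmers_b),
  error = function(e) "rbind failed"
)

# bind_rows() matches columns by name and fills the missing Age values with NA
combined <- bind_rows(farmers_a, farmers_b)

rbind_result
combined
```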
+Column binding on the other hand would be utilised when we have the same set of observations, but the variables are different. Perhaps we have taken additional measurements at a later point and want to bring this together with the original data.
For example, let’s look to our plot data and bring in data on the yield of the crops grown on each plot, which we collected at a later point.
Plot_data2
-## plot_id yield_kg_ha
-## 1 1 20
-## 2 2 400
-## 3 3 0
-## 4 4 320
-## 5 5 240
-## 6 6 400
-## 7 7 240
-## 8 8 500
-## 9 9 140
-## 10 10 20
-## 11 11 160
-## 12 12 480
-## 13 13 140
-## 14 14 460
-## 15 15 480
-## 16 16 380
-## 17 17 260
-## 18 18 500
-## 19 19 380
-## 20 20 80
-## 21 21 180
-## 22 22 220
-## 23 23 380
-## 24 24 440
-## 25 25 460
-## 26 26 280
-## 27 27 20
-## 28 28 0
-## 29 29 80
-## 30 30 500
-## 31 31 480
-## 32 32 80
-## 33 33 140
-## 34 34 240
-## 35 35 480
-## 36 36 60
-## 37 37 180
-## 38 38 100
-## 39 39 340
-## 40 40 80
-## 41 41 380
-## 42 42 0
-## 43 43 100
-## 44 44 260
-## 45 45 240
-## 46 46 140
-## 47 47 320
-## 48 48 360
-## 49 49 300
-## 50 50 280
+For column binding, as you may expect, the base R function is cbind
and the dplyr alternative is bind_cols
. Unlike with row binding, there is not really any difference between how the functions operate.
Both require the same number of rows in order to operate. The rows should also be in the same order; if they are not, you could use arrange
first to make sure that they are.
cbind(Plot_data, Plot_data2)
-## plot_id farmer_id size fertiliser_1 fertiliser_2 plot_id yield_kg_ha
-## 1 1 10 3.1 4 2 1 20
-## 2 2 17 4.1 2 1 2 400
-## 3 3 12 2.9 2 3 3 0
-## 4 4 7 2.5 4 3 4 320
-## 5 5 8 3.5 2 1 5 240
-## 6 6 1 1.9 4 1 6 400
-## 7 7 20 2.1 1 1 7 240
-## 8 8 3 3.9 1 1 8 500
-## 9 9 11 2.3 4 2 9 140
-## 10 10 1 3.3 4 3 10 20
-## 11 11 15 4.3 3 1 11 160
-## 12 12 6 1.3 4 2 12 480
-## 13 13 9 1.3 4 3 13 140
-## 14 14 5 1.7 1 1 14 460
-## 15 15 15 1.3 2 1 15 480
-## 16 16 18 2.7 4 1 16 380
-## 17 17 3 3.1 1 1 17 260
-## 18 18 3 1.5 1 1 18 500
-## 19 19 7 4.3 3 1 19 380
-## 20 20 15 3.5 1 1 20 80
-## 21 21 3 3.3 3 1 21 180
-## 22 22 20 0.7 1 1 22 220
-## 23 23 12 0.5 2 3 23 380
-## 24 24 18 0.9 2 1 24 440
-## 25 25 9 2.1 2 3 25 460
-## 26 26 19 1.5 4 2 26 280
-## 27 27 18 3.7 3 1 27 20
-## 28 28 19 4.3 1 1 28 0
-## 29 29 16 1.3 3 1 29 80
-## 30 30 7 2.1 1 1 30 500
-## 31 31 16 2.7 1 1 31 480
-## 32 32 20 4.5 1 1 32 80
-## 33 33 17 1.9 3 4 33 140
-## 34 34 20 1.1 1 1 34 240
-## 35 35 4 3.7 3 2 35 480
-## 36 36 2 2.1 2 1 36 60
-## 37 37 17 4.5 1 1 37 180
-## 38 38 1 0.5 4 2 38 100
-## 39 39 12 4.5 4 1 39 340
-## 40 40 7 4.3 3 4 40 80
-## 41 41 9 2.3 3 4 41 380
-## 42 42 14 3.1 4 2 42 0
-## 43 43 2 2.3 4 1 43 100
-## 44 44 2 4.5 2 4 44 260
-## 45 45 20 4.5 4 1 45 240
-## 46 46 18 4.7 1 1 46 140
-## 47 47 2 1.5 3 2 47 320
-## 48 48 11 3.9 2 4 48 360
-## 49 49 3 3.5 4 3 49 300
-## 50 50 3 0.7 4 3 50 280
+bind_cols(Plot_data, Plot_data2)
-## # A tibble: 50 x 7
-## plot_id...1 farmer_id size fertiliser_1 fertiliser_2 plot_id...6 yield_kg_ha
-## <int> <int> <dbl> <int> <dbl> <int> <dbl>
-## 1 1 10 3.1 4 2 1 20
-## 2 2 17 4.1 2 1 2 400
-## 3 3 12 2.9 2 3 3 0
-## 4 4 7 2.5 4 3 4 320
-## 5 5 8 3.5 2 1 5 240
-## 6 6 1 1.9 4 1 6 400
-## 7 7 20 2.1 1 1 7 240
-## 8 8 3 3.9 1 1 8 500
-## 9 9 11 2.3 4 2 9 140
-## 10 10 1 3.3 4 3 10 20
-## # ... with 40 more rows
+Now you will notice that because both datasets contained plot_id
, our resulting data table has two id columns, unhelpfully named “plot_id...1” and “plot_id...6”. This is because column binding will not merge information from columns that have the same name; rather, it will just change the names.
Therefore, if we were to use column binding it would be a good idea to drop the plot_id from one dataset and then perform the bind.
We can use select
and then put a -
before the name of the variable to remove it from the data
Though of course this stresses that these ids MUST be identical in both data sets: the same numbers, with those numbers meaning the same thing.
Plot_data3 <- Plot_data2 %>%
select(-plot_id)
Now when we bind the data sets, we keep plot_id as normal
Plot_data <- bind_cols(Plot_data, Plot_data3)
Plot_data
-## # A tibble: 50 x 6
-## plot_id farmer_id size fertiliser_1 fertiliser_2 yield_kg_ha
-## <int> <int> <dbl> <int> <dbl> <dbl>
-## 1 1 10 3.1 4 2 20
-## 2 2 17 4.1 2 1 400
-## 3 3 12 2.9 2 3 0
-## 4 4 7 2.5 4 3 320
-## 5 5 8 3.5 2 1 240
-## 6 6 1 1.9 4 1 400
-## 7 7 20 2.1 1 1 240
-## 8 8 3 3.9 1 1 500
-## 9 9 11 2.3 4 2 140
-## 10 10 1 3.3 4 3 20
-## # ... with 40 more rows
+This binding may have been more smoothly achieved if we had actually done a join instead as we shall see later in the workbook.
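The column-binding workflow in this section can be summarised in a small sketch. The tables `plots` and `yields` are made-up stand-ins for Plot_data and Plot_data2, and the example assumes both are already sorted by plot_id.

```r
library(dplyr)

# Hypothetical toy versions of the two plot-level tables, in the same row order
plots  <- tibble(plot_id = 1:3, size = c(3.1, 4.1, 2.9))
yields <- tibble(plot_id = 1:3, yield_kg_ha = c(20, 400, 0))

# Binding as-is keeps both id columns, under automatically repaired names
both_ids <- bind_cols(plots, yields)
names(both_ids)

# Dropping the duplicate id first gives a single plot_id column.
# This is only safe because the rows of both tables line up one-to-one.
clean <- bind_cols(plots, yields %>% select(-plot_id))
names(clean)
```

Because bind_cols pairs rows purely by position, sorting both tables with arrange beforehand is a sensible precaution in this kind of sketch.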
Joining is the more traditional form of data merging: we are bringing together data from multiple related data tables, and these data tables do not contain the same levels of information.
Recall from the session that a full join will keep the rows from both data sets, regardless of whether or not there is a match.
Let’s again look at merging our two pieces of plot data but rather than having yield data for all of our plots, we have it for 40 out of 50 of them.
+Plot_data
+Plot_data2
-## plot_id yield_kg_ha
-## 1 1 20
-## 2 2 400
-## 3 3 0
-## 4 4 320
-## 5 5 240
-## 7 7 240
-## 9 9 140
-## 10 10 20
-## 11 11 160
-## 13 13 140
-## 14 14 460
-## 16 16 380
-## 18 18 500
-## 19 19 380
-## 20 20 80
-## 22 22 220
-## 23 23 380
-## 24 24 440
-## 25 25 460
-## 26 26 280
-## 27 27 20
-## 28 28 0
-## 30 30 500
-## 31 31 480
-## 33 33 140
-## 34 34 240
-## 35 35 480
-## 36 36 60
-## 37 37 180
-## 38 38 100
-## 39 39 340
-## 40 40 80
-## 41 41 380
-## 42 42 0
-## 43 43 100
-## 45 45 240
-## 46 46 140
-## 47 47 320
-## 49 49 300
-## 50 50 280
+From the dplyr package we would want to use the full_join
function.
We first specify the two datasets we want to merge.
Then we use the by =
argument to tell R which variables we are using to merge this data. In this case we want to use the primary key of these two data sets, plot_id. Remember that a primary key is the key that uniquely identifies each row in your data. With our plot level data sets, each row is uniquely identified by that plot identification number.
full_join(Plot_data, Plot_data2, by = "plot_id")
-## # A tibble: 50 x 6
-## plot_id farmer_id size fertiliser_1 fertiliser_2 yield_kg_ha
-## <int> <int> <dbl> <int> <dbl> <dbl>
-## 1 1 10 3.1 4 2 20
-## 2 2 17 4.1 2 1 400
-## 3 3 12 2.9 2 3 0
-## 4 4 7 2.5 4 3 320
-## 5 5 8 3.5 2 1 240
-## 6 6 1 1.9 4 1 NA
-## 7 7 20 2.1 1 1 240
-## 8 8 3 3.9 1 1 NA
-## 9 9 11 2.3 4 2 140
-## 10 10 1 3.3 4 3 20
-## # ... with 40 more rows
+We can see how this was quicker and simpler than trying to achieve the same result with column binding.
Note that as both pieces of data are at the plot level, we had a 1 to 1 relationship between the data sets. For every plot in table 1 there will be no more than 1 plot in table 2.
Now notice that there are some NA values for yield. This is because there was not a match between the two data sets. We did not have yield for plots 6, 8, etc. But because we used a full join, these rows are kept and not excluded. All data remains intact and we still have 50 rows.
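A minimal sketch of this behaviour, using invented tables rather than the workbook's data: `plots` has four plots, while `yields` only has yield values for three of them.

```r
library(dplyr)

# Hypothetical toy data: yield is missing for plot 3
plots  <- tibble(plot_id = 1:4, size = c(3.1, 4.1, 2.9, 2.5))
yields <- tibble(plot_id = c(1L, 2L, 4L), yield_kg_ha = c(20, 400, 320))

# A full join keeps every plot from both tables and fills the gaps with NA
joined <- full_join(plots, yields, by = "plot_id")

joined
```

All four plots survive the join, and the unmatched plot simply carries NA for yield_kg_ha.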
-If we had an extra plot in our Plot_data2, that was not in our original plot. Then this plot would have also been kept in the data.
-See we now have added a 51st plot with a yield of 320kg/ha
+Let’s try this again with an additional plot in Plot_data2 that does not exist in Plot_data.
+As we know it is our 51st plot, let’s take a look at the details of this plot only. We can see that it has a yield of 320 kg/ha. Note that it is row 41, not 51, because our Plot_data2 was missing 10 rows.
Plot_data2[41,]
-## plot_id yield_kg_ha
-## 41.1 51 320
+When joining this data now, plot 51 is kept in the data, but all of its other variables, including farmer_id, are missing. This is not always particularly helpful.
Plot_data4 <- full_join(Plot_data, Plot_data2, by = "plot_id")
Plot_data4[51,]
-## # A tibble: 1 x 6
-## plot_id farmer_id size fertiliser_1 fertiliser_2 yield_kg_ha
-## <dbl> <int> <dbl> <int> <dbl> <dbl>
-## 1 51 NA NA NA NA 320
-I have tended to find full join most useful when using it in a similar fashion to r bind, where i have similar datasets that have many of the same variables but additional observations and perhaps even additional variables for existing observations.
-For instance, let’s have another look at merging two instances of our farmer data. But make a slight adjustment so some of the rows are shared but there is again another variable, household size.
+I tend to find full join most useful when using it in a similar fashion to rbind, where I have similar datasets with many of the same variables but additional observations and perhaps even additional variables for existing observations.
+Let’s have another look at merging two instances of our farmer data. This time it’s slightly different: only some of the rows are shared between the two datasets and one dataset has an additional variable, household size.
Farmer_dataC
-## farmer_id village_id name
-## 1 1 3 Saadiq
-## 2 2 3 Jelicha
-## 3 3 1 Briana
-## 4 4 5 Daeun
-## 5 5 5 Ruwaida
-## 6 6 2 Fawzi
-## 7 7 2 Yashna
-## 8 8 1 Tara
-## 9 9 5 Melina
-## 10 10 5 Jeelaan
-## 11 11 1 Razeena
-## 12 12 1 Sidqi
+Farmer_dataD
-## farmer_id village_id name hhsize
-## 9 9 5 Melina 5
-## 10 10 5 Jeelaan 10
-## 11 11 1 Razeena 7
-## 12 12 1 Sidqi 4
-## 13 13 5 Eric 10
-## 14 14 5 Ryann 8
-## 15 15 2 Kingsley 8
-## 16 16 2 Rifqa 4
-## 17 17 1 Denise 10
-## 18 18 4 Suhaib 7
-## 19 19 1 Daniel 8
-## 20 20 4 Sara 8
-## 13.1 21 2 Sam 8
-## 14.1 22 5 Dave 5
+So in our second data set we have 2 additional farmers Sam and Dave, and then also additional household size information for farmers 9 to 12.
So let’s join these together and see what happens
full_join(Farmer_dataC, Farmer_dataD, by = "farmer_id")
-## farmer_id village_id.x name.x village_id.y name.y hhsize
-## 1 1 3 Saadiq NA <NA> NA
-## 2 2 3 Jelicha NA <NA> NA
-## 3 3 1 Briana NA <NA> NA
-## 4 4 5 Daeun NA <NA> NA
-## 5 5 5 Ruwaida NA <NA> NA
-## 6 6 2 Fawzi NA <NA> NA
-## 7 7 2 Yashna NA <NA> NA
-## 8 8 1 Tara NA <NA> NA
-## 9 9 5 Melina 5 Melina 5
-## 10 10 5 Jeelaan 5 Jeelaan 10
-## 11 11 1 Razeena 1 Razeena 7
-## 12 12 1 Sidqi 1 Sidqi 4
-## 13 13 NA <NA> 5 Eric 10
-## 14 14 NA <NA> 5 Ryann 8
-## 15 15 NA <NA> 2 Kingsley 8
-## 16 16 NA <NA> 2 Rifqa 4
-## 17 17 NA <NA> 1 Denise 10
-## 18 18 NA <NA> 4 Suhaib 7
-## 19 19 NA <NA> 1 Daniel 8
-## 20 20 NA <NA> 4 Sara 8
-## 21 21 NA <NA> 2 Sam 8
-## 22 22 NA <NA> 5 Dave 5
+Now we have 22 rows as we would expect, one for each farmer. But something has gone wrong with our village and name variables. This is because we only told R to match on farmer_id. While this logically makes sense, it means that R will treat other columns with the same name as if they were not identical, when in our case they are. As a result we get columns like name.x and name.y.
In order to avoid this we also need to include any common columns between our two data sets by listing them.
full_join(Farmer_dataC, Farmer_dataD, by = c("farmer_id", "village_id", "name"))
-## farmer_id village_id name hhsize
-## 1 1 3 Saadiq NA
-## 2 2 3 Jelicha NA
-## 3 3 1 Briana NA
-## 4 4 5 Daeun NA
-## 5 5 5 Ruwaida NA
-## 6 6 2 Fawzi NA
-## 7 7 2 Yashna NA
-## 8 8 1 Tara NA
-## 9 9 5 Melina 5
-## 10 10 5 Jeelaan 10
-## 11 11 1 Razeena 7
-## 12 12 1 Sidqi 4
-## 13 13 5 Eric 10
-## 14 14 5 Ryann 8
-## 15 15 2 Kingsley 8
-## 16 16 2 Rifqa 4
-## 17 17 1 Denise 10
-## 18 18 4 Suhaib 7
-## 19 19 1 Daniel 8
-## 20 20 4 Sara 8
-## 21 21 2 Sam 8
-## 22 22 5 Dave 5
+This seems to solve the problem and we have kept information from both data tables even where we were missing household size information.
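To see the effect of the two `by` specifications on a small scale, here is a minimal sketch. The tables `farmers_a` and `farmers_b` are made up for illustration and are not part of the tutorial data; the only assumption is that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only
farmers_a <- tibble(farmer_id = 1:3,
                    name = c("Ana", "Ben", "Cal"))
farmers_b <- tibble(farmer_id = 2:4,
                    name = c("Ben", "Cal", "Dee"),
                    hhsize = c(4, 6, 5))

# Matching on farmer_id only: the shared `name` column is duplicated
by_id <- full_join(farmers_a, farmers_b, by = "farmer_id")
names(by_id)

# Matching on every shared column: a single `name` column is kept
by_all <- full_join(farmers_a, farmers_b, by = c("farmer_id", "name"))
names(by_all)
```

In the first result the shared column is split into `name.x` and `name.y`; in the second it stays as one `name` column, with `NA` for household size where the second table has no matching row.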
If our variables were called something different then we can handle this as well. For example, say that in our previous example plot_id was instead called plot_name in the second data set. Then we could have written the following instead, with the column name from the first data set on the left and the column name from the second data set on the right.
full_join(Plot_data, Plot_data2, by = c("plot_id" = "plot_name"))
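As a small, self-contained illustration of a named `by` vector, the sketch below uses two invented tables (`plots_main` and `plots_yield` are hypothetical names, not objects from this workbook).

```r
library(dplyr)

# Hypothetical toy tables; plot_name plays the role of plot_id in the second one
plots_main  <- tibble(plot_id = 1:3, size = c(3.1, 4.1, 2.9))
plots_yield <- tibble(plot_name = 2:4, yield = c(400, 0, 320))

# Left-hand name comes from the first table, right-hand name from the second
joined <- full_join(plots_main, plots_yield, by = c("plot_id" = "plot_name"))
joined
```

The joined table keeps the key under the name used in the first table, so the result has the columns `plot_id`, `size` and `yield`.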
@@ -1678,57 +474,30 @@ This is regardless of whether or not there is a match.
So if we were to join the two sets of plot data that we had on the previous page, we could use a left join to stop us from adding in that 51st plot for which we have very little data.
left_join(Plot_data, Plot_data2, by = "plot_id")
-## # A tibble: 50 x 6
-## plot_id farmer_id size fertiliser_1 fertiliser_2 yield_kg_ha
-## <dbl> <int> <dbl> <int> <dbl> <dbl>
-## 1 1 10 3.1 4 2 20
-## 2 2 17 4.1 2 1 400
-## 3 3 12 2.9 2 3 0
-## 4 4 7 2.5 4 3 320
-## 5 5 8 3.5 2 1 240
-## 6 6 1 1.9 4 1 NA
-## 7 7 20 2.1 1 1 240
-## 8 8 3 3.9 1 1 NA
-## 9 9 11 2.3 4 2 140
-## 10 10 1 3.3 4 3 20
-## # ... with 40 more rows
+We have kept all 50 rows from the first set of plot data, including those without any yield data in the second set. All while removing that 51st plot from our data.
If we were to instead keep the arguments the same but change this to a right join, we would see 41 rows: the 40 plots for which we have both yield information and the other variables, plus the one plot for which we only have yield data. This therefore removes the 10 plots where we have the original information but no yield.
right_join(Plot_data, Plot_data2, by = "plot_id")
-## # A tibble: 41 x 6
-## plot_id farmer_id size fertiliser_1 fertiliser_2 yield_kg_ha
-## <dbl> <int> <dbl> <int> <dbl> <dbl>
-## 1 1 10 3.1 4 2 20
-## 2 2 17 4.1 2 1 400
-## 3 3 12 2.9 2 3 0
-## 4 4 7 2.5 4 3 320
-## 5 5 8 3.5 2 1 240
-## 6 7 20 2.1 1 1 240
-## 7 9 11 2.3 4 2 140
-## 8 10 1 3.3 4 3 20
-## 9 11 15 4.3 3 1 160
-## 10 13 9 1.3 4 3 140
-## # ... with 31 more rows
+An inner join trims this down further still and will only keep the rows where there is a corresponding match between each data set.
So in our plot example we would drop both the plots which have no yield data, and the 51st plot for which we have no other information.
inner_join(Plot_data, Plot_data2, by = "plot_id")
-## # A tibble: 40 x 6
-## plot_id farmer_id size fertiliser_1 fertiliser_2 yield_kg_ha
-## <dbl> <int> <dbl> <int> <dbl> <dbl>
-## 1 1 10 3.1 4 2 20
-## 2 2 17 4.1 2 1 400
-## 3 3 12 2.9 2 3 0
-## 4 4 7 2.5 4 3 320
-## 5 5 8 3.5 2 1 240
-## 6 7 20 2.1 1 1 240
-## 7 9 11 2.3 4 2 140
-## 8 10 1 3.3 4 3 20
-## 9 11 15 4.3 3 1 160
-## 10 13 9 1.3 4 3 140
-## # ... with 30 more rows
+As a result we have 40 rows, because there were 40 plots where information was available from both datasets.
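The row counts for the different joins can be compared side by side. This is only a sketch on made-up data: `details` and `yields` are hypothetical tables, much smaller than the plot data used in this workbook.

```r
library(dplyr)

# Hypothetical toy tables: plots 1-5 have plot details, plots 3-6 have yield
details <- tibble(plot_id = 1:5, size = c(3.1, 4.1, 2.9, 2.5, 3.5))
yields  <- tibble(plot_id = 3:6, yield_kg_ha = c(0, 320, 240, 180))

# Number of rows kept by each type of join
rows_kept <- c(
  full  = nrow(full_join(details, yields, by = "plot_id")),
  left  = nrow(left_join(details, yields, by = "plot_id")),
  right = nrow(right_join(details, yields, by = "plot_id")),
  inner = nrow(inner_join(details, yields, by = "plot_id"))
)
rows_kept
```

Here the full join keeps every plot that appears in either table, the left and right joins keep every plot from the first or second table respectively, and the inner join keeps only the plots found in both.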
First let’s bring some data down a level. For instance let’s say we want to analyse some data at the plot level but we want to group this data by village to generate a village level average plot size.
But we do not know which farmer lives in which village just by looking at this plot level data.
-But we do have farmer_id which can link us up with the farmer data. A data set which does include the village id as a variable.
+But we do have farmer_id which can link us up with the farmer data. A data set which does include the farmer data, which includes the village id as a variable.
So we need to merge information from the farmer data and the plot data using this link. Here our foreign key is farmer_id as it does not uniquely identify our rows in the plot data. But it is identifying a particular unit of observation, the farmer. This foreign key (farmer_id) links up to the primary key of the farmer data (farmer_id).
This is an example of a 1 to Many relationship as for each individual farmer there can be many plots.
Now you may have noticed that we have 20 farmers in our data, but when we have grouped our plots by farmer, we had only 19 groups. That is because we have one farmer without any plots.
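One way to check for units with no matching rows is an anti join, which keeps the rows of the first table that have no match in the second. The sketch below uses invented tables (`farmers` and `plots` are hypothetical) rather than the workbook's data.

```r
library(dplyr)

# Hypothetical toy tables: farmer 3 has no plots
farmers <- tibble(farmer_id = 1:3, name = c("Ana", "Ben", "Cal"))
plots   <- tibble(plot_id = 1:4, farmer_id = c(1L, 1L, 2L, 2L))

# Farmers with no matching row in the plot table
no_plots <- anti_join(farmers, plots, by = "farmer_id")
no_plots

# Number of farmer groups actually present in the plot table
n_groups <- n_distinct(plots$farmer_id)
n_groups
```

The plot table only contains two distinct farmers even though the farmer table has three, which is the same kind of mismatch described above.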
@@ -2011,14 +780,11 @@Plot_data %>%
group_by(village_name) %>%
summarise(avg_plot_size = mean(size, na.rm = TRUE))
-## # A tibble: 5 x 2
-## village_name avg_plot_size
-## <chr> <dbl>
-## 1 Balamory 2.47
-## 2 Langley Falls 2.76
-## 3 New New York 2.3
-## 4 Pawnee 2.77
-## 5 Springfield 2.94
+## # A tibble: 100 x 5
-## plot_id farmer_id size Fertiliser_no fertiliser_id
-## <int> <int> <dbl> <chr> <dbl>
-## 1 1 10 3.1 fertiliser_1 4
-## 2 1 10 3.1 fertiliser_2 2
-## 3 2 17 4.1 fertiliser_1 2
-## 4 2 17 4.1 fertiliser_2 1
-## 5 3 12 2.9 fertiliser_1 2
-## 6 3 12 2.9 fertiliser_2 3
-## 7 4 7 2.5 fertiliser_1 4
-## 8 4 7 2.5 fertiliser_2 3
-## 9 5 8 3.5 fertiliser_1 2
-## 10 5 8 3.5 fertiliser_2 1
-## # ... with 90 more rows
+Now we can bring in the fertiliser data, again using a left join, as we want to prioritise the plot information.
Long_plots <- Long_plots %>%
left_join(Fertiliser_data, by = "fertiliser_id")
Long_plots
-## # A tibble: 100 x 7
-## plot_id farmer_id size Fertiliser_no fertiliser_id name price_per_ha
-## <int> <int> <dbl> <chr> <dbl> <chr> <dbl>
-## 1 1 10 3.1 fertiliser_1 4 Chemical 25
-## 2 1 10 3.1 fertiliser_2 2 Compost 12
-## 3 2 17 4.1 fertiliser_1 2 Compost 12
-## 4 2 17 4.1 fertiliser_2 1 None 0
-## 5 3 12 2.9 fertiliser_1 2 Compost 12
-## 6 3 12 2.9 fertiliser_2 3 Manure 8
-## 7 4 7 2.5 fertiliser_1 4 Chemical 25
-## 8 4 7 2.5 fertiliser_2 3 Manure 8
-## 9 5 8 3.5 fertiliser_1 2 Compost 12
-## 10 5 8 3.5 fertiliser_2 1 None 0
-## # ... with 90 more rows
+Okay so we now have our price information in the data. However there is one further step we need to take before we can go further up the levels of this data. Price is given per hectare, so to get total expenditure we need to multiply this by the size of the plot. We can do this using mutate.
Long_plots <- Long_plots %>%
mutate(price = price_per_ha * size)
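As a worked check of this calculation, the sketch below uses the two rows for plot 1 from the long plot table (size 3.1, with fertilisers priced at 25 and 12 per hectare). The object names `long_example` and `plot_total` are hypothetical, and summing the two prices to get a plot total is an assumption about the next step, although it appears consistent with the `price_total` of roughly 115 reported for plot 1.

```r
library(dplyr)

# Two rows for plot 1, with values taken from the tutorial's long plot table
long_example <- tibble(plot_id = c(1L, 1L),
                       size = c(3.1, 3.1),
                       price_per_ha = c(25, 12))

plot_total <- long_example %>%
  mutate(price = price_per_ha * size) %>%   # cost of each fertiliser on this plot
  group_by(plot_id) %>%
  summarise(price_total = sum(price))       # assumed: total cost is the sum
plot_total
```

The two fertiliser costs are 3.1 × 25 = 77.5 and 3.1 × 12 = 37.2, giving 114.7 for the plot.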
@@ -2083,20 +831,11 @@ ## # A tibble: 50 x 6
-## plot_id price_total farmer_id size fertiliser_1 fertiliser_2
-## <int> <dbl> <int> <dbl> <int> <dbl>
-## 1 1 115. 10 3.1 4 2
-## 2 2 49.2 17 4.1 2 1
-## 3 3 58 12 2.9 2 3
-## 4 4 82.5 7 2.5 4 3
-## 5 5 42 8 3.5 2 1
-## 6 6 47.5 1 1.9 4 1
-## 7 7 0 20 2.1 1 1
-## 8 8 0 3 3.9 1 1
-## 9 9 85.1 11 2.3 4 2
-## 10 10 109. 1 3.3 4 3
-## # ... with 40 more rows
+We now simply repeat the process to bring this up to farmer level, switching out our data arguments and changing the keys to those that link the relevant data sets.
Farmer_data <- Plot_data %>%
group_by(farmer_id) %>%
@@ -2104,29 +843,11 @@ Example 2
right_join(Farmer_data, by = "farmer_id")
Farmer_data
-## # A tibble: 20 x 4
-## farmer_id price_total village_id name
-## <int> <dbl> <int> <chr>
-## 1 1 175. 3 Saadiq
-## 2 2 279. 3 Jelicha
-## 3 3 165 1 Briana
-## 4 4 74 5 Daeun
-## 5 5 0 5 Ruwaida
-## 6 6 48.1 2 Fawzi
-## 7 7 259. 2 Yashna
-## 8 8 42 1 Tara
-## 9 9 161. 5 Melina
-## 10 10 115. 5 Jeelaan
-## 11 11 229. 1 Razeena
-## 12 12 180. 1 Sidqi
-## 13 14 115. 5 Ryann
-## 14 15 50 2 Kingsley
-## 15 16 10.4 2 Rifqa
-## 16 17 112. 1 Denise
-## 17 18 108. 4 Suhaib
-## 18 19 55.5 1 Daniel
-## 19 20 112. 4 Sara
-## 20 13 NA 5 Eric
+Finally, we repeat this one more time to bring it all up to the village level.
Village_data <- Farmer_data %>%
group_by(village_id) %>%
@@ -2134,14 +855,11 @@ Example 2
right_join(Village_data, by = "village_id")
Village_data
-## # A tibble: 5 x 5
-## village_id price_average village_name region_name population
-## <int> <dbl> <chr> <chr> <int>
-## 1 1 131. Springfield A 34387
-## 2 2 91.8 Langley Falls A 462736
-## 3 3 227. New New York A 134412
-## 4 4 110. Pawnee B 446522
-## 5 5 92.8 Balamory B 341729
+If we wanted to get fancy this could have all been done in the one pipe.
Note that as we start with Plot_data we do not need to join on to it like we did before, and because plot data already has the farmer_id we can go straight to summarising the data at that level first before moving it upwards to village.
Plot_data %>%
@@ -2158,14 +876,11 @@ Example 2
group_by(village_id) %>%
summarise(price_average = mean(price_total, na.rm = TRUE)) %>%
right_join(Village_data, by = "village_id")
-## # A tibble: 5 x 5
-## village_id price_average village_name region_name population
-## <int> <dbl> <chr> <chr> <int>
-## 1 1 131. Springfield A 34387
-## 2 2 91.8 Langley Falls A 462736
-## 3 3 227. New New York A 134412
-## 4 4 110. Pawnee B 446522
-## 5 5 92.8 Balamory B 341729
+