-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.Rmd
684 lines (450 loc) · 28.8 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
---
title: "Relational Data in R"
output:
learnr::tutorial:
progressive: true
allow_skip: true
df_print: paged
runtime: shiny_prerendered
description: >
Learn about the basics of summarising your data and merging different data sets together
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
```{r, echo = FALSE}
library(tidyverse)
library(randomNames)
```
```{r, echo = FALSE}
set.seed(1)
Village_data <-
data.frame(
village_id = 1:5,
village_name = c("Springfield","Langley Falls", "New New York", "Pawnee", "Balamory"),
region_name = c("A","A","A","B", "B"),
population = sample(10000:500000,5)
)
Farmer_data <-
data.frame(
farmer_id = 1:20,
village_id = sample(1:5,20, replace = TRUE),
name = randomNames(20, which.names = "first")
)
Plot_data <-
data.frame(
plot_id = 1:50,
farmer_id = sample(c(1:12,14:20),50, replace = TRUE),
size = sample(seq(from = 0.5, to = 5,by=0.2), 50, replace = TRUE),
fertiliser_1 = sample(1:4, 50, replace = TRUE)
)
for(i in 1:50){
Plot_data$fertiliser_2[i] <- case_when(
Plot_data$fertiliser_1[i] == 1 ~ 1,
Plot_data$fertiliser_1[i] == 2 ~ sample(c(1,3,4),1),
Plot_data$fertiliser_1[i] == 3 ~ sample(c(1,2,4),1),
Plot_data$fertiliser_1[i] == 4 ~ sample(c(1,3,2),1)
)
}
Plot_data2 <-
data.frame(
plot_id = 1:50,
yield_kg_ha = sample(seq(from = 0, to = 500, by = 20), 50, replace = TRUE)
)
Fertiliser_data <-
data.frame(
fertiliser_id = 1:4,
name = c("None", "Compost", "Manure", "Chemical"),
price_per_ha = c(0,12,8,25)
)
```
## Overview
In the live sessions we discussed relational data, what it is and how can we link together its components.
Relational data is a collection of data sets that can all be linked together through various relationships.
These relationships are defined by keys.
* Primary keys uniquely identify each row of a given data set.
* Foreign keys are variables in a dataset which identify a particular unit, these will not be unique to each row of that table. Instead they can be used to link to other tables. This includes a table where that variable will be its Primary key.
Because of these relationships we are able to easily merge data from different tables together. These relationships will take one of three forms.
* 1-1 relationships. Each row in table A can only link to one row in table B. They are using the same primary key
* 1-Many relationships. Each row in table A may link to one or more rows in table B, but each row in table B can only link to one in table A. This is where foreign keys will come into play. Such as linking individual farmers in one table to all of their plots in another. They can have more than one plot in table B but only one entry in table A.
* Many-Many relationships. A bit more technical, but this is where each row in table A can link to multiple rows in table B, but likewise each row in table B can link to multiple rows in table A. For example, each plot can contain many plants, and many plant species can be grown on many plots.
From these relationships, we also have many different types of merging
* Column binding. This sticks together two data tables that have the same number of rows, basically appending additional variables to the same set of observations.
* Row binding, this is appending additional rows/observations onto an existing data set. These must have at least mutual column names or the same columns exactly.
Joins are for merging information generally about different units. Such as appending information about farmers onto the data about their plots. Therefore matching between the keys drives this method.
* Full joins will keep all rows from both data sets, even where rows that could not be matched.
* Left/Right joins will keep all rows from one of the two datasets, depending on which direction is specified in the join.Any rows from the secondary dataset that are not matched will be dropped from the output.
* Inner joins will keep only those rows from where a match was found.
In this workbook, we will look through first how to group and summarise your data before taking an extensive look at how we can start merging some of our data together.
## Data
For this workbook we will be using four relational data tables, each storing data regarding different units of analysis. However each and everyone can be linked across this hierarchy.
At the top level we have some data about the Villages in our study
```{r, echo = FALSE}
Village_data
```
Below this we have information on a number of farmers who took part in our study
```{r, echo = FALSE}
Farmer_data
```
Further down still, we have our plot level data
```{r, echo = FALSE}
Plot_data
```
Lastly, we have some further information on fertilizers.
```{r, echo = FALSE}
Fertiliser_data
```
## Aggregating data
With any data analysis, at some point you are going to have to summarise your data in some way. Sometimes you may need to even do this prior to analysis as part of your data cleaning process. such as for the generation of new variables.This is certain to be true if you are handling relational data.
More often than not, you will need to do this summary by some sort of dis/aggregation variable.
In order to do that in R, you first need to know how to group your data.
### Group_by
Grouping data in R with dplyr is very simple.
You use the `group_by` function to create implicit groupings within your data.
Lets first look at grouping our Plot data by the farmers who tend them.
We simply provide the name of the variable we want to use to define our groups.
```{r}
Plot_data <- Plot_data %>%
group_by(farmer_id)
Plot_data
```
Looking at the data, it will look like nothing has changed. This is because the groups are implicit, it is in essence now part of the metadata of the data. But R will recognise that this grouping exists within the dataset and perform operations accordingly.
Indeed if we take a look at the structure of our data using `str`, which displays the structure of an R object, we will see we will see that rather than our data being a data.frame (the standard data format in R), it is now a grouped_df (grouped data frame). From this we can see that we have created 19 groupings in our data and we can see for each group which rows belong to it.
So our first group (farmer 1) has 3 rows in the data, and they are rows 6,10 and 38.
```{r}
str(Plot_data)
```
If for any reason you want to then get rid of this grouping, perhaps at a later point you want to summarise across your whole data and not groups, or by a different grouping variable entirely.
Then we can just add in `ungroup` and our data will be ungrouped.
```{r}
Plot_data <- Plot_data %>%
ungroup()
```
Later in this workbook we will also see how we can group by more than one variable.
### Summarise / grouped mutates
Now that we know how to group our variables, let's look at how we can start aggregating our data.
The simplest way to start creating data summaries is to use the `summarise` function from the dplyr package.
This will generate summary statistics that you define for each of the groups in your data, or for your whole dataset if your data is not grouped.
Grammatically, it works in a similar way to the `mutate` function, we first provide a new name for our summary statistic. In this case we have decided to calculate the total area of all of the farmers plots.
On the other side of the equals sign, we right the calculation/function that we want to make.
As we want a total, we use the function `sum` to sum up all the area sizes. Note that we have added the argument `na.rm = TRUE` this makes sure that any missing values are removed from the calculation. If this is not included and there is missing data, then our result would just read `NA` and not give an actual number.
```{r}
Sum1 <- Plot_data %>%
group_by(farmer_id) %>%
summarise(total_area = sum(size, na.rm = TRUE))
Sum1
```
When using summarise, the number of rows returned will be equal to the number of groups in your data. So we had 19 groups, so 19 rows. If our data was not grouped then we would have had only 1 row returned. Also when summarising data, the resulting data will only contain the variables used to group your data and the summary variables you have created. All other variables will be dropped.
Our resulting data has moved data from the plot level and summarised it up to the farmer level.
Just like `mutate` we can start creating many different summary variables by separating the calculations with a comma.
In this example, we have additionally created summaries for the average size of a farmer's plots, the standard deviation of plot size and finally we have used `n()` to generate a variable counting how many rows are in each group. Or in other words, how many plots each farmer has. Note that we have NA values for sd as a this has happened where the farmer has only 1 plot, therefore a standard deviation cannot be calculated.
```{r}
Sum2 <- Plot_data %>%
group_by(farmer_id) %>%
summarise(total_area = sum(size, na.rm = TRUE),
avg_area = mean(size, na.rm = TRUE),
sd = sd(size, na.rm = TRUE),
nplots = n())
Sum2
```
There are a number of other summarise functions that can be used to apply the same function to multiple columns rather than go through one by one. For more on this, please follow this [link](https://dplyr.tidyverse.org/reference/summarise_all.html)
You can also use mutate to generate these same variables, while keeping your data set at the plot level.
We keep everything within the function the exact same.
All we do is swap `summarise` for `mutate`
We have added in an `arrange` to sort the data now on the farmer_id rather than plot_id, just to make this demonstration a little clearer.
Note that we have kept all of our plot specific data and we still have 50 rows. But now we also have variables for total_area, avg_area, sd and nplots. The results of which are the same as what was calculated by summarise, but instead we see these results replicated for each row within a group.
```{r}
Sum3 <- Plot_data %>%
group_by(farmer_id) %>%
mutate(total_area = sum(size, na.rm = TRUE),
avg_area = mean(size, na.rm = TRUE),
sd = sd(size, na.rm = TRUE),
nplots = n()) %>%
arrange(farmer_id)
Sum3
```
This could be useful for creating a new variable that is at one level, but is also dependent on a variable from a level higher up. For example, by keeping our data at plot level we could generate a new variable that is the proportion of the total area that each plot represents across each farmer.
```{r}
Sum3 <- Sum3 %>%
mutate(plot_area_prop = size/total_area)
Sum3
```
## Binding data
In the live session we introduced two types of data merging. Binding and joining. First let's look at binding this involves combining data tables which are very similar, they will not contain related information from different levels but generally additional data that is for the same set of observations or additional observations for the same set of variables.
The two data tables would therefore have a very similar structure. We are not trying to match data from two separate tables but rather add to an existing table.
There are two ways that we can do this. By rows or by columns.
### Row binding
Row binding would be used when you have two or more datasets containing the same variables, the difference is that they contain separate sets of observations. For example, you could have data collected in one location and data collected in another but this data has been stored apart. They contain the same variables as the data collection tool was identical but different observations. You can bind these datasets together by their rows.
Let's pretend that our farmer data was originally stored into different data sets and now we want to combine them together
```{r, echo = FALSE}
Farmer_dataA <- Farmer_data[1:12,]
Farmer_dataB <- Farmer_data[13:20,]
```
```{r}
Farmer_dataA
```
First we have farmers 1 through 12
```{r}
Farmer_dataB
```
Then farmers 13 through 20.
It is thankfully very simple to bind these two datasets as we have the same number of columns in our data, and they have the same names.
We can use the base r function `rbind` to simply achieve this binding. All we need to do is add the names of the data sets we want to bind.
```{r}
rbind(Farmer_dataA, Farmer_dataB)
```
Now with rbind there is a little issue that occurs if there are variables in one dataset that aren't in the other. Let's add an age variable to our first dataset but not the second and see what happens when we try to merge them
```{r,echo = FALSE}
Farmer_dataA$Age <- sample(18:65, 12)
```
```{r, echo = FALSE}
Farmer_dataA
```
```{r, exercise = TRUE, error = TRUE}
rbind(Farmer_dataA, Farmer_dataB)
```
We get an error instead. This is because `rbind` requires that there are exactly the same number of columns in both data sets.
To get around this issue, there is the `bind_rows` function from dyplr. This matches the columns by their names (so make sure those are the same in both datasets still), and if there are any column not present in both datasets, it will just fill this with NA in the data set where it is not present.
```{r}
bind_rows(Farmer_dataA, Farmer_dataB)
```
### Column binding
Column binding on the other hand would be utilised when we have the same set of observations, but the variables are different. Perhaps we have taken additional measurements at a later point and want to bring this together with the original data.
For example, let's look to our plot data and bring in data on the yield of the crops grown on each plot. Data we have collected at a later point.
```{r}
Plot_data2
```
For column binding, as you may expect the base r function is `cbind` and the dplyr alternative is `bind_cols`. Unlike with row binding, there is not really any difference between how the functions operate.
Both require the same number of rows in order to operate. These should also be in the same order, if they are not you could use `arrange` first to make sure that they are.
```{r}
cbind(Plot_data, Plot_data2)
```
```{r}
bind_cols(Plot_data, Plot_data2)
```
Now you will notice that because both datasets contained `plot_id`, our resulting data table has two id columns unhelpfully names "plot_id...1" and "plot_id...6". This is because column binding will not merge information from columns that have the same name, rather they will just change the names.
Therefore, if we were to use column binding it would be a good idea to drop the plot_id from one dataset and then perform the bind.
We can use `select` and then put a `-` before the name of the variable to remove it from the data
Though of course this stresses the importance that these ids MUST be identical in both data sets. The same numbers, and those numbers mean the same thing.
```{r}
Plot_data3 <- Plot_data2 %>%
select(-plot_id)
```
Now when we bind the data sets. We keep plot_id as normal
```{r}
Plot_data <- bind_cols(Plot_data, Plot_data3)
Plot_data
```
This binding may have been more smoothly achieved if we had actually done a join instead as we shall see later in the workbook.
## Full join
Joining entails the more traditional form of data merging, we are bringing together data from multiple related data tables and these data tables do not contain the same levels of information.
Recall from the session that a full join will keep the rows from both data sets, regardless of whether or not there is a match.
Let's again look at merging our two pieces of plot data but rather than having yield data for all of our plots, we have it for 40 out of 50 of them.
```{r, echo = FALSE}
set.seed(2)
Plot_data2 <- Plot_data2[-sample(1:50,10),]
```
```{r}
Plot_data
```
```{r}
Plot_data2
```
From the dyplr package we would want to use the `full_join` function.
We first specify the two datasets we want to merge.
Then we use the `by = ` argument to tell R what variables are we using to merge this data. In this case we want to use the primary key of these two data sets, plot_id. Remember that a primary key is the key that uniquely identifies each row in your data. With our plot level data sets, each row is uniquely identified by that plot identification number.
```{r, echo = FALSE}
Plot_data <- select(Plot_data, - yield_kg_ha)
```
```{r}
full_join(Plot_data, Plot_data2, by = "plot_id")
```
We can see how this was quicker and simpler than trying to achieve the same result with column binding.
Note that as both pieces of data are at the plot level, we had a 1 to 1 relationship between the data sets. For every plot in table 1 there will be no more than 1 plot in table 2.
Now notice that there are some NA values for yield. This is because there was not a match between the two data sets. We did not have yield for plots 6, 8, etc. But because we used a full join, these rows are kept and not excluded. All data remains intact and we still have 50 rows.
Let’s try this again with an additional plot in Plot_data2, that does not exist in plot_data.
```{r, echo = FALSE}
Plot_data2[41,] <- list(51, 320)
```
As we know it is our 51st plot, let’s take a look at the details of this plot only. We can see that it has a yield of 320 kg/ha.Note that is is row 41 not 51 because our plot_data2 was missing 10 rows.
```{r}
Plot_data2[41,]
```
When joining this data now, plot 51 is kept in the data, but we would have all this missing data including farmer_id. Not always particularly helpful.
```{r}
Plot_data4 <- full_join(Plot_data, Plot_data2, by = "plot_id")
Plot_data4[51,]
```
I tend to find full join most useful when using it in a similar fashion to rbind, where I have similar datasets with many of the same variables but additional observations and perhaps even additional variables for existing observations.
. Let’s have another look at merging two instances of our farmer data. This time it’s slightly different: only some of the rows are shared between the two datasets and one dataset has an additional variable, household size.
```{r, echo = FALSE}
set.seed(3)
Farmer_dataC <- Farmer_data[1:12,]
Farmer_dataD <- Farmer_data[9:20,]
Farmer_dataD[13,] <- list(21, 2, "Sam")
Farmer_dataD[14,] <- list(22, 5, "Dave")
Farmer_dataD$hhsize <- sample(1:10,14, replace = T)
```
```{r}
Farmer_dataC
```
```{r}
Farmer_dataD
```
So in our second data set we have 2 additional farmers Sam and Dave, and then also additional household size information for farmers 9 to 12.
So let's join these together and see what happens
```{r}
full_join(Farmer_dataC, Farmer_dataD, by = "farmer_id")
```
Now we have 22 rows as we would expect, one for each farmer. But something has gone wrong with our village and name variables. This is because we only told R to match on farmer_id. While this logically makes sense, it means that R will think that other column with the same name are not identical, when in our case they are. As a result we get columns like name.x and name.y.
In order to avoid this we also need to include any common columns between our two data sets by listing them.
```{r}
full_join(Farmer_dataC, Farmer_dataD, by = c("farmer_id", "village_id", "name"))
```
This seems to solve the problem and we have kept information from both data tables even where we were missing household size information.
If our variables were called something different then we can control for this as well. For example, say in our previous example, in the second dataset plot_id was instead called plot_name. Then we could have written this instead. with the column name in the first data set on the left, and the other column name on the right.
```{r, eval = FALSE}
full_join(Plot_data, Plot_data2, by = c("plot_id" = "plot_name"))
```
## Left/right join
A left or right join will keep all rows from whichever is deemed to be the primary dataset. With regards to the function,for a left join this means it will keep everything from the data set written to the left. A right join will keep all rows from the dataset to the right.
This is regardless of whether or not there is a match.
So if we were to join the two sets of plot data that we had on the previous page. We could use a left join to stop us from adding in that 51st plot for which we have very little data.
```{r}
left_join(Plot_data, Plot_data2, by = "plot_id")
```
We have kept all 50 rows from the first set of plot data, including those without any yield data in the second set. All while removing that 51st plot from our data.
If we were to instead keep the arguments the same but change this to a right join we would see 41 rows. The 40 plots for which we have yield information and also the other variables. Plus the other plot for which we only have yield data. Therefore removing the 10 plots where we have that original information but not yield.
```{r}
right_join(Plot_data, Plot_data2, by = "plot_id")
```
## Inner join
An inner join trims this down further still and will only keep the rows where there is a corresponding match between each data set.
So in our plot example we would drop both the plots which have no yield data, and the 51st plot for which we have no other information.
```{r}
inner_join(Plot_data, Plot_data2, by = "plot_id")
```
As a result we have 40 rows. Because we had 40 plots were information was available from both datasets.
## Brining it all together
Finally, let's end by bring these concepts together. First by merging data down a level, and then an ambitious example of bringing data up right down from the bottom level (fertilisers) right up to the top (villages)
### Example 1
First let's bring some data down a level. For instance let's say we want to analyse some data at the plot level but we want to group this data by village to generate a village level average plot size.
But we do not know which farmers lives in which village just by looking at this plot level data.
But we do have farmer_id which can link us up with the farmer data. A data set which does include the farmer data, which includes the village id as a variable.
So we need to merge information from the farmer data and the plot data using this link. Here our foreign key is farmer_id as it does not uniquely identify our rows in the plot data. But it is identifying a particular unit of observation, the farmer. This foreign key (farmer_id) links up to the primary key of the farmer data (farmer_id).
This is an example of a 1 to Many relationship as for each individual farmer there can be many plots.
Now you may have noticed that we have 20 farmers in our data, but when we have grouped our plots by farmer, we had only 19 groups. That is because we have one farmer without any plots.
This helps inform our decision on what type of join we need to use.
We are primarily concerned with our plot data, so those are the rows we want to keep.We don't want additional rows with little information of use. So we do not want to use a full join because we would get a row for this additional farmer despite them having no plots in our data. This row would be useless.
In this example we don't have any plots belonging to farmers not in our data, but if this were the case, again we would have rows with unhelpful information as we would not be able to identify which village they lived in because they are not in the records.
So in this case, we need an inner join. Because we only want to keep the rows that have a match, because then we will have both plot size and village information.
```{r}
Plot_data <- Plot_data %>%
inner_join(Farmer_data, by = "farmer_id")
Plot_data
```
So we have successfully brought down the farmer data, including both the village_id and the farmers names.
We could extend this further and bring down the village level information as well.
```{r}
Plot_data <- Plot_data %>%
inner_join(Village_data, by = "village_id")
Plot_data
```
Now we can use `group_by` and `summarise` to calculate that village level plot average
```{r}
Plot_data %>%
group_by(village_name) %>%
summarise(avg_plot_size = mean(size, na.rm = TRUE))
```
### Example 2
Now lets start moving data the other way. In the village level data, i want to calculate the average total expenditure on fertilisers. So i have to start with fertiliser data and bring this all the way up to the village level.
Firstly I of course need to match fertilisers and plots. This is actually an example of a many to many relationship because a plot can have up to two fertilisers and a fertiliser can be used on many different plots.
Now as our plot data is at that plot level we actually have the names of the two fertilisers in separate columns. Therefore we cannot match them straight away, We could do it in staggered steps, first bring in the costs for fertiliser 1 and then the costs for fertiliser 2. Though we would then have to reshape the data to long so we could join those two costs columns together.
It would be simpler if we actually reshaped our plot data first and then merged it with the fertiliser data.
So using what we learned in the previous sessions. Let's create a long version of our plot data.
```{r, echo = FALSE}
Plot_data <- Plot_data %>%select(plot_id:fertiliser_2)
```
```{r}
Long_plots <- Plot_data %>%
pivot_longer(
cols = fertiliser_1:fertiliser_2,
names_to = "Fertiliser_no",
values_to = "fertiliser_id"
)
Long_plots
```
Now we can bring in the fertiliser data. Again using left join, as we want to prioritise the plot information.
```{r}
Long_plots <- Long_plots %>%
left_join(Fertiliser_data, by = "fertiliser_id")
Long_plots
```
Okay so we now have our price information in the data. However there is one further step we need to make before we can go further up the levels of this data. Price is calculated at per ha, so to get total expenditure we need to multiply this by the size of the plot. we can do this using mutate
```{r}
Long_plots <- Long_plots %>%
mutate(price = price_per_ha * size)
```
Now we can start summarising and bring the data up in sequential steps.
Starting with going from that long plot data, technically at the fertiliser level, back to normal plot level.
We can do this all in one pipe.
Starting with long plots we group by plot_id.
Then we summarise to calculate a plot level total expenditure on fertilisers.
Finally we right join onto our plot data. We use a right join because we start with long plots and pipe through therefore it will always be the first (left) argument in that join. But we only want the rows from the plot data so we want to join to the right.
```{r}
Plot_data <- Long_plots %>%
group_by(plot_id) %>%
summarise(price_total = sum(price, na.rm = T)) %>%
right_join(Plot_data, by = "plot_id")
Plot_data
```
We now simply repeat the process to bring this up to farmer level. Simply switching out our data arguments and change to use the appropriate keys that link the data sets.
```{r}
Farmer_data <- Plot_data %>%
group_by(farmer_id) %>%
summarise(price_total = sum(price_total, na.rm = T)) %>%
right_join(Farmer_data, by = "farmer_id")
Farmer_data
```
Finally, we repeat this one more time to bring it all up to the village level.
```{r}
Village_data <- Farmer_data %>%
group_by(village_id) %>%
summarise(price_average = mean(price_total, na.rm = T)) %>%
right_join(Village_data, by = "village_id")
Village_data
```
If we wanted to get fancy this could have all been done in the one pipe.
Note that as we start with Plot_data we do not need to join on to it like we did before, and because plot data already has the farmer_id we can go straight to summarising the data at that level first before moving it upwards to village.
```{r, echo = FALSE}
Plot_data <- Plot_data %>%select(-price_total)
Farmer_data <- Farmer_data %>%select(-price_total)
Village_data <- Village_data %>%select(-price_average)
```
```{r}
Plot_data %>%
pivot_longer(
cols = fertiliser_1:fertiliser_2,
names_to = "Fertiliser_no",
values_to = "fertiliser_id"
) %>%
left_join(Fertiliser_data, by = "fertiliser_id") %>%
mutate(price = price_per_ha*size) %>%
group_by(farmer_id) %>%
summarise(price_total = sum(price, na.rm = TRUE)) %>%
right_join(Farmer_data, by = "farmer_id") %>%
group_by(village_id) %>%
summarise(price_average = mean(price_total, na.rm = TRUE)) %>%
right_join(Village_data, by = "village_id")
```
## External links and resouces
[R for data sciene - relational data](https://r4ds.had.co.nz/relational-data.html)
[Relational Databases video](https://www.youtube.com/watch?v=C3icLzBtg8I&t=174s)
[Relational Databases video 2](https://www.youtube.com/watch?v=NvrpuBAMddw&t=29s)
[Data school](https://www.thedataschool.co.uk/harry-cooney/what-are-data-joins/)
[Displayr](https://www.displayr.com/what-is-data-merging/ )
[Data merging in Excel](https://www.trifacta.com/merge-data-in-excel/)
[Dyplr Joins help page](https://dplyr.tidyverse.org/reference/mutate-joins.html)
[Blog post on joins with dyplr](https://hollyemblem.medium.com/joining-data-with-dplyr-in-r-874698eb8898)
[Summarise with dyplr](https://dplyr.tidyverse.org/reference/summarise.html)
[Summarise with dyplr - Multiple columns at once](https://dplyr.tidyverse.org/reference/summarise_all.html)