forked from b-rodrigues/rap4all
# Build automation with targets
We are finally ready to actually build a pipeline. For this, we are going to be
using a package called `{targets}` [@landau2021] which is a so-called "build
automation tool".
If you go back to the reproducibility iceberg, you will see that we are quite
low now.
Without a build automation tool, a pipeline is nothing but a series of scripts
that get called one after the other, or perhaps the pipeline is only one very
long script that does the required operations successfully.
There are several problems with this approach, so let’s see how build automation
can help us.
## Introduction
Script-based workflows are problematic for several reasons. The first is that
scripts can, and will, be executed out of order. You can mitigate some of the
problems this can create by using pure functions, but you still need to make
sure not to run the scripts out of order. But what does that actually mean?
Well, suppose that you changed a function, and only want to re-execute the parts
of the pipeline that are impacted by that change. But this supposes that you can
know, in your head, which part of the script was impacted and which was not. And
this can be quite difficult to figure out, especially when the pipeline is huge.
So you will run certain parts of the script, and not others, in the hope that
you don’t need to re-run everything.
Another issue is that pipelines written as scripts are usually quite difficult
to read and understand. To mitigate this, what you'd typically do is write a lot
of comments. But here again you face the problem of needing to maintain these
comments, and once the comments and the code are out of sync... the problems
start (or rather, they continue).
Running the different parts of the pipeline in parallel is also very complicated
if your pipeline is defined as a script. You would need to break the script into
independent parts (and make really sure that these parts are independent) and
execute them in parallel, perhaps using a separate R session for each script.
The good news is that if you followed the advice from this book you have been
using functional programming and so your pipeline is a series of pure function
calls, which simplifies running the pipeline in parallel.
But by now you should know that software engineers also faced similar problems
when they needed to build their software, and you should also suspect that they
likely came up with something to alleviate these issues. Enter build automation
tools.
When using a build automation tool, what you end up doing is writing down a
recipe that defines how the source code should be "cooked" into the software (or
in our case, a report, a cleaned dataset or any data product).
The build automation tool then tracks:
- any change in any of the code. Only the outputs that are affected by the changes you made will be re-computed (along with any outputs that depend on them);
- any change in any of the tracked files. For example, if a file gets updated daily, you can track this file and the build automation tool will only execute the parts of the pipeline affected by this update;
- which parts of the pipeline can safely run in parallel (with the option to thus run the pipeline on multiple CPU cores).
Just like many of the other tools that we have encountered in this book, what
build automation tools do is allow you to not have to rely on your brain. You
*write down* the recipe once, and then you can focus again on just the code of
your actual project. You shouldn't have to think about the pipeline itself, nor
think about how to best run it. Let your computer figure that out for you, it's
much better at such tasks than you.
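
To make the idea of change tracking a bit more concrete, here is a minimal sketch in base R of the "skip or rebuild" decision. It is only a toy: `cache` and `build()` are hypothetical names made up for this illustration, and real build automation tools (`{targets}` included) are far more thorough than this.

```r
# Toy illustration only: `cache` and `build()` are made-up names,
# they are not part of any package.
cache <- new.env()

build <- function(name, fun, input) {
  # The "fingerprint" of a step: the code of the function and its input.
  key <- list(code = deparse(fun), input = input)
  cached <- cache[[name]]
  if (!is.null(cached) && identical(cached$key, key)) {
    message("skip ", name)   # nothing changed: reuse the stored value
    return(cached$value)
  }
  message("build ", name)    # code or input changed: recompute
  value <- fun(input)
  assign(name, list(key = key, value = value), envir = cache)
  value
}

average <- function(x) mean(x)
build("avg", average, c(1, 2, 3, 10))  # first call: the step is built
build("avg", average, c(1, 2, 3, 10))  # same code, same input: skipped

average <- function(x) median(x)       # the code changes...
build("avg", average, c(1, 2, 3, 10))  # ...so the step is rebuilt
```

The point is that the decision to skip or rebuild is taken by comparing fingerprints, not by you remembering what you changed.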
## {targets} quick-start
First things first: to know everything about the `{targets}` package, you
should read the excellent [`{targets}`
manual](https://books.ropensci.org/targets/)^[https://is.gd/VS6vSs].
Everything's in there. So what I'm going to do is really just give you a very
quick intro to what I think are really the main points you should know about to
get started.
Let's start with a "hello-world" type pipeline. Create a new folder called
something like `targets_intro/`, and start a fresh R session in it. For now,
let’s ignore `{renv}`. We will see how `{renv}` works together with `{targets}`
to provide an (almost) reproducible pipeline later. In that fresh session inside the
`targets_intro/` folder, run the following line:
```{r, eval = FALSE}
targets::tar_script()
```
this will create a template `_targets.R` file in that directory. This is the
file in which we will define our pipeline. Open it in your favourite editor.
A `_targets.R` pipeline is roughly divided into three parts:
- first is where packages are loaded and helper functions are defined;
- second is where pipeline-specific options are defined;
- third is the pipeline itself, defined as a series of *targets*.
Let’s go through all these parts one by one.
### _targets.R’s anatomy
The first part of the pipeline is where packages and helper functions get
loaded. In the template, the very first line is a `library(targets)` call
followed by a function definition. There are two important things here that you
need to understand.
If your pipeline needs, say, the `{dplyr}` package to run, you could write
`library(dplyr)` right after the `library(targets)` call. However, it is best to
actually do as in the template, and load the packages using
`tar_option_set(packages = "dplyr")`. This is because if you execute the
pipeline in parallel, you need to make sure that all the packages are available
to all the workers (typically, one worker per CPU core). If you load the
packages at the top of the `_targets.R` script, the packages will be available
for the original session that called `library(...)`, but not to any worker
sessions spawned for parallel execution.
So, the idea is that at the very top of your script, you only load the
`{targets}` library and other packages that are required for running the
pipeline itself (as we shall see in coming sections). But packages that are
required by functions that are running inside the pipeline should ideally be
loaded as in the template. Another way of saying this: at the top of the script,
think "pipeline infrastructure" packages (`{targets}` and some others), but
inside `tar_option_set()` think "functions that run inside the pipeline"
packages.
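
The following sketch, in base R only, illustrates why this matters: a package attached in your current session is not attached in a freshly started R session. This is not how `{targets}` launches its workers internally (that is an assumption I am not making here); it only shows that a new session starts from a clean slate. `{splines}` is used because it ships with R.

```r
# Attach a package in the current session ({splines} ships with R).
library(splines)
"package:splines" %in% search()

# Ask a brand new R session whether it has {splines} attached.
rscript <- file.path(R.home("bin"), "Rscript")
child_answer <- system2(
  rscript,
  args = c("--vanilla", "-e",
           shQuote('cat("package:splines" %in% search())')),
  stdout = TRUE
)
child_answer
```

The new session has no idea that `library(splines)` was ever called, which is why packages needed by the targets should be declared with `tar_option_set()`.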
Part two is where you set some global options for the pipeline. As discussed
previously, this is where you should load packages that are required by the
functions that are called inside the pipeline. I won’t list all the options
here, because I would simply be repeating what’s in the
[documentation](https://docs.ropensci.org/targets/reference/tar_option_set.html)^[https://is.gd/lm4QoO].
The top of the script is also where you can define some functions that you might need
for running the pipeline. For example, you might need to define a function to
load and clean some data: this is where you would do so. We have developed a
package, so we do not need such a function; we will simply load the data from
the package directly. But sometimes your analysis doesn’t require you to write
any custom functions, or maybe just a few, and perhaps you don’t see the benefit
of building a package just for one or two functions. So instead, you have two
other options: you either define them directly inside the `_targets.R` script,
like in the template, or you create a `functions/` folder next to the
`_targets.R` script, and put your functions there. It’s up to you, but I prefer
this second option. In the example script, the following function is defined:
```{r, eval = F}
summarize_data <- function(dataset) {
  colMeans(dataset)
}
```
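
If you go for the `functions/` folder instead, the scripts it contains need to be sourced near the top of `_targets.R`. Below is a minimal sketch using only base R. The file name `summarize_data.R` is an assumption made for the illustration, and the folder is created inside a temporary directory so that the example can be run anywhere. (Recent versions of `{targets}` also provide `tar_source()` for this purpose.)

```r
# Create a throwaway project with a functions/ folder (illustration only).
project <- file.path(tempdir(), "targets_intro")
dir.create(file.path(project, "functions"),
           recursive = TRUE, showWarnings = FALSE)
writeLines(
  c("summarize_data <- function(dataset) {",
    "  colMeans(dataset)",
    "}"),
  file.path(project, "functions", "summarize_data.R")
)

# This is the part that would go near the top of _targets.R:
function_files <- list.files(
  file.path(project, "functions"),
  pattern = "\\.R$",
  full.names = TRUE
)
for (f in function_files) source(f)

# The function is now available, just as if it had been defined inline.
summarize_data(data.frame(x = 1:3, y = 4:6))
```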
Finally, comes the pipeline itself. Let’s take a closer look at it:
```{r, eval = F}
list(
  tar_target(data,
             data.frame(x = sample.int(100),
                        y = sample.int(100))),
  tar_target(data_summary,
             summarize_data(data)) # Call your custom functions.
)
```
The pipeline is nothing but a list (told you lists were a very important
object) of *targets*. A target is defined using the `tar_target()`
function and has at least two inputs: the first is the name of the target
(without quotes) and the second is the code (typically a function call) that generates the target. So a
target defined as `tar_target(y, f(x))` can be understood as `y <- f(x)`. The next
target can use the output of the previous target as an input, so you could have
something like `tar_target(z, f(y))` (just like in the template).
## A pipeline is a composition of pure functions
You can run this pipeline by typing `tar_make()` in a console:
```{r, eval = F}
targets::tar_make()
```
```r
• start target data
• built target data [0.82 seconds]
• start target data_summary
• built target data_summary [0.02 seconds]
• end pipeline [1.71 seconds]
```
The pipeline is done running! So, now what? This pipeline simply built some
summary statistics, but where are they? Typing `data_summary` in the console to try
to inspect this output results in the following:
```{r, eval = F}
data_summary
```
```r
Error: object 'data_summary' not found
```
What is going on?
First, you need to remember our chapter on functional programming. We want our
pipeline to be a sequence of pure functions. This means that our pipeline
running successfully should not depend on anything in the global environment
(apart from loading the packages in the first part of the script, and the
options set with `tar_option_set()` for the others) and it should not change
anything outside of its scope. This means that the pipeline should not change
anything in the global environment either. This is exactly how a `{targets}`
pipeline operates. A pipeline defined using `{targets}` will be pure and so the
output of the pipeline will not be saved in the global environment. Now,
strictly speaking, the pipeline is not exactly pure. Check the folder that
contains the `_targets.R` script. There should now be a `_targets/` folder in
there as well. If you go inside that folder, and then open the `objects/`
folder, you should see two objects, `data` and `data_summary`. These are the
outputs of our pipeline.
So each target that is defined inside the pipeline gets saved there in the
`.rds` format. This is an R-specific format that you can use to save *any* type
of object. It doesn’t matter what it is: a simple data frame, a fitted model, a
ggplot, whatever, you can write any R object to disk in this format using the
`saveRDS()` function, and then read it back into another R session using
`readRDS()`. `{targets}` makes use of these two functions to save every target
computed by your pipeline, and simply retrieves them from the `_targets/` folder
instead of recomputing them. Keep this in mind if you use Git to version the
code of your pipeline (which you are doing of course), and add the `_targets/`
folder to the `.gitignore` (unless you really want to also version it, but it
shouldn’t be necessary).
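
As a quick illustration of the `.rds` round trip described above, here is a small base R sketch. It uses a temporary file rather than the `_targets/` folder, and the model is only there to show that complex objects survive the trip.

```r
# Fit a model and save it to a temporary .rds file.
fit <- lm(mpg ~ wt, data = mtcars)
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)

# Read it back, as you would in another R session.
fit_again <- readRDS(path)
class(fit_again)
all.equal(coef(fit), coef(fit_again))
```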
Because the pipeline is pure, and none of its outputs get saved into the
global environment, calling `data_summary` results in the error above. To
retrieve the outputs you should use `tar_read()` or `tar_load()`. The
difference is that `tar_read()` simply reads the output and returns it (so it gets shown in the
console), whereas `tar_load()` reads the object and saves it into the global environment.
To retrieve our `data_summary` object let’s use `tar_load(data_summary)`:
```{r, eval = F}
tar_load(data_summary)
```
Now, typing `data_summary` shows the computed output:
```{r, eval = F}
data_summary
```
```{r, eval = F}
x y
50.5 50.5
```
It is possible to load all the outputs using `tar_load_everything()` so
that you don’t need to load each output one by one.
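
To put the steps of this section together, here is a sketch that writes the small pipeline to a temporary folder, runs it and reads the result back with `tar_read()`. It only does something if `{targets}` is installed. The temporary folder and the name `summary_from_store` are choices made for this illustration, so that your own project is left untouched.

```r
# Runs only if {targets} is installed; everything happens in a temporary
# folder. `summary_from_store` is a name made up for this illustration.
summary_from_store <- NULL
if (requireNamespace("targets", quietly = TRUE)) {
  project <- tempfile("targets_intro_")
  dir.create(project)
  old_wd <- setwd(project)
  tryCatch({
    # Write a _targets.R file equivalent to the template discussed above.
    targets::tar_script({
      summarize_data <- function(dataset) {
        colMeans(dataset)
      }
      list(
        tar_target(data,
                   data.frame(x = sample.int(100),
                              y = sample.int(100))),
        tar_target(data_summary, summarize_data(data))
      )
    }, ask = FALSE)

    targets::tar_make()

    # tar_read() returns the stored value, so it can be assigned to any name.
    summary_from_store <- targets::tar_read(data_summary)
  }, finally = setwd(old_wd))
}
summary_from_store
```

Since `x` and `y` are both permutations of the integers from 1 to 100, their means are always 50.5, whatever the random draw.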
Before continuing with more `{targets}` features, I want to really stress the
fact that the pipeline is the composition of pure functions. So functions that
only have a side-effect will be difficult to handle. Examples of such functions
are functions that read data, or that print something to the screen. For
example, plotting in base R consists of a series of calls to functions with
side-effects. If you open an R console and type `plot(mtcars)`, you will see a
plot. But the function `plot()` does not create any output. It just prints a
picture on your screen, which is a side-effect. To convince yourself that
`plot()` does not create any output and only has a side-effect, try to save the
output of `plot()` in a variable:
```{r, eval = F}
a <- plot(mtcars)
```
doing this will show the plot, but if you then call `a`, the plot will not
appear, and instead you will get NULL:
```{r, eval = F}
a
```
```{r, eval = F}
NULL
```
This is also why saving plots in R is awkward: there’s no object to
actually save!
So because `plot()` is not a pure function, if you try to use it in a
`{targets}` pipeline, you will get `NULL` as well when loading the target that
should be holding the plot. To see this, change the list of targets like this:
```{r, eval = F}
list(
  tar_target(data,
             data.frame(x = sample.int(100),
                        y = sample.int(100))),
  tar_target(data_summary,
             summarize_data(data)), # Call your custom functions.
  tar_target(
    data_plot,
    plot(data)
  )
)
```
I’ve simply added a new target using `tar_target()` at the end, to generate a
plot. Run the pipeline again using `tar_make()` and then type
`tar_load(data_plot)` to load the `data_plot` target. But typing `data_plot`
only shows `NULL` and not the plot!
There are several workarounds for this. The first is to use `ggplot()` instead.
This is because the output of `ggplot()` is an object of class `ggplot`. You can
do something like `a <- ggplot() + etc...` and then type `a` to see the plot.
Doing `str(a)` also shows the underlying list that holds the structure of the plot.
The second workaround is to save the plot to disk. For this, you need to write a
new function, for example:
```{r, eval = F}
save_plot <- function(filename, ...){
  png(filename = filename)
  plot(...)
  dev.off()
  filename # return the path, so that the target has an output
}
```
If you put this in the `_targets.R` script, before defining the list of
`tar_target` objects, you could use this instead of `plot()` in the last target:
```{r, eval = F}
library(targets)

summarize_data <- function(dataset) {
  colMeans(dataset)
}

# Save a base R plot to disk and return the path to the file.
save_plot <- function(filename, ...){
  png(filename = filename)
  plot(...)
  dev.off()
  filename
}

# Set target-specific options such as packages.
tar_option_set(packages = "dplyr")

# End this file with a list of target objects.
list(
  tar_target(data,
             data.frame(x = sample.int(100),
                        y = sample.int(100))),
  tar_target(data_summary,
             summarize_data(data)), # Call your custom functions.
  tar_target(
    data_plot,
    save_plot(
      filename = "my_plot.png",
      data),
    format = "file")
)
```
After running this pipeline you should see a file called `my_plot.png` in the
folder of your pipeline. If you type `tar_load(data_plot)`, and then `data_plot`
you will see that this target returns the `filename` argument of `save_plot()`.
A target needs to return something, and for functions
that save a file to disk, returning the path where the file gets saved is
recommended. If I then need to use this file in another target,
I could do `tar_target(x, f(data_plot))`. Because the `data_plot` target returns
a path, I can write `f()` in such a way that it knows how to handle this path.
If instead I write `tar_target(x, f("path/to/my_plot.png"))`, then `{targets}`
would have no way of knowing that the target `x` depends on the target
`data_plot`. The dependency between these two targets would break, which is why the
first option is preferable.
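
To see why the hard-coded path breaks the link, it helps to look at which symbols appear in each command. The sketch below uses base R's `all.vars()` as a stand-in: `{targets}` performs its own, more thorough, static analysis of the code, so this only illustrates the idea. `f` is a hypothetical function that never gets called here.

```r
# Symbols mentioned in the command of a target (function names excluded).
with_target <- all.vars(quote(f(data_plot)))
with_target

# With a hard-coded path, no symbol is left to link to `data_plot`.
with_string <- all.vars(quote(f("path/to/my_plot.png")))
with_string
```

In the first case the name `data_plot` is there to be found; in the second, all that is left is a string, and a string looks like any other string.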
Finally, you will have noticed that the last target also has the option `format =
"file"`. This will be the topic of the next section.
It is worth noting that the `{ggplot2}` package includes a function to save
`ggplot` objects to disk called `ggplot2::ggsave()`. So you could define two
targets, one to compute the `ggplot` object itself, and another to generate a
`.png` image of that `ggplot` object.
## Handling files
In this section, we will learn how `{targets}` handles files. First, run the
following lines in the folder that contains the `_targets.R` script that we’ve
been using up until now:
```{r, eval = F}
data(mtcars)
write.csv(mtcars,
          "mtcars.csv",
          row.names = F)
```
This will create the file `"mtcars.csv"` in that folder. We are going to use
this in our pipeline.
Write the pipeline like this:
```{r, eval = F}
list(
  tar_target(
    data_mtcars,
    read.csv("mtcars.csv")
  ),
  tar_target(
    summary_mtcars,
    summary(data_mtcars)
  ),
  tar_target(
    plot_mtcars,
    save_plot(
      filename = "mtcars_plot.png",
      data_mtcars),
    format = "file")
)
```
You can now run the pipeline and will get a plot at the end. The problem
however, is that the input file `"mtcars.csv"` is not being tracked for changes.
Try to change the file, for example by running this line in the console:
```{r, eval = F}
write.csv(head(mtcars), "mtcars.csv", row.names = F)
```
If you try to run the pipeline again, our changes to the data are ignored:
```r
✔ skip target data_mtcars
✔ skip target plot_mtcars
✔ skip target summary_mtcars
✔ skip pipeline [0.1 seconds]
```
As you can see, because `{targets}` is not tracking the changes in the
`mtcars.csv` file, from its point of view nothing changed. And thus the
pipeline gets skipped because according to `{targets}`, it is up-to-date.
Let’s change the csv back:
```{r, eval = F}
write.csv(mtcars, "mtcars.csv", row.names = F)
```
and change the first target such that the file gets tracked. Remember that
targets need to be pure functions and return something. So we are going to
change the first target to simply return the path to the file, and use the
`format = "file"` option in `tar_target()`:
```{r, eval = F}
path_data <- function(path){
  path
}

list(
  tar_target(
    path_data_mtcars,
    path_data("mtcars.csv"),
    format = "file"
  ),
  tar_target(
    data_mtcars,
    read.csv(path_data_mtcars)
  ),
  tar_target(
    summary_mtcars,
    summary(data_mtcars)
  ),
  tar_target(
    plot_mtcars,
    save_plot(filename = "mtcars_plot.png",
              data_mtcars),
    format = "file")
)
```
To drive the point home, I use a function called `path_data()` which takes a
path as an input and simply returns it. This is totally superfluous, and you
could define the target like this instead:
```{r, eval = F}
tar_target(
  path_data_mtcars,
  "mtcars.csv",
  format = "file"
)
```
This would have exactly the same effect as using the `path_data()` function.
So we now have a target called `path_data_mtcars` that returns nothing but the
path to the data. But because we’ve used the `format = "file"` option,
`{targets}` now knows that this is a file that must be tracked. So any change to
this file will be correctly recognised and any target that depends on this input
file will be marked as being out-of-date. The other targets are exactly the
same.
Run the pipeline now using `tar_make()`. Now, change the input file again:
```{r, eval = F}
write.csv(head(mtcars),
          "mtcars.csv",
          row.names = F)
```
Now, run the pipeline again using `tar_make()`: this time you should
see that `{targets}` correctly identified the change and runs the pipeline again
accordingly!
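
How can a tool notice that a file changed? One common approach is to compare a hash of the file's contents before and after. The sketch below uses `tools::md5sum()` from base R on a temporary copy of the data. This is only meant to illustrate the principle, I am not claiming that this is the exact mechanism `{targets}` uses.

```r
csv_path <- file.path(tempdir(), "mtcars.csv")

# Hash of the original file.
write.csv(mtcars, csv_path, row.names = FALSE)
hash_original <- unname(tools::md5sum(csv_path))

# Hash after changing the contents of the file.
write.csv(head(mtcars), csv_path, row.names = FALSE)
hash_modified <- unname(tools::md5sum(csv_path))

# Hash after restoring the original contents.
write.csv(mtcars, csv_path, row.names = FALSE)
hash_restored <- unname(tools::md5sum(csv_path))

hash_original == hash_modified  # the contents changed, so the hashes differ
hash_original == hash_restored  # same contents again, same hash
```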
## The dependency graph
As you’ve seen in the previous section (and as I told you in the introduction)
`{targets}` keeps track of changes in files, but also in the functions that you
use. Any change to the code of any of these functions will result in `{targets}`
identifying which targets are now out-of-date and which should be re-computed
(alongside any other target that depends on them). It is possible to visualise
this using `tar_visnetwork()`. This opens an interactive network graph
in your web browser that looks like this:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/targets_visnetwork.png"
alt="This image opens in your web-browser."></img>
<figcaption>This image opens in your web-browser.</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F, out.height="300px"}
#| fig-cap: "This image opens in your web-browser."
knitr::include_graphics("images/targets_visnetwork.png")
```
:::
In the image above, each target has been computed, so they are all up-to-date.
If you now change the input data, here is what you will see instead:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/targets_visnetwork_outdated.png"
alt="Because the input data was changed, we need to run the pipeline again."></img>
<figcaption>Because the input data was changed, we need to run the pipeline again.</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F, out.height="300px"}
#| fig-cap: "Because the input data was changed, we need to run the pipeline again."
knitr::include_graphics("images/targets_visnetwork_outdated.png")
```
:::
Because all the targets depend on the input data, we need to re-run everything.
Let's run the pipeline again to update all the targets using `tar_make()` before
continuing.
Now let's add another target to our pipeline, one that does not depend on the
input data. Then, we will modify the input data again, and call
`tar_visnetwork()` again. Change the pipeline like so:
```{r, eval = F}
list(
  tar_target(
    path_data_mtcars,
    "mtcars.csv",
    format = "file"
  ),
  tar_target(
    data_iris,
    iris # data("iris") would only return the string "iris"
  ),
  tar_target(
    summary_iris,
    summary(data_iris)
  ),
  tar_target(
    data_mtcars,
    read.csv(path_data_mtcars)
  ),
  tar_target(
    summary_mtcars,
    summary(data_mtcars)
  ),
  tar_target(
    plot_mtcars,
    save_plot(
      filename = "mtcars_plot.png",
      data_mtcars),
    format = "file")
)
```
Before running the pipeline, we can call `tar_visnetwork()` again to
see the entire workflow:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/targets_visnetwork_iris.png"
alt="We clearly see that the pipeline has two completely independent parts."></img>
<figcaption>We clearly see that the pipeline has two completely independent parts.</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F, out.height="300px"}
#| fig-cap: "We clearly see that the pipeline has two completely independent parts."
knitr::include_graphics("images/targets_visnetwork_iris.png")
```
:::
We can see that there are now two independent parts, as well as two unused
functions, `path_data()` and `summ()` which we could remove.
Running the pipeline using `tar_make()` builds everything
successfully. Let’s add the following target, just before the very
last one:
```{r, eval = F}
  tar_target(
    list_summaries,
    list(
      "summary_iris" = summary_iris,
      "summary_mtcars" = summary_mtcars
    )
  ),
```
This target creates a list with the two summaries that we compute. Call
`tar_visnetwork()` again:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/targets_visnetwork_list_summaries.png"
alt="The two separate workflows end up in one output."></img>
<figcaption>The two separate workflows end up in one output.</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F, out.height="300px"}
#| fig-cap: "The two separate workflows end up in one output."
knitr::include_graphics("images/targets_visnetwork_list_summaries.png")
```
:::
Finally, run the pipeline one last time to compute the final output.
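
The logic behind the outdated targets in these graphs can be sketched in a few lines of base R. The edge list below is typed in by hand from the pipeline of this section, and `downstream()` is a hypothetical helper written for this illustration, not a `{targets}` function. It walks the graph from a changed node and collects everything that depends on it.

```r
# Edges of the dependency graph: `to` depends on `from`.
edges <- data.frame(
  from = c("path_data_mtcars", "data_mtcars", "data_mtcars",
           "data_iris", "summary_iris", "summary_mtcars"),
  to   = c("data_mtcars", "summary_mtcars", "plot_mtcars",
           "summary_iris", "list_summaries", "list_summaries")
)

# Hypothetical helper: everything that must be re-run when `changed` changes.
downstream <- function(edges, changed) {
  outdated <- changed
  frontier <- changed
  while (length(frontier) > 0) {
    # Direct dependents of the nodes we have just visited...
    dependents <- unique(edges$to[edges$from %in% frontier])
    # ...minus those that are already marked as outdated.
    frontier <- setdiff(dependents, outdated)
    outdated <- c(outdated, frontier)
  }
  outdated
}

downstream(edges, "path_data_mtcars")
downstream(edges, "data_iris")
```

Changing the `mtcars` file leaves `data_iris` and `summary_iris` alone, but `list_summaries` is outdated in both cases because both branches end up in it.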
## Running the pipeline in parallel
`{targets}` makes it easy to run independent parts of our pipeline in parallel.
In the example from before, it was quite obvious which parts were
independent, but when the pipeline grows in complexity, it can be very difficult
to see which parts are independent.
Let’s now run the example from before in parallel. But first, we need to create
a function that takes some time to run. `summary()` is so quick that running
both of its calls in parallel is not worth it (and would actually even run
slower, I’ll explain why at the end). Let’s define a new function called
`slow_summary()`:
```{r, eval = F}
slow_summary <- function(...){
  Sys.sleep(30)
  summary(...)
}
```
and replace every call to `summary()` with `slow_summary()` in the
pipeline:
```{r, eval = F}
list(
  tar_target(
    path_data_mtcars,
    "mtcars.csv",
    format = "file"
  ),
  tar_target(
    data_iris,
    iris # data("iris") would only return the string "iris"
  ),
  tar_target(
    summary_iris,
    slow_summary(data_iris)
  ),
  tar_target(
    data_mtcars,
    read.csv(path_data_mtcars)
  ),
  tar_target(
    summary_mtcars,
    slow_summary(data_mtcars)
  ),
  tar_target(
    list_summaries,
    list(
      "summary_iris" = summary_iris,
      "summary_mtcars" = summary_mtcars
    )
  ),
  tar_target(
    plot_mtcars,
    save_plot(filename = "mtcars_plot.png",
              data_mtcars),
    format = "file")
)
```
here’s what the pipeline looks like before running:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/targets_visnetwork_slow_summary.png"
alt="slow_summary() is used instead of summary()."></img>
<figcaption>slow_summary() is used instead of summary().</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F, out.height="300px"}
#| fig-cap: "slow_summary() is used instead of summary()."
knitr::include_graphics("images/targets_visnetwork_slow_summary.png")
```
:::
(You will also notice that I’ve removed the unneeded functions, `path_data()`
and `summ()`).
Running this pipeline sequentially will take about a minute, because each call
to `slow_summary()` takes 30 seconds. To re-run the pipeline completely from
scratch, call `tar_destroy()`. This deletes the `_targets/` folder, which makes
all the targets outdated.
Then, run the pipeline from scratch with `tar_make()`:
```{r, eval = F}
targets::tar_make()
```
```r
• start target path_data_mtcars
• built target path_data_mtcars [0.18 seconds]
• start target data_iris
• built target data_iris [0 seconds]
• start target data_mtcars
• built target data_mtcars [0 seconds]
• start target summary_iris
• built target summary_iris [30.26 seconds]
• start target plot_mtcars
• built target plot_mtcars [0.16 seconds]
• start target summary_mtcars
• built target summary_mtcars [30.29 seconds]
• start target list_summaries
• built target list_summaries [0 seconds]
• end pipeline [1.019 minutes]
```
Since computing `summary_iris` is completely independent of `summary_mtcars`,
these two computations could be running at the same time on two separate CPU
cores. To do this, we need to first load two additional packages, `{future}` and
`{future.callr}` at the top of the script. Then, we also need to call
`plan(callr)` before defining our pipeline. Here is what the complete
`_targets.R` looks like:
```{r, eval = F}
library(targets)
library(future)
library(future.callr)
plan(callr)

# Sometimes you gotta take your time
slow_summary <- function(...) {
  Sys.sleep(30)
  summary(...)
}

# Save plot to disk
save_plot <- function(filename, ...){
  png(filename = filename)
  plot(...)
  dev.off()
  filename
}

# Set target-specific options such as packages.
tar_option_set(packages = "dplyr")

list(
  tar_target(
    path_data_mtcars,
    "mtcars.csv",
    format = "file"
  ),
  tar_target(
    data_iris,
    iris # data("iris") would only return the string "iris"
  ),
  tar_target(
    summary_iris,
    slow_summary(data_iris)
  ),
  tar_target(
    data_mtcars,
    read.csv(path_data_mtcars)
  ),
  tar_target(
    summary_mtcars,
    slow_summary(data_mtcars)
  ),
  tar_target(
    list_summaries,
    list(
      "summary_iris" = summary_iris,
      "summary_mtcars" = summary_mtcars
    )
  ),
  tar_target(
    plot_mtcars,
    save_plot(
      filename = "mtcars_plot.png",
      data_mtcars),
    format = "file")
)
```
You can now run this pipeline in parallel using `tar_make_future()`
(and sequentially as well, just as usual with `tar_make()`). To test this, call
`tar_destroy()` and then `tar_make_future()`, which will build the entire
pipeline from scratch:
::: {.content-visible when-format="pdf"}
\newpage
:::
```r
# Set workers = 2 to use 2 cpu cores
targets::tar_make_future(workers = 2)
```
```r
• start target path_data_mtcars
• start target data_iris
• built target path_data_mtcars [0.2 seconds]
• start target data_mtcars
• built target data_iris [0.22 seconds]
• start target summary_iris
• built target data_mtcars [0.2 seconds]
• start target plot_mtcars
• built target plot_mtcars [0.35 seconds]
• start target summary_mtcars
• built target summary_iris [30.5 seconds]
• built target summary_mtcars [30.52 seconds]
• start target list_summaries
• built target list_summaries [0.21 seconds]
• end pipeline [38.72 seconds]
```
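To put a rough number on the speed-up, here is a small back-of-the-envelope calculation using only the timings from the log above. The assumption (mine, not something `{targets}` reports) is that a sequential run cannot be faster than the sum of the two slow targets, so the result is a lower bound on the speed-up rather than an exact measurement:

```r
# Timings copied from the log above (in seconds)
parallel_total <- 38.72
slow_targets <- c(summary_iris = 30.5, summary_mtcars = 30.52)

# Assumption: a sequential run takes at least as long as both slow targets
sequential_lower_bound <- sum(slow_targets)

# Lower bound on the speed-up obtained with two workers
speedup <- sequential_lower_bound / parallel_total

# Time of the parallel run not explained by the longest single target
overhead <- parallel_total - max(slow_targets)

round(speedup, 2)
round(overhead, 2)
```

In this run, roughly eight seconds were spent on something other than the longest target: starting worker sessions, moving data around, and waiting for upstream targets to finish.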
As you can see, this was faster: not quite twice as fast, but close. It isn’t
exactly twice as fast because running code in parallel carries some overhead.
New R sessions have to be spawned by `{targets}`, data needs to be transferred,
and packages must be loaded in these new sessions. This is why it’s only worth
parallelizing code that takes some time to run. If you decrease the number of
sleep seconds in `slow_summary(...)` (for example to 10), running the code in
parallel might be slower than running it sequentially because of that overhead.
But if you have several long-running computations, the small price you pay for
the initial setup is well worth it. Let me reiterate that in order to run your
pipeline in parallel, the extra worker sessions spawned by `{targets}` need to
know which packages to load, which is why you should declare the packages your
pipeline needs using:
```{r, eval = FALSE}
tar_option_set(packages = "dplyr")
```
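`tar_option_set()` sets the default for every target. If only one target needs a heavier package, `tar_target()` also accepts a `packages` argument of its own. The sketch below is a minimal illustration of that idea; the target names `data_mtcars_am` and `plot_mtcars_am` are made up for this example and are not part of the pipeline above:

```r
library(targets)

# Default: packages attached before any target is built
tar_option_set(packages = "dplyr")

list(
  # Uses the default packages, so mutate() from dplyr is available
  tar_target(
    data_mtcars_am,
    mutate(mtcars, am = as.character(am))
  ),
  # The packages argument replaces the default for this target only,
  # so only ggplot2 is attached while this target is built
  tar_target(
    plot_mtcars_am,
    ggplot(data_mtcars_am) +
      geom_point(aes(x = hp, y = mpg, shape = am)),
    packages = "ggplot2"
  )
)
```

Either way, the point is the same: the worker sessions only know about the packages that the pipeline declares, not about whatever you loaded in your interactive session.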
## {targets} and RMarkdown (or Quarto)
It is also possible to compile documents using RMarkdown (or Quarto) with
`{targets}`. The way this works is by setting up a pipeline that produces the
outputs you need in the document, and then defining the document as a target to
be computed as well. For example, if you’re showing a table in the document,
create a target in the pipeline that builds the underlying data. Do the same for
a plot, or a statistical model. Then, in the `.Rmd` (or `.qmd`) source file, use
`targets::tar_read()` to load the different objects you need.
Consider the following `_targets.R` file:
```{r, eval = FALSE}
library(targets)

tar_option_set(packages = c("dplyr", "ggplot2"))

list(
  tar_target(
    path_data_mtcars,
    "mtcars.csv",
    format = "file"
  ),
  tar_target(
    data_mtcars,
    read.csv(path_data_mtcars)
  ),
  tar_target(
    summary_mtcars,
    summary(data_mtcars)
  ),
  tar_target(
    clean_mtcars,
    mutate(data_mtcars,
           am = as.character(am))
  ),
  tar_target(
    plot_mtcars,
    {ggplot(clean_mtcars) +
       geom_point(aes(y = mpg,
                      x = hp,
                      shape = am))}
  )
)
```
This pipeline loads the `.csv` file from before and creates a summary of the
data as well as a plot. But we don’t simply want these objects to be saved as
`.rds` files by the pipeline; we want to be able to use them to write a document
(either in the `.Rmd` or `.qmd` format). For this, we need another package,
called `{tarchetypes}`. This package comes with many functions that allow you to
define new types of targets (these functions are called *target factories* in
`{targets}` jargon). The new target factory that we need is
`tarchetypes::tar_render()`. As you can probably guess from the name, this
function renders an `.Rmd` file. Write the following lines in an `.Rmd` file and
save it as `my_document.Rmd` next to the pipeline:
````{verbatim}
---
title: "mtcars is the best data set"
author: "mtcars enjoyer"
date: today
---
## Load the summary
```{r}
tar_read(summary_mtcars)
```
````
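The same pattern works for the other targets of the pipeline. As a sketch, the body of a further chunk in the same `.Rmd` file could look like the following. It assumes the pipeline has already been built, so that `plot_mtcars` and `clean_mtcars` exist in the store:

```r
# Assumes the targets below were already built by the pipeline
library(targets)

# tar_read() returns the stored object; a ggplot object returned at the
# top level of a chunk is printed as a figure in the rendered document
tar_read(plot_mtcars)

# tar_load() instead assigns the object to a variable named after the target
tar_load(clean_mtcars)
head(clean_mtcars)
```

Use `tar_read()` when you only need to show an object once, and `tar_load()` when you want to keep working with it in later chunks.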
Here is the `_targets.R` file again, where I now load `{tarchetypes}` at the top
and add a new target at the bottom:
```{r, eval = FALSE}
library(targets)
library(tarchetypes)

tar_option_set(packages = c("dplyr", "ggplot2"))

list(
  tar_target(
    path_data_mtcars,
    "mtcars.csv",
    format = "file"
  ),
  tar_target(
    data_mtcars,
    read.csv(path_data_mtcars)
  ),
  tar_target(
    summary_mtcars,
    summary(data_mtcars)
  ),
  tar_target(
    clean_mtcars,
    mutate(data_mtcars,
           am = as.character(am))
  ),
  tar_target(
    plot_mtcars,
    {ggplot(clean_mtcars) +
       geom_point(aes(y = mpg,
                      x = hp,
                      shape = am))}
  ),
  tar_render(
    my_doc,
    "my_document.Rmd"
  )
)
```
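As an aside, if your source file is a Quarto document rather than an `.Rmd` file, `{tarchetypes}` provides `tar_quarto()` as the counterpart of `tar_render()`. The sketch below is only an illustration under some assumptions: `my_document.qmd` is a hypothetical Quarto version of the document above, and, as far as I know, the Quarto command line tool and the `{quarto}` R package both need to be installed for it to work:

```r
library(targets)
library(tarchetypes)

list(
  tar_target(
    summary_mtcars,
    summary(mtcars)
  ),
  # Assumption: "my_document.qmd" is a hypothetical Quarto file that calls
  # tar_read(summary_mtcars) in one of its chunks
  tar_quarto(
    my_doc,
    "my_document.qmd"
  )
)
```

The rest of this section sticks to the `.Rmd` pipeline shown just before this aside.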
Running this pipeline with `tar_make()` will now compile the source `.Rmd` file
into an `.html` file that you can open in your web browser. Even if you want to
compile the document into another format, I advise you to develop using the
`.html` format. This is because you can open the `.html` file in the
web browser, and keep working on the source. Each time you run the pipeline
after making some changes to the file, you simply need to refresh the
web browser to see your changes. If instead you compile a Word document, you
will need to always close the file, and then re-open it to see your changes,