generated from jtr13/bookdown-template
-
Notifications
You must be signed in to change notification settings - Fork 32
/
Copy pathbook_conclusion.qmd
82 lines (70 loc) · 4.52 KB
/
book_conclusion.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
# The end
Congratulations, you're done with the book and I hope you learned a thing or
two.
Part 1 focused on teaching you best practices, tools and techniques to make your
code as clean as possible. In part 2, I taught you how to turn your project into
a pipeline, and then how to make this pipeline reproducible using `{renv}` and
Docker. To summarise, here are all the things that we need to think about to
write a RAP:
- Write code that is as clean as possible: keep it DRY, document and test it well;
- Record package dependencies of the project;
- Record the R version that you use;
- Use a tool that builds the project for you;
- Record the computational environment.
If you tick all these boxes, you, or anyone else, should not have any problems
reproducing the results of your project. While it may seem that ticking these
boxes takes up valuable time from other tasks, if you use the techniques and
tools that I've showed you in part 1, this should not be the case, and you might
end up even gaining time. The only exception to this will be preparing a Docker
image, but if you supply at the very least an `renv.lock` file, creating a
Docker image to run a project could even be done much later, and only if it's
really needed (and maybe even by someone else).
# "So what?" {.unnumbered}
If you’ve reached this conclusion and are still thinking "meh, yeah,
reproducibility is nice and all, but... so what?" I hope that this last attempt
of mine to convince you that RAPs are important will be successful.
So, why bother building RAPs? Firstly, there are purely technical
considerations. It is not impossible that in quite a near future, we will work
on ever thinner clients while the heavy-duty computations will run on the cloud.
Should this be the case, being comfortable with the topics discussed in this
book will be valuable. Also, in this very near future, large language models
will be able to set up most, if not all, of the required boilerplate code to set
up a RAP. This means that you will be able to focus on analysis, but you still
need to understand what are the different pieces of a RAP, and how they fit
together, in order to understand the code that the large language model prepared
for you, but also to revise it if needed. And it is not a stretch to imagine
that simple analyses could be taken over by large language models as well. So
you might very soon find yourself in a position where you will not be the one
doing an analysis and setting up a RAP, but instead check, verify and adjust an
analysis and a RAP built by an AI. Being familiar with the concepts laid out in
this book will help you successfully perform these tasks in a world where every
data scientist will have AI assistants.
But more importantly, the following factors are inherently part of data analysis:
- transparency;
- sustainability;
- scalability.
It doesn’t matter if you’re working in research, for a public institution or a
private sector company: the three points above are incredibly important and it’s
impossible to perform data analysis without taking these into consideration,
regardless of whether AIs take over some, or most, of the tasks you perform
today. In the case of research, the *publish or perish* model has distorted
incentives so much that unfortunately a lot of researchers are focused on
getting published as quickly as possible, and see the three factors listed above
as hurdles to getting published quickly. Herculean efforts have to be made to
reproduce studies that are not reproducible, and more often than not, people
that try to reproduce the results are unsuccessful. Thankfully, things are
changing and there are more and more efforts being made to make
research reproducible by design, and not as an afterthought. In the private
sector, tight deadlines lead to the same problem: analysts think that making the
project reproducible is an hindrance to being able to deliver on time. But here
as well, taking the time to make the project reproducible will help with making
sure that what is delivered is of high quality, and it will also help with
making reusing existing code for future projects much easier, even further
accelerating development.
Data analysis, at whatever level and for whatever field, is not just about
getting to a result, the way to get to the result is part of it! This is true
for science, this is true for industry, this is true everywhere. You get to
decide where on the iceberg of reproducibility you want to settle, but the
lower, the better.
So why build RAPs? Well, because there’s no alternative if you want to perform
your work seriously.