generated from UtrechtUniversity/simple-r-project
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathPart_1_Overview_and_missing_data.qmd
230 lines (176 loc) · 7.48 KB
/
Part_1_Overview_and_missing_data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
---
title: "Preoperative Atelectasis"
subtitle: "Part 1: Overview, selection criteria, and missing data"
author: "Javier Mancilla Galindo"
date: "`r Sys.Date()`"
execute:
echo: true
warning: false
format:
html:
toc: true
toc_float: true
embed-resources: true
pdf:
documentclass: scrartcl
editor: visual
---
# Setup
#### Packages used
```{r}
if (!require("pacman", quietly = TRUE)) {
install.packages("pacman")
}
pacman::p_load(
tidyverse, # Used for basic data handling and visualization.
dataverse, # Retrieve dataset from the Harvard dataverse.
overviewR, # Used to assess missing data.
table1, #Used to add labels to variables.
report #Used to cite packages used in this session.
)
```
```{r}
#| include: false
# Create directories for sub-folders
psfolder <- "../data/processed"
reports <- "../docs/reports"
dir.create(psfolder, showWarnings = FALSE)
dir.create(reports, showWarnings = FALSE)
```
##### Session and package dependencies
```{r}
#| echo: false
# Credits chunk of code: Alex Bossers, Utrecht University ([email protected])
# remove clutter
session <- sessionInfo()
session$BLAS <- NULL
session$LAPACK <- NULL
session$loadedOnly <- NULL
# write log file
writeLines(
capture.output(print(session, locale = FALSE)),
paste0("sessions/",lubridate::today(), "_session_Part_1.txt")
)
session
```
```{r}
#| include: false
# Load dataset
data <- get_dataframe_by_name(
filename = "atelectasis_prevalence.tab",
dataset = "10.7910/DVN/4JZZLB",
server = "dataverse.harvard.edu")
# Recode variables
source("scripts/variable_names_raw.R", local = knitr::knit_global())
```
```{r}
#| include: false
# Can be used to check the structure of the dataset after prior specification.
str(data)
```
# General overview
```{r}
summary(data)
```
### Exclude participants with CO-RADS ≥ 3:
Participants with higher probability of having a current diagnosis of COVID-19 are expected to have chest CT alterations due to COVID-19 pneumonia. Thus, will be excluded.
Number of patients according to CO-RADS:
```{r}
#| echo: false
summary(data$CORADS)
```
Number of participants in the dataset (to keep track of no. of exclusions).
```{r}
count(data)
```
```{r}
data <- data %>%
filter(as.numeric(CORADS) < 3) %>%
droplevels(data$CORADS)
```
```{r}
count(data)
```
### Exclude participants with prior COVID-19:
Since prior COVID-19 is considered a confounder and since there are only 3 participants with prior COVID-19 which would provide difficult to assess the role of prior COVID-19 in analyses, participants with prior COVID-19 were excluded from the analysis.
```{r}
count(data)
```
```{r}
data <- data %>%
filter(prior_covid19 == "No") %>%
droplevels(data$prior_covid19)
```
```{r}
count(data)
```
Will remove prior_covid19 column as it no longer provides information of a varying characteristic. Similarly, rapid_covid_test does not provide additional information.
```{r}
length(data)
```
```{r}
data <- data %>% select(-c(prior_covid19, rapid_covid_test))
```
```{r}
length(data)
```
# Missing data per variable
```{r}
#| echo: false
overview_na(data)
```
Missing data for PCR_covid and is explained since only patients who decided to have a test performed on their own will reported the result. The medical center did not require a negative PCR test at that time during the pandemic, reason why PCR tests were not systematically performed. As shown in earlier summary of variables, all available tests (n=3) were negative. This variable will not be analyzed further downstream:
```{r}
data <- data %>% select(-PCR_covid)
```
The variable atelectasis_location has missing data since those patients who did not have atelectasis naturally do not have a location recorded. Will assess if data are missing for those who had atelectasis:
```{r}
#| echo: false
data %>%
filter(atelectasis == "Yes") %>%
group_by(atelectasis_location) %>%
overview_na()
```
There is no missing data for ***atelectasis_location*** after sub-setting only those who had atelectasis.
Lastly, I will subset all participants without the prior variable to further assess the extent of missing data for other variables:
```{r}
#| echo: false
data %>%
select(-c(atelectasis_location)) %>%
overview_na()
```
Missing data constitutes **less than 5%** for all remaining variables. Thus, will proceed with **complete-case analysis** without performing any data imputation procedure.
```{r}
#| include: false
#Save dataset
write.csv(data,
file = paste0(psfolder,"/atelectasis_included.csv"),
row.names=FALSE)
```
# Package references
```{r}
#| include: false
report::cite_packages(session)
```
- Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” *Journal of Statistical Software*, *40*(3), 1-25. <https://www.jstatsoft.org/v40/i03/>.
- Kuriwaki S, Beasley W, Leeper TJ (2023). *dataverse: R Client for Dataverse 4+ Repositories*. R package version 0.3.13.
- Makowski D, Lüdecke D, Patil I, Thériault R, Ben-Shachar M, Wiernik B (2023). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” *CRAN*. <https://easystats.github.io/report/>.
- Meyer C, Hammerschmidt D (2023). *overviewR: Easily Extracting Information About Your Data*. R package version 0.0.13, <https://CRAN.R-project.org/package=overviewR>.
- Müller K, Wickham H (2023). *tibble: Simple Data Frames*. R package version 3.2.1, <https://CRAN.R-project.org/package=tibble>.
- R Core Team (2024). *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>.
- Rich B (2023). *table1: Tables of Descriptive Statistics in HTML*. R package version 1.4.3, <https://CRAN.R-project.org/package=table1>.
- Rinker TW, Kurkiewicz D (2018). *pacman: Package Management for R*. version 0.5.0, <http://github.com/trinker/pacman>.
- Wickham H (2016). *ggplot2: Elegant Graphics for Data Analysis*. Springer-Verlag New York. ISBN 978-3-319-24277-4, <https://ggplot2.tidyverse.org>.
- Wickham H (2023). *forcats: Tools for Working with Categorical Variables (Factors)*. R package version 1.0.0, <https://CRAN.R-project.org/package=forcats>.
- Wickham H (2023). *stringr: Simple, Consistent Wrappers for Common String Operations*. R package version 1.5.1, <https://CRAN.R-project.org/package=stringr>.
- Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” *Journal of Open Source Software*, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
- Wickham H, François R, Henry L, Müller K, Vaughan D (2023). *dplyr: A Grammar of Data Manipulation*. R package version 1.1.4, <https://CRAN.R-project.org/package=dplyr>.
- Wickham H, Henry L (2023). *purrr: Functional Programming Tools*. R package version 1.0.2, <https://CRAN.R-project.org/package=purrr>.
- Wickham H, Hester J, Bryan J (2024). *readr: Read Rectangular Text Data*. R package version 2.1.5, <https://CRAN.R-project.org/package=readr>.
- Wickham H, Vaughan D, Girlich M (2024). *tidyr: Tidy Messy Data*. R package version 1.3.1, <https://CRAN.R-project.org/package=tidyr>.
```{r}
#| include: false
# Run this chunk if you wish to clear your environment and unload packages.
rm(data, session, psfolder, reports)
pacman::p_unload(negate = TRUE)
```