---
title: "Drinker's Body Signal"
author: "Damiano Pincolini"
date: 2024/03/01
output:
  html_document:
    toc: true
    toc-depth: 2
    toc-expand: true
    number_sections: true
---
## PROJECT INTRODUCTION

### PROJECT GOAL

The aim of this project is to investigate the potential relationship between a target variable (drinker status: "N", "Y") and a set of predictors based on body signals, both qualitative (e.g. smoking status) and quantitative (e.g. cholesterol).
### LOADING LIBRARIES

The libraries mainly used throughout this project are tidyverse, SmartEDA, tidymodels and shapviz.
```{r}
# pacman::p_load() installs any missing packages and then loads them all.
pacman::p_load(readr,
               tidyverse,
               tidymodels,
               DataExplorer,
               SmartEDA,
               ggcorrplot,
               corrplot,
               moments,
               skimr,
               knitr,
               DescTools,
               stats,
               vip,
               datawizard,
               GGally,
               caret,
               MASS,
               kknn,
               rpart.plot,
               healthyR.ai,
               embed,
               cowplot,
               factoextra,
               LiblineaR,
               shapviz)
```
### LOADING DATA

The dataset comes from this Kaggle page: <https://www.kaggle.com/datasets/sooyoungher/smoking-drinking-dataset>. After downloading it to the local hard disk, it has been imported into R as a data frame called "DataOrigin".
```{r}
#| echo: false
#| output: false
# The working directory is machine-specific; adjust the path as needed.
setwd('C:/Users/casa/Documents/Damiano/R/Mentoring/4.1. drinkersSmokers_tidymodels/RawData')
DataOrigin <- read_csv("DataKaggle.csv")
```
```{r}
#| echo: false
#| output: false
# Restore the project working directory.
setwd('C:/Users/casa/Documents/Damiano/R/Mentoring/4.1. drinkersSmokers_tidymodels')
```
### DATA CHECK

#### Dataset structure and datatype analysis

First of all, with the sole aim of getting a first look at the dataset structure, I want to explore it with some very useful functions from the SmartEDA package.
They do quite a lot of work and provide a fairly comprehensive summary that gives an overall picture of the data I am going to work on.
```{r}
#| echo: true
DataExp1 <- ExpData(DataOrigin, type=1)
kable(DataExp1)
DataExp2 <- ExpData(DataOrigin, type=2)
kable(DataExp2)
```
At a first sight, we can observe the following:

- "sex" and "DRK_YN" are character columns.
- All the other variables are numeric.
- "hear_left" and "hear_right" are dichotomous numeric features (1 = normal, 2 = abnormal). They should be represented as factors.
- "urine_protein" is a numeric ordinal variable to be transformed into a factor (1 = -, 2 = +/-, 3 = +1, 4 = +2, 5 = +3, 6 = +4).
- "SMK_stat_type" is a numeric ordinal variable to be transformed into a factor (1 = never smoked, 2 = used to smoke but quit, 3 = still smokes).
- "DRK_YN" is a dichotomous character feature (Y, N) that should be turned into a factor. This is the target variable.
- There are visible differences in the features' scales.
- There are no missing values, so no imputation is needed (see the quick check below).
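As a quick sanity check on the missing-value claim, a minimal sketch using only base R and the columns already present in the data:

```{r}
# Count missing values per column; every entry should be zero.
colSums(is.na(DataOrigin))
```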
#### Conversion to factor

I want to convert to factor every feature that is actually categorical, whether it comes encoded as numeric or character.
By doing this, I create a new object simply called Data, with the same dimensions but with
standardized data types. This should be useful once we train some ML models.
```{r}
Data <- DataOrigin |>
  mutate(sex = as_factor(sex),
         drkStatus = as_factor(DRK_YN),
         hear_left = as_factor(hear_left),
         hear_right = as_factor(hear_right),
         urine_protein = as_factor(urine_protein),
         smk_status = as_factor(SMK_stat_type_cd),
         .keep = "unused")
```
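A quick glance to confirm that the conversion produced the expected column types (purely a convenience check):

```{r}
# Every formerly categorical column should now appear as a factor.
glimpse(Data)
```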
### SAMPLING AND DATA PARTITIONING

In order to keep computation fast and avoid overloading the machine, I randomly reduce the original dataset and, from this point of the project onwards, work only on a new "reduced" object (called DataReduced), which I will immediately use for data partitioning.
When calling initial_split(), I want to keep the same proportion of drinker classes ("Y" and "N") in both the training and the test set, so I will use the "strata" argument.
```{r}
#| warning: false
set.seed(666, sample.kind = "Rounding")
Index <- sample(1:nrow(Data),
                size = 100000,
                replace = FALSE)
DataReduced <- Data[Index, ]
DataSplit <- initial_split(DataReduced,
                           prop = 0.8,
                           strata = drkStatus)
trainData <- training(DataSplit)
testData <- testing(DataSplit)
```
I have performed data partitioning this early in the pipeline because, from this point on, I want to use the training set only for every activity related to exploratory data analysis and model training, recalling the test set only during the model evaluation phase.
This avoids any kind of data leakage.
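To confirm that the stratified split preserved the class balance, a minimal sketch using dplyr's count():

```{r}
# Class proportions should be nearly identical in the two partitions.
trainData |> count(drkStatus) |> mutate(prop = n / sum(n))
testData |> count(drkStatus) |> mutate(prop = n / sum(n))
```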
## EDA

### Univariate analysis

I am going to explore every single feature: how it is distributed and how it interacts with the target variable.
I will start from the numerical predictors and then move on to the categorical ones.

#### Numerical features

I keep on using the SmartEDA package. The ExpNumStat() function gives a pretty clear bird's eye view of how every single variable behaves.
```{r}
DataSmartEDA <- ExpNumStat(trainData,
                           by = "A",
                           gp = NULL,
                           Qnt = c(0.25, 0.75),
                           MesofShape = 2,
                           Outlier = TRUE,
                           round = 2)
kable(DataSmartEDA)
```
Is there anything interesting to highlight?

- SGOT_ALT and SGOT_AST have a CV around 100% (102% and 90% respectively), while gamma_GTP's CV is even higher (1.36, i.e. 136%).
- Several variables have very high skewness and kurtosis and quite a high number of outliers.
- Generally speaking, based on these statistics, there seems to be a clear need for feature transformation (a minimal example is sketched after the outlier summary below).

At this stage, given this evidence, I definitely want to detect the outliers properly; the table below summarizes them.
```{r}
DataOutlier <- ExpOutliers(trainData,
                           varlist = c("gamma_GTP", "SGOT_AST", "SGOT_ALT"),
                           method = "boxplot")$outlier_summary
kable(DataOutlier)
```
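As anticipated above, here is a minimal sketch of one possible transformation: a simple log transform of gamma_GTP, chosen only as an illustration (the actual transformation choice belongs to the modeling recipes):

```{r}
# Compare the raw and log-transformed distributions of gamma_GTP.
p_raw <- ggplot(trainData, aes(gamma_GTP)) + geom_histogram(bins = 50)
p_log <- ggplot(trainData, aes(log(gamma_GTP))) + geom_histogram(bins = 50)
plot_grid(p_raw, p_log, labels = c("raw", "log"))
```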
Let's move on and look for patterns in the distribution of the numerical features.
```{r}
ExpNumViz(data = trainData,
          target = "drkStatus",
          type = 1,
          Page = c(3, 2))
```
The numerical features visualization shows that some predictors are strongly associated with the target variable, while others are not. Age, height and weight look the most relevant in this respect, even if the different scale of each box plot could lead to some misinterpretation.
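To limit the scale issue just mentioned, a hedged sketch that re-plots a few of the seemingly relevant predictors with free y scales (the selection of age, height and weight is only illustrative):

```{r}
# Boxplots by drinker status, one free-scaled panel per feature.
trainData |>
  dplyr::select(drkStatus, age, height, weight) |>
  pivot_longer(-drkStatus, names_to = "feature", values_to = "value") |>
  ggplot(aes(x = drkStatus, y = value)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y")
```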
#### Categorical features

As with the numerical variables, let's summarize the categorical features with the SmartEDA package.
Specifically, I'll use the ExpCatStat() function.
```{r}
#| warning: false
DataSmartEda_cat <- ExpCatStat(trainData,
                               Target = "drkStatus",
                               result = "stat",
                               clim = 10,
                               nlim = 5,
                               bins = 10,
                               Pclass = "Y",
                               plot = FALSE,
                               top = 20,
                               Round = 2) |>
  rename(PredPower = 'Predictive Power') |>
  dplyr::filter(PredPower == "Highly Predictive")
kable(DataSmartEda_cat)
```
What can we see from this table?

1. Six predictors seem to have a highly predictive power.
2. Five of these six also appear among the decision tree's most important variables.
3. Serum_creatinine is considered among the top 6 by the decision tree, while in the Information Value ranking it only gets a "somewhat predictive"/"moderate" predictive power.
4. On the other hand, gamma_GTP is ranked higher by Information Value but only reaches eighth place in the decision tree's variable importance ranking.
5. Amongst these six features, only height and weight seem strongly correlated with each other (0.7). Curiously, height is more correlated than weight with the drinking habit (the target variable).
6. I will "merge" these two features into one (body mass index) in order to keep the information of both in a single predictor (see the sketch below).
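A minimal sketch of the merge mentioned in point 6, assuming height is expressed in centimetres and weight in kilograms (in the actual pipeline this would more naturally live inside a recipe step):

```{r}
# Body mass index: weight (kg) divided by squared height (m).
# Shown without assignment so the training data used below is unchanged.
trainData |>
  mutate(bmi = weight / (height / 100)^2) |>
  dplyr::select(height, weight, bmi) |>
  head()
```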
Now, is there something we can observe about the distribution of the categorical features?
I can quickly find out with ExpCatViz(), again from SmartEDA.
```{r}
ExpCatViz(data = trainData,
          target = "drkStatus",
          Page = c(3, 2),
          Flip = TRUE,
          col = c("blue", "violet"))
```
This visualization seems pretty informative:

- sex and smoking habit (smk_status) look correlated with the target variable. Specifically, males tend to smoke more frequently, and non-smokers (status 1) are typically non-drinkers, while former and current smokers more often drink alcohol (a quick cross-tabulation is sketched below).
- The other categorical variables don't seem to be significantly associated with drinking habits.
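A minimal sketch of the cross-tabulation mentioned above, using plain dplyr on the existing columns:

```{r}
# Share of drinkers within each smoking-status level.
trainData |>
  count(smk_status, drkStatus) |>
  group_by(smk_status) |>
  mutate(prop = n / sum(n)) |>
  ungroup()
```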
### Multivariate analysis
#### Correlation between numerical predictors
```{r}
# Drop the factor columns (by position) so that cor() only sees numeric features.
corMatr <- round(cor(trainData[, -c(1, 8, 9, 18, 23, 24)], use = "complete.obs"), 1)
ggcorrplot(corMatr,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)
```
Only a few numerical features seem to be significantly correlated with each other:

* tot_chole and LDL_chole (0.9),
* height and weight (0.7),
* SBP and DBP (0.7).

Given this, it makes sense to consider dropping one variable from each of the correlated pairs above (or letting a recipe step do it automatically, as sketched below).
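A hedged sketch of how this pruning could be automated in the tidymodels workflow with recipes::step_corr(); the 0.8 threshold is an arbitrary illustrative choice:

```{r}
# A recipe that drops one variable from every pair of highly correlated
# numeric predictors (a specification only; not yet trained on the data).
corrRecipe <- recipe(drkStatus ~ ., data = trainData) |>
  step_corr(all_numeric_predictors(), threshold = 0.8)
corrRecipe
```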
#### Principal Component Analysis

With the correlation plot, I took a first glance at whether some variables may be considered "useless" because their potential predictive power is already carried by other variables, which would simplify the dataset and reduce the risk of the "curse of dimensionality".
Another analysis I can carry out towards this goal is Principal Component Analysis (PCA): if I can compress the dataset features into a smaller number of brand new variables (the principal components), and the very first components (say two to four) are able to explain most of the predictors' variance, I could feed the machine learning models with quite a light and simple predictor dataset, which in some cases could be very useful.
I will use the factoextra package. Let's see what we can get.
```{r}
dsPca2 <- trainData |>
  dplyr::select(where(is.numeric)) |>
  prcomp(scale = TRUE)
summary(dsPca2)
```
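As a visual companion to the summary above, a scree plot from factoextra (the package mentioned earlier); showing the first 15 components is an arbitrary choice:

```{r}
# Percentage of variance explained by each of the leading components.
fviz_eig(dsPca2, addlabels = TRUE, ncp = 15)
```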
What can I say? To explain 80% of the total variance, I need to retain 9 principal components (which are a sort of "synthetic features" obtained by "blending" together the original variables).
I read this result in two ways:

* going from 24 to 9 "main" variables, the "saving" in terms of dimensionality is pretty interesting;
* on the other hand, 9 features are still a high number, especially if after PCA I have to cope with dimensions I cannot really handle properly as far as interpretation is concerned.

Of course, it is necessary to keep in mind that PCA only involves the numerical predictors (the sketch below shows how it could be plugged into a preprocessing recipe).
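In view of the modeling step, a hedged sketch of how the PCA result could be embedded in a tidymodels recipe; num_comp = 9 mirrors the figure discussed above, and the normalization step is a standard prerequisite for PCA rather than something stated earlier:

```{r}
# Recipe specification: normalize the numeric predictors, then replace them
# with their first nine principal components (factor columns are left untouched).
pcaRecipe <- recipe(drkStatus ~ ., data = trainData) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 9)
pcaRecipe
```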
## Next steps
What's above may be used as a basis for training and testing some ML models.