-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathC5.0_OneR_rPart_RandomForest.Rmd
263 lines (178 loc) · 12.2 KB
/
C5.0_OneR_rPart_RandomForest.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
---
title: <center>Decision Tree Classification - 4 Algorithms</center>
author: <center>C. Perez</center>
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Project & Data Description
This code applies 4 decision tree algorithms (**C5.0, OneR, rpart, and randomForest**) to a [diabetes dataset](https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset.) offered in the UCI machine learning repository with the aim to correctly classify the presence of diabetes given the presence of other conditions. The goal is to then improve on the models with a business objective in mind, NOT to improve the overall accuracy, and compare. The business objective being that the presence of false negatives in trying to predict the presence of a condition outweighs the overall accuracy of correct classification.
The dataset is made up of 520 observations of 17 features.
## EDA
This section explores the diabetes data to find proportions of the target variable (class), split the data into appropriate training and test sets, and verify the proportions of both sets are representative of the whole.
```{r, include = TRUE, warning = FALSE, fig.align='center', comment="", message=FALSE}
# libraries
library(C50)
library(gmodels)
library(OneR)
library(tidyverse)
library(rpart)
library(rpart.plot)
library(randomForest)
# load data -
# https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset.
diabetes <- read.csv("~/Documents/Machine-Learning-Data/diabetes_data.csv",
header = TRUE, stringsAsFactors = TRUE)
# cleaning
names(diabetes) <- str_to_lower(str_replace_all(names(diabetes),"\\.","_"))
# view the structure and get some proportions and info to start
str(diabetes)
sum(is.na(diabetes))
table(diabetes$class)
prop.table(table(diabetes$class)) # approx. 61% positive 38% negative
# create some random samples (75/25 split)
set.seed(1230)
d_train_indices <- sample(nrow(diabetes), .75*nrow(diabetes))
# create train and test set
d_train <- diabetes[d_train_indices,]
d_test <- diabetes[-d_train_indices,]
# test the sample proportions to identify if it were a good split
prop.table(table(d_train$class))
prop.table(table(d_test$class))
```
## rpart
The following code implements the rpart algorithm by building a model to accurately classify the *class* variable given the other 16 features present in the data. The model is inspected and viewed (rpart.plot package) and then used to classify (predict) the outcome of the test set.
```{r, include = TRUE, warning = FALSE, fig.align='center', comment="",fig.dim=c(8,4)}
# create the model
rpart_model <- rpart(formula = class~., data = d_train, method = "class")
# view model
# summary(rpart_model)
# view plot of model
rpart_model
rpart.plot(rpart_model)
#make a prediction
rpart_predictions <- predict(rpart_model, d_test[,-17], type = "class")
```
The accuracy is important to inspect from multiple perspectives. The first accuracy check provides a score on the match between the test class and the predicted class, and it is fairly good around 88%. Sometimes this accuracy should be compared when the data only has a few entries for one of the target classes, (i.e Positive or Negative). The accuracy for each case is computed and it shows that there is value in having the model as the target variable is not mainly of one type and the resulting accuracy drastically improves as a resul to fhaving the prediction.
###### This was known prior to by viewing the proportion of each class in the target variable. If a proportion is around 95+%, then the additonal accuracy should be checked. For the case where there is a fair split (i.e one target class not too dominant) then the first accuracy check is generally sufficient.
```{r, include = TRUE, warning = FALSE, fig.align='center', comment=""}
# get accuracy
mean(rpart_predictions == d_test$class)
# compare this to predicting every result positive or every result negative
mean(d_test$class == "Positive")
mean(d_test$class == "Negative")
# cross tabulate to identify the types of errors and discuss
CrossTable(d_test$class, rpart_predictions, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c("Actual", "Predicted"))
```
Although the model was pretty accurate, there are a high number of false negatives, as can be seen in the cross-table above, which is dangerous. Improving this model would be reducing the likelihood of a false negative regardless of whether overall accuracy increase or decreases because it can be quite costly to misdiagnose someone this way. As shown below, adding a cost matrix to the rpart function allows for a penalty to be applied to producing false negatives opposed to producing false positives.
```{r, include = TRUE, warning = FALSE, fig.align='center', comment=""}
# How to reduce the likelihood of a false negative?
# make a prediction with a loss matrix penalizing false negatives
rpart_model_improved <- rpart(formula = class~., data = d_train, method = "class",
parms = list(loss=matrix(c(0,1,2,0), byrow=TRUE, nrow=2)))
rpart_predictions_improved <- predict(rpart_model_improved, d_test[,-17], type = "class")
# get new accuracy
mean(rpart_predictions_improved == d_test$class)
# cross tabulate to identify the types of errors and discuss
CrossTable(d_test$class, rpart_predictions_improved, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c("Actual", "Predicted"))
```
The amount of false negatives dropped by roughly 33% and overall accuracy actually increased. The improvement in this case is sufficient in that it reduced the costly errors, however the model can be further improved via trying to eliminate them entirely.
## Random Forests
The randomForest function essentially creates multiple trees using subsets of the data in order to aggregate class output and conducts a vote to choose the class for the target variable based on this vote. The number of features used is generally sqrt(p), so in this case 4. The same process was repeated for the random forest, excluding the secondary accuracy checks, and is provided below.
```{r, include = TRUE, warning = FALSE, fig.align='center', comment="",fig.dim=c(8,4)}
# create the model
rf_model <- randomForest(class~., data = d_train)
# inspect model and error plot of model
rf_model
plot(rf_model, main = str_to_title("error plot for random forest"))
#make a prediction
rf_predictions <- predict(rf_model, d_test[,-17], type = "class")
# get accuracy
mean(rf_predictions == d_test$class)
#compare this to predicting every result positive or every result negative
#mean(rf_predictions == "Positive")
#mean(rf_predictions == "Negative")
# cross tabulate to identify the types of errors and discuss
CrossTable(d_test$class, rf_predictions, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c("Actual", "Predicted"))
```
Although this model is very accurate, there are a few false negatives, as can be seen in the cross-table above, which is still dangerous. So the following makes use of the *cutoff* parameter to reduce the likelihood of false negatives. The results are very good and provides a better solution than the rpart function as we can successfully eliminate the false negatives completely.
```{r, include = TRUE, warning = FALSE, fig.align='center', comment=""}
# How to reduce the likelihood of a false negative?
# make a prediction with a cutoff parameter base don ROC curve
rf_model_improved <- randomForest(formula = class~., data = d_train, type = "class",
cutoff = c(.80,.20))
rf_predictions_improved <- predict(rf_model_improved, d_test[,-17], type = "class")
# get new accuracy
mean(rf_predictions_improved == d_test$class)
# cross tabulate to identify the types of errors and discuss
CrossTable(d_test$class, rf_predictions_improved, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c("Actual", "Predicted"))
```
## 1R (OneR)
The oneR function is technically a rule learner opposed to a decision tree in that it allows existing partitions to be modified, while decision trees cannot. However, this algorithm operates in a tree-like manner and tries to use 1 feature as the deciding feature to predict the class of the target variable. The algorithm isolates the feature that yields the highest accuracy to be the predictor (rule) for unseen data.
The following shows the model, the rule generated by the model, and a diagnostic plot for the model
```{r, include = TRUE, warning = FALSE, fig.align='center', comment="",fig.dim=c(8,4)}
# create a 1R classifer
model_1r <- OneR(class ~ ., data = d_train)
summary(model_1r)
plot(model_1r)
# get predictions and cross-tabulate with actual output
predictions_1r <- predict(model_1r, d_test[,-17])
# get new accuracy
mean(predictions_1r == d_test$class)
# cross tabulate results
CrossTable(d_test$class, predictions_1r, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c("Actual", "Predicted"))
```
Clearly the OneR algorithm is very bad in this case, and this should have been apparent as one feature or condition is generally not enough to make a diabetes determination. Seeing as there is not a way to improve an algorithm that only uses 1 feature to classify future cases, an entirely new algorithm would have to be used. The JRip() function from the RWeka package can be used to incorporate more than 1 feature and improve accuracy if a rule learner is valued over a decision tree.
## C5.0
The C5.0 algorithm is said to be industry standard. This algorithm uses entropy, a measure of set homogeneity, to create partitions. Then partitions that optimize entropy via reducing it in order to increase the similarity of groups are accepted. The following shows the model created, the accuracy, and the overall results.
```{r, include = TRUE, warning = FALSE, fig.align='center', comment="",fig.dim=c(8,4)}
# train a model
c5_model <- C5.0(x = d_train[,-17], y = d_train$class)
c5_model
# view tree decisions
summary(c5_model)
# view plots of subtrees in C5.0 model
# for (i in 0:25){
# plot(c5_model, subtree = i)
# }
# evaluate the model performance
c5_predictions <- predict(c5_model, d_test[,-17])
# get new accuracy
mean(d_test$class == c5_predictions)
# cross tabulate results
CrossTable(d_test$class,c5_predictions, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c("Actual", "Predicted"))
```
The overall accuracy is not bad, however the quantity of false negatives is unacceptable. One way of improving this model is to set the trials parameter in the C5.0 function. This is essentially within function boosting, and the results are below for trials set to 10.
```{r, include = TRUE, warning = FALSE, fig.align='center', comment="",fig.dim=c(8,4)}
# train a model
c5_model <- C5.0(x = d_train[,-17], y = d_train$class, trials =10)
c5_model
# evaluate the model performance
c5_predictions <- predict(c5_model, d_test[,-17])
# get new accuracy
mean(d_test$class == c5_predictions)
# cross tabulate results
CrossTable(d_test$class,c5_predictions, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c("Actual", "Predicted"))
```
The overall accuracy has improved and the number of false negatives has reduced, however the amount of false negatives present is still fairly high for diagnosis. The C5.0 function allows for a cost matrix, similar to the rpart function, and when included it can reduce the quantity of false negatives further as shown below:
The cost matrix should be set to the cost of false negative relative to the cost of a false positive, but here it is set to 4x. The number of false positives then reduces 40% without a major increase/reduction to false positives. Overall accuracy increases as well, so this combination of boosting & costs, improves the model drastically.
```{r, include = TRUE, warning = FALSE, fig.align='center', comment="",fig.dim=c(8,4)}
# train a model
c5_model <- C5.0(x = d_train[,-17], y = d_train$class, trials =10,
costs = matrix(c(0,1,4,0), nrow = 2, byrow = TRUE))
c5_model
# evaluate the model performance
c5_predictions <- predict(c5_model, d_test[,-17])
# get new accuracy
mean(d_test$class == c5_predictions)
# cross tabulate results
CrossTable(d_test$class,c5_predictions, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c("Actual", "Predicted"))