# (PART) Use-cases {-}
# FIFA 19 {#UseCaseFIFA}
```{r, echo=FALSE, warning=FALSE}
source("code_snippets/ema_init.R")
```
## Introduction {#FIFAintro}
In the previous chapters, we introduced a range of methods for the exploration of predictive models. Different methods were discussed in separate chapters, and while illustrated, they were not directly compared. Thus, in this chapter, we apply the methods to one dataset in order to present their relative merits. In particular, we present an example of the full process of model development along the lines introduced in Chapter \@ref(modelDevelopmentProcess). This will allow us to show how one can combine results from different methods.\index{dataset ! FIFA}
<!----
The main goal of this chapter is to show how different techniques complement each other. Some phases, like data preparation, are simplified in order to leave space for the method for visual exploration and explanation of predictive models.
---->
The Fédération Internationale de Football Association (FIFA) is the governing body of football (sometimes, especially in the USA, called soccer). FIFA is also a series of video games developed by EA Sports that faithfully reproduces the characteristics of real players. FIFA ratings of football players from the video game can be found at `https://sofifa.com/`. Data from this website for 2019 were scraped and made available at the Kaggle webpage `https://www.kaggle.com/karangadiya/fifa19`.
We will use the data to build a predictive model for the evaluation of a player's value. Subsequently, we will use the model exploration and explanation methods to better understand the model's performance, as well as which variables influence a player's value, and how.
## Data preparation {#FIFAdataprep}
```{r warning=FALSE, message=FALSE, echo=FALSE}
set.seed(1313)
library("ggmosaic")
library("ggplot2")
library("DALEX")
library("patchwork")
library("scales")
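# Formatter for axis labels in Euro; cents are shown only for small, non-integer values.
euro_format <- function(largest_with_cents = 100000) {
  function(x) {
    x <- round(x, 2)  # round to the nearest cent
    if (max(x, na.rm = TRUE) < largest_with_cents &&
        !all(x == floor(x), na.rm = TRUE)) {
      nsmall <- 2L
    } else {
      x <- round(x, 0)  # round to whole Euro
      nsmall <- 0L
    }
    # paste0() is used so that no additional package is needed
    paste0("€", format(x, nsmall = nsmall, trim = TRUE, big.mark = ",", scientific = FALSE, digits = 1L))
  }
}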
load("misc/fifa19small.rda")
rownames(fifa19small) <- fifa19small$Name
colnames(fifa19small)[9] <- "Reputation"
```
The original dataset contains 89 variables that describe 16,924 players. The variables include information such as age, nationality, club, wage, etc. In what follows, we focus on the 45 variables included in the `fifa` data frame available in the `DALEX` package for R and Python. The variables from this dataset are listed in Table \@ref(tab:FIFAvariables).
Table: (\#tab:FIFAvariables) Variables in the FIFA 19 dataset.
```{r FIFAvariables, warning=FALSE, message=FALSE, echo=FALSE}
kableExtra::kable(matrix(colnames(fifa19small), ncol = 5), format = "simple")
```
In particular, variable `Value.EUR` contains the player's value in EUR. This will be our dependent variable.
The distribution of the variable is heavily skewed to the right. In particular, the quartiles are equal to 325,000 EUR, 725,000 EUR, and 2,534,478 EUR. There are three players with a value higher than 100 million EUR.
```{r warning=FALSE, message=FALSE, echo=FALSE, eval=FALSE}
summary(fifa19small$Value.EUR)
(subset(fifa19small,fifa19small$Value.EUR>100000000)$Name)
```
Thus, in our analyses, we will consider the logarithmically-transformed player's value. Figure \@ref(fig:distFIFA19Value) presents the empirical cumulative-distribution function and histogram for the transformed value. They indicate that the transformation makes the distribution less skewed.
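As a rough numerical check of this claim, the sketch below computes a quartile-based (Bowley) skewness coefficient from the three quartiles quoted above, before and after the base-10 logarithmic transformation. The choice of this coefficient and the helper name `bowley_skewness` are our assumptions, used only for illustration. The sketch also assumes that the quartiles of the transformed values are (up to interpolation) the transformed quartiles, which holds because the logarithm is monotone.

```python
import math

def bowley_skewness(q1, q2, q3):
    """Quartile skewness: 0 for a symmetric spread, positive for right skew."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Quartiles of the player's value (in EUR) quoted in the text.
quartiles_eur = (325_000, 725_000, 2_534_478)
quartiles_log10 = tuple(math.log10(q) for q in quartiles_eur)

skew_raw = bowley_skewness(*quartiles_eur)
skew_log = bowley_skewness(*quartiles_log10)

print(f"raw scale:   {skew_raw:.2f}")
print(f"log10 scale: {skew_log:.2f}")
```

The coefficient drops from about 0.64 on the original scale to about 0.22 on the logarithmic scale, which is consistent with a less skewed, though still not symmetric, distribution.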
(ref:distFIFA19ValueDesc) The empirical cumulative-distribution function and histogram for the log$_{10}$-transformed players' values.
```{r distFIFA19Value, warning=FALSE, message=FALSE, echo=FALSE, fig.width=9, fig.height=4.5, fig.cap='(ref:distFIFA19ValueDesc)', out.width = '90%', fig.align='center'}
library("scales")
pl1 <- ggplot(fifa19small, aes(Value.EUR)) +
stat_ecdf(geom = "step", pad = FALSE) +
theme_drwhy() +
scale_x_continuous("Estimated value in Euro", trans = "log10", labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("ECDF for players' value","") +
scale_y_continuous("Fraction of players with value not higher than x", labels = scales::percent) + theme_ema
pl2 <- ggplot(fifa19small, aes(Value.EUR)) +
geom_histogram(bins = 50) +
theme_drwhy() +
scale_x_continuous("Value in Euro", trans = "log10", labels = dollar_format(suffix = "€", prefix = "")) +
ylab("Number of players with given value") +
ggtitle("Histogram for players' value","") + theme_ema
pl1 + pl2
```
Additionally, we take a closer look at four characteristics that will be considered as explanatory variables later in this chapter. These are: `Age`, `Reactions` (a movement skill), `BallControl` (a general skill), and `Dribbling` (a general skill).
Figure \@ref(fig:distFIFA19histograms) presents histograms of the values of the four variables. From the plot for `Age` we can conclude that most of the players are between 20 and 30 years of age (median age: 25). Variable `Reactions` has an approximately symmetric distribution, with quartiles equal to 56, 62, and 68. Histograms of `BallControl` and `Dribbling` indicate, interestingly, bimodal distributions. The smaller modes are due to goalkeepers.
(ref:distFIFA19histogramsDesc) Histograms for selected characteristics of players.
```{r distFIFA19histograms, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=6.5, fig.cap='(ref:distFIFA19histogramsDesc)', out.width = '90%', fig.align='center'}
fifa19small4 <- fifa19small[,c("Age", "Reactions", "BallControl", "Dribbling")]
library("tidyr")
fifa19small4long <- gather(fifa19small4, variable, value)
ggplot(fifa19small4long, aes(value)) +
geom_histogram() +
theme_drwhy() + facet_wrap(~variable, ncol = 2, scales = "free") + ggtitle("Histograms for players' characteristics","") + scale_x_continuous("") + theme_ema
```
### Code snippets for R
A subset of the 5000 most valuable players from the FIFA 19 data is available in the `fifa` data frame in the `DALEX` package.
```{r, eval=FALSE}
library("DALEX")
head(fifa)
```
### Code snippets for Python
A subset of the 5000 most valuable players from the FIFA 19 data can be loaded into Python with the `dalex.datasets.load_fifa()` function.
```{python, eval=FALSE}
import dalex as dx
fifa = dx.datasets.load_fifa()
```
## Data understanding {#FIFAdataunderst}
We will investigate the relationship between the four selected characteristics and the (logarithmically-transformed) player's value. Toward this aim, we use the scatter plots shown in Figure \@ref(fig:distFIFA19scatter). Each plot includes a smoothed curve capturing the trend.
For `Age`, the relationship is not monotonic. There seems to be an optimal age, between 25 and 30 years, at which the player's value reaches the maximum. On the other hand, the value of the youngest and the oldest players is about 10 times lower than the maximum.
For variables `BallControl` and `Dribbling`, the relationship is not monotonic. In general, the larger the value of these variables, the larger the value of a player. However, there are "local" maxima for players with low scores for `BallControl` and `Dribbling`. As suggested earlier, these are probably goalkeepers.
For `Reactions`, the association with the player's value is monotonic, with increasing values of the variable leading to increasing values of players.
(ref:distFIFA19scatterDesc) Scatter plots illustrating the relationship between the (logarithmically-transformed) player's value and selected characteristics.
```{r distFIFA19scatter, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=6.5, fig.cap='(ref:distFIFA19scatterDesc)', out.width = '90%', fig.align='center'}
fifa19small4v <- fifa19small[,c("Value.EUR","Age", "Reactions", "BallControl", "Dribbling")]
fifa19small4long <- gather(fifa19small4v, variable, value, -Value.EUR)
ggplot(fifa19small4long, aes(value, Value.EUR)) +
geom_point() + geom_smooth(size = 2, se = FALSE) +
theme_drwhy() +
facet_wrap(~variable, ncol = 2, scales = "free") +
scale_y_continuous("Value in Euro", trans = "log10", labels = dollar_format(suffix = "€", prefix = "")) +
scale_x_continuous("") +
ggtitle("Scatterplots for players' characteristics","") + theme_ema
```
Figure \@ref(fig:distFIFA19scatter2) presents the scatter-plot matrix for the four selected variables. It indicates that all variables are positively correlated, though with different strength. In particular, `BallControl` and `Dribbling` are strongly correlated, with the estimated correlation coefficient larger than 0.9. `Reactions` is moderately correlated with the other three variables. Finally, there is a moderate correlation between `Age` and `Reactions`, but not much correlation between `Age` and either `BallControl` or `Dribbling`.
(ref:distFIFA19scatter2Desc) Scatter-plot matrix illustrating the relationship between selected characteristics of players.
```{r distFIFA19scatter2, warning=FALSE, message=FALSE, echo=FALSE, fig.width=10, fig.height=9, fig.cap='(ref:distFIFA19scatter2Desc)', out.width = '90%', fig.align='center'}
library("GGally")
ggpairs(fifa19small4v[,-1],
diag = list(continuous = "barDiag")) +
theme_drwhy() +
ggtitle("Scatterplot matrix for players' characteristics","") + theme_ema
```
## Model assembly {#FIFAmodelassembly}
In this section, we develop a model for players' values. We consider all variables other than `Name`, `Club`, `Position`, `Value.EUR`, `Overall`, and `Special` (see Section \@ref(FIFAdataprep)) as explanatory variables. The base-10 logarithm of the player's value is the dependent variable. <!--The data to be analyzed are stored in data frame `fifa19small_red`, as indicated in the code below. -->
```{r, warning=FALSE, message=FALSE, echo=FALSE}
# log10 transformation
fifa19small <- fifa19small[fifa19small$Value.EUR > 1, ]
fifa19small$LogValue <- log10(fifa19small$Value.EUR)
fifa19small_red <- fifa19small[,-c(1, 2, 3, 4, 6, 7)]
```
Given different possible forms of relationship between the (logarithmically-transformed) player's value and explanatory variables (as seen, for example, in Figure \@ref(fig:distFIFA19scatter)), we build four different, flexible models to check whether they are capable of capturing the various relationships. In particular, we consider the following models:
- a boosting model with 250 trees of 1-level depth, as implemented in package `gbm` [@gbm],
- a boosting model with 250 trees of 4-level depth (this model should be able to capture interactions between variables),\index{package | gbm}
- a random forest model with 250 trees, as implemented in package `ranger` [@rangerRpackage],
- a linear model with a spline-transformation of explanatory variables, as implemented in package `rms` [@rms].
These models will be explored in detail in the following sections.
### Code snippets for R
In this section, we show R-code snippets used to develop the gradient boosting model. Other models were built in a similar way.
The code below fits the model to the data. The dependent variable `LogValue` contains the base-10 logarithm of `Value.EUR`, i.e., of the player's value.
```{r createModelsEx, warning=FALSE, message=FALSE, eval=FALSE}
library("gbm")  # provides the gbm() function used below
fifa$LogValue <- log10(fifa$Value.EUR)
fifa_small <- fifa[,-c(1, 2, 3, 4, 6, 7)]
fifa_gbm_deep <- gbm(LogValue~., data = fifa_small, n.trees = 250,
interaction.depth = 4, distribution = "gaussian")
```
```{r createModels, warning=FALSE, message=FALSE, echo=FALSE}
library("gbm")
fifa_gbm_shallow <- gbm(LogValue~., data = fifa19small_red, n.trees = 250,
interaction.depth = 1, distribution = "gaussian")
fifa_gbm_deep <- gbm(LogValue~., data = fifa19small_red, n.trees = 250,
interaction.depth = 4, distribution = "gaussian")
library("ranger")
fifa_rf <- ranger(LogValue~., data = fifa19small_red, num.trees = 250)
library("rms")
fifa_ols <- ols(LogValue ~ rcs(Age) + rcs(Reputation) +
rcs(Skill.Moves) + rcs(Crossing) + rcs(Finishing) +
rcs(HeadingAccuracy) + rcs(ShortPassing) + rcs(Volleys) +
rcs(Dribbling) + rcs(Curve) + rcs(FKAccuracy) +
rcs(LongPassing) + rcs(BallControl) + rcs(Acceleration) +
rcs(SprintSpeed) + rcs(Agility) + rcs(Reactions) +
rcs(Balance) + rcs(ShotPower) + rcs(Jumping) + rcs(Stamina) +
rcs(Strength) + rcs(LongShots) + rcs(Aggression) +
rcs(Interceptions) + rcs(Positioning) + rcs(Vision) +
rcs(Penalties) + rcs(Composure) + rcs(Marking) +
rcs(StandingTackle) + rcs(SlidingTackle) + rcs(GKDiving) +
rcs(GKHandling) + rcs(GKKicking) + rcs(GKPositioning) +
rcs(GKReflexes), data = fifa19small_red)
```
For model-exploration purposes, we have to create an explainer object with the help of the `DALEX::explain()` function (see Section \@ref(ExplainersTitanicRCode)). The code below is used for the gradient boosting model. Note that the model was fitted to the logarithmically-transformed player's value. However, it is more natural to interpret the predictions on the original scale. This is why, in the provided syntax, we apply the `predict_function` argument to specify a user-defined function to obtain predictions on the original scale, in Euro. Additionally, we use the `data` and `y` arguments to indicate the data frame with explanatory variables and the values of the dependent variable, for which predictions are to be obtained. Finally, the model receives its own `label`.
```{r createExplainersEx, message=FALSE, warning=FALSE, results='hide', eval=FALSE}
library("DALEX")
fifa_gbm_exp_deep <- DALEX::explain(fifa_gbm_deep,
data = fifa_small, y = 10^fifa_small$LogValue,
predict_function = function(m,x) 10^predict(m, x, n.trees = 250),
label = "GBM deep")
```
```{r createExplainers, message=FALSE, warning=FALSE, results='hide', echo=FALSE}
library("DALEX")
fifa_gbm_exp_deep <- DALEX::explain(fifa_gbm_deep,
data = fifa19small_red, y = 10^fifa19small_red$LogValue,
predict_function = function(m,x) 10^predict(m, x, n.trees = 250),
label = "GBM deep")
fifa_gbm_exp_shallow <- DALEX::explain(fifa_gbm_shallow,
data = fifa19small_red, y = 10^fifa19small_red$LogValue,
predict_function = function(m,x) 10^predict(m, x, n.trees = 250),
label = "GBM shallow")
fifa_rf_exp <- DALEX::explain(fifa_rf,
data = fifa19small_red, y = 10^fifa19small_red$LogValue,
predict_function = function(m,x) 10^predict(m, x)$predictions,
label = "RF")
fifa_rm_exp <- DALEX::explain(fifa_ols,
data = fifa19small_red, y = 10^fifa19small_red$LogValue,
predict_function = function(m,x) 10^predict(m, x),
label = "RM")
```
### Code snippets for Python
In this section, we show Python-code snippets used to develop the gradient boosting model. Other models were built in a similar way.
The code below fits the model to the data. The dependent variable `ylog` contains the natural logarithm of `value_eur`, i.e., of the player's value.
```{python, eval=FALSE}
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
import numpy as np
X = fifa.drop(["nationality", "overall", "potential",
"value_eur", "wage_eur"], axis = 1)
y = fifa['value_eur']
ylog = np.log(y)
X_train, X_test, ylog_train, ylog_test, y_train, y_test = \
    train_test_split(X, ylog, y, test_size = 0.25, random_state = 4)
# verbose = -1 silences the training messages
gbm_model = LGBMRegressor(verbose = -1)
gbm_model.fit(X_train, ylog_train)
```
For model-exploration purposes, we have to create the explainer-object with the help of the `Explainer()` constructor from the `dalex` library (see Section \@ref(ExplainersTitanicPythonCode)). The code is provided below. Note that the model was fitted to the logarithmically-transformed player's value. However, it is more natural to interpret the predictions on the original scale. This is why, in the provided syntax, we apply the `predict_function` argument to specify a user-defined function to obtain predictions on the original scale, in Euro. Additionally, we use the `X` and `y` arguments to indicate the data frame with explanatory variables and the values of the dependent variable, for which predictions are to be obtained. Finally, the model receives its own `label`.
```{python, eval=FALSE}
def predict_function(model, data):
    return np.exp(model.predict(data))
fifa_gbm_exp = dx.Explainer(gbm_model, X_test, y_test,
                   predict_function = predict_function, label = 'gbm')
```
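To make the role of the user-defined `predict_function` more concrete, the following minimal sketch uses a hypothetical stand-in model (`ToyLogModel`, with made-up coefficients and a made-up `skill` variable) that returns predictions on the natural-logarithm scale. The wrapper applies the inverse transformation, so that the explainer would see values in Euro. Note that the inverse has to match the transformation used for fitting: `exp` pairs with the natural logarithm used in the Python snippets, while `10^` pairs with the base-10 logarithm used in the R snippets.

```python
import math

class ToyLogModel:
    """Hypothetical stand-in for a model fitted to log-transformed values."""

    def predict(self, rows):
        # Made-up relationship: log(value) = 10 + 0.05 * skill.
        return [10.0 + 0.05 * row["skill"] for row in rows]

def predict_function(model, data):
    # Undo the natural-log transformation so that predictions are in EUR.
    return [math.exp(p) for p in model.predict(data)]

players = [{"skill": 60}, {"skill": 80}]
model = ToyLogModel()

log_predictions = model.predict(players)             # log scale
eur_predictions = predict_function(model, players)   # original scale

print(log_predictions)
print([round(value) for value in eur_predictions])
```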
## Model audit {#FIFAmodelaudit}
Having developed the four candidate models, we may want to evaluate their performance. Toward this aim, we can use the measures discussed in Section \@ref(modelPerformanceMethodCont). The computed values are presented in Table \@ref(tab:modelPerformanceFIFA). The values of the root-mean-squared-error (RMSE) and median-absolute-deviation (MAD) are the smallest for the random forest model.
Table: (\#tab:modelPerformanceFIFA) Model-performance measures for the four models for the FIFA 19 data.
```{r modelPerformance, warning=FALSE, message=FALSE, echo=FALSE}
library("DALEX")
fifa_mr_gbm_shallow <- model_performance(fifa_gbm_exp_shallow)
fifa_mr_gbm_deep <- model_performance(fifa_gbm_exp_deep)
fifa_mr_rf <- model_performance(fifa_rf_exp)
fifa_mr_rm <- model_performance(fifa_rm_exp)
perf_mat <- rbind(unlist(fifa_mr_gbm_shallow$measures),
unlist(fifa_mr_gbm_deep$measures),
unlist(fifa_mr_rf$measures),
unlist(fifa_mr_rm$measures))
rownames(perf_mat) <- c("GBM shallow","GBM deep","RF","RM")
colnames(perf_mat) <- c("MSE", "RMSE", "R2", "MAD")
kableExtra::kable(perf_mat, format = "simple")
```
<!----
Figure \@ref(fig:modelPerformanceBoxplot) compares distributions of absolute model residuals. Crosses corresponds to average, which correspond to RMSE. On average, smallest residuals are for the Random Forest model.
(ref:modelPerformanceBoxplotDesc) Distribution of absolute values of residuals. The means are indicated by dots.
---->
```{r modelPerformanceBoxplot, warning=FALSE, message=FALSE, echo=FALSE, eval = FALSE, fig.width=8, fig.height=3, fig.cap='(ref:modelPerformanceBoxplotDesc)', out.width = '90%', fig.align='center'}
plot(fifa_mr_gbm_shallow, fifa_mr_gbm_deep, fifa_mr_rf, fifa_mr_rm, geom = "boxplot") +
scale_y_continuous("Absolute residuals in Euro", trans = "log10", labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("Distributions of absolute residuals","") + theme_ema
```
In addition to computing measures of the overall performance of the model, we should conduct a more detailed examination of both overall- and instance-specific performance. Toward this aim, we can apply residual diagnostics, as discussed in Chapter \@ref(residualDiagnostic).
```{r modelPerformanceResids, warning=FALSE, message=FALSE, echo=FALSE}
fifa_md_gbm_shallow <- model_diagnostics(fifa_gbm_exp_shallow)
fifa_md_gbm_deep <- model_diagnostics(fifa_gbm_exp_deep)
fifa_md_rf <- model_diagnostics(fifa_rf_exp)
fifa_md_rm <- model_diagnostics(fifa_rm_exp)
```
For instance, we can create a plot comparing the predicted (fitted) and observed values of the dependent variable.
(ref:modelPerformanceScatterplotDesc) Observed and predicted (fitted) players' values for the four models for the FIFA 19 data.
```{r modelPerformanceScatterplot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=10, fig.height=10, fig.cap='(ref:modelPerformanceScatterplotDesc)', out.width = '90%', fig.align='center'}
plot(fifa_md_gbm_shallow, fifa_md_gbm_deep, fifa_md_rf, fifa_md_rm,
variable = "y", yvariable = "y_hat") +
scale_x_continuous("Value in Euro", trans = "log10",
labels = dollar_format(suffix = "€", prefix = "")) +
scale_y_continuous("Predicted value in Euro", trans = "log10",
labels = dollar_format(suffix = "€", prefix = "")) +
facet_wrap(~label) +
geom_abline(slope = 1) +
theme(legend.position = "none") +
ggtitle("Predicted and observed players' values", "") + theme_ema
```
The resulting plot is shown in Figure \@ref(fig:modelPerformanceScatterplot). It indicates that predictions are closest to the observed values of the dependent variable for the random forest model. It is worth noting that the smoothed trend for the model is close to a straight line, but with a slope smaller than 1. This implies that the random forest model underestimates the actual value of the most expensive players, while it overestimates the value of the least expensive ones. A similar pattern can be observed for the gradient boosting models. This "shrinking to the mean" is typical for this type of model. \index{Smoothed | trend}
### Code snippets for R
In this section, we show R-code snippets for model audit for the gradient boosting model. For other models a similar syntax was used.
The `model_performance()` function (see Section \@ref(modelPerformanceR)) is used to calculate the values of RMSE, MSE, R$^2$, and MAD for the model.
```{r modelAuditFifaR, warning=FALSE, message=FALSE, echo=TRUE, eval=FALSE}
model_performance(fifa_gbm_exp_deep)
```
The `model_diagnostics()` function (see Section \@ref(RcodeResidualDiagnostic)) is used to create residual-diagnostics plots. Results of this function can be visualised with the generic `plot()` function. In the code that follows, additional arguments are used to improve the appearance and interpretability of both axes.
```{r modelAuditFifaR2, warning=FALSE, message=FALSE, echo=TRUE, eval=FALSE}
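library("ggplot2")  # scale_*_continuous(), geom_abline(), ggtitle()
library("scales")   # dollar_format()
fifa_md_gbm_deep <- model_diagnostics(fifa_gbm_exp_deep)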
plot(fifa_md_gbm_deep,
variable = "y", yvariable = "y_hat") +
scale_x_continuous("Value in Euro", trans = "log10",
labels = dollar_format(suffix = "€", prefix = "")) +
scale_y_continuous("Predicted value in Euro", trans = "log10",
labels = dollar_format(suffix = "€", prefix = "")) +
geom_abline(slope = 1) +
ggtitle("Predicted and observed players' values", "")
```
### Code snippets for Python
In this section, we show Python-code snippets used to perform residual diagnostics for the trained gradient boosting model. Other models were tested in a similar way.
The `fifa_gbm_exp.model_diagnostics()` method (see Section \@ref(PythoncodeResidualDiagnostic)) is used to calculate the residuals and absolute residuals.
Results of this method can be visualised with the `plot()` method. The code below produces diagnostic plots similar to those presented in Figure \@ref(fig:modelPerformanceScatterplot).
```{python, eval=FALSE}
fifa_md_gbm = fifa_gbm_exp.model_diagnostics()
fifa_md_gbm.plot(variable = "y", yvariable = "y_hat")
```
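The four model-performance measures reported in Table \@ref(tab:modelPerformanceFIFA) can also be sketched from scratch. The helper `regression_measures()` below is hypothetical and is not part of `dalex`; it assumes that residuals are defined as observed minus predicted values and that MAD denotes the median of the absolute residuals. The numbers are toy values, not FIFA data.

```python
import math
import statistics

def regression_measures(observed, predicted):
    """Return MSE, RMSE, R2, and MAD for paired observed/predicted values."""
    residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
    squared = [r ** 2 for r in residuals]
    mse = sum(squared) / len(squared)
    mean_y = sum(observed) / len(observed)
    total_ss = sum((y - mean_y) ** 2 for y in observed)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1 - sum(squared) / total_ss,
        "MAD": statistics.median([abs(r) for r in residuals]),
    }

# Toy values only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 5.0]
measures = regression_measures(observed, predicted)
print(measures)
```

Because MSE and RMSE square the residuals, a few players with very large errors can dominate them, whereas MAD reflects the size of a typical error.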
## Model understanding (dataset-level explanations) {#FIFAmodelunderst}
All four developed models involve many explanatory variables. It is of interest to understand which of the variables exert the largest influence on the models' predictions. Toward this aim, we can apply the permutation-based variable-importance measure discussed in Chapter \@ref(featureImportance). Subsequently, we can construct a plot of the obtained mean (over the default 10 permutations) variable-importance measures. Note that we consider only the top-20 variables. \index{Dataset-level explanation}
```{r featureImportance, warning=FALSE, message=FALSE, echo=FALSE}
fifa_mp_gbm_shallow <- model_parts(fifa_gbm_exp_shallow)
fifa_mp_gbm_deep <- model_parts(fifa_gbm_exp_deep)
fifa_mp_rf <- model_parts(fifa_rf_exp)
fifa_mp_rm <- model_parts(fifa_rm_exp)
```
(ref:featureImportancePlotDesc) Mean variable-importance calculated using 10 permutations for the four models for the FIFA 19 data.
```{r, warning=FALSE, message=FALSE, eval=FALSE, echo=FALSE}
plot(fifa_mp_gbm_shallow, fifa_mp_gbm_deep, fifa_mp_rf, fifa_mp_rm,
max_vars = 20, bar_width = 4, show_boxplots = FALSE) +
ggtitle("Feature Importance", "")
```
```{r featureImportancePlot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=12, fig.cap='(ref:featureImportancePlotDesc)', out.width = '100%', fig.align='center'}
plot(fifa_mp_gbm_shallow, fifa_mp_gbm_deep, fifa_mp_rf, fifa_mp_rm,
max_vars = 20, bar_width = 4, show_boxplots = FALSE) +
ggtitle("Feature Importance", "") + theme_ema
```
The resulting plot is shown in Figure \@ref(fig:featureImportancePlot). The bar for each explanatory variable starts at the RMSE value of a particular model and ends at the (mean) RMSE calculated for data with permuted values of the variable.
Figure \@ref(fig:featureImportancePlot) indicates that, for the gradient boosting and random forest models, the two explanatory variables with the largest values of the importance measure are `Reactions` and `BallControl`. The importance of other variables varies depending on the model. Interestingly, in the linear-regression model, the highest importance is given to goal-keeping skills.
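The idea behind the permutation-based measure can be sketched in a few lines of plain Python. The code below is a simplified illustration and not the `DALEX` or `dalex` implementation: the function `permutation_importance()`, the toy model, and the variables `skill` and `shoe_size` are all hypothetical. The importance of a variable is taken as the mean RMSE after shuffling its values minus the RMSE for the original data, i.e., the length of the corresponding bar in the plot.

```python
import math
import random

def rmse(observed, predicted):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed))

def permutation_importance(predict, rows, observed, variable, n_permutations=10, seed=0):
    """Mean increase in RMSE after shuffling the values of one variable."""
    rng = random.Random(seed)
    baseline = rmse(observed, predict(rows))
    permuted_losses = []
    for _ in range(n_permutations):
        shuffled = [row[variable] for row in rows]
        rng.shuffle(shuffled)
        permuted_rows = [{**row, variable: value} for row, value in zip(rows, shuffled)]
        permuted_losses.append(rmse(observed, predict(permuted_rows)))
    return sum(permuted_losses) / n_permutations - baseline

# Toy data (not FIFA): the hypothetical model uses `skill` and ignores `shoe_size`.
data_rng = random.Random(1)
rows = [{"skill": data_rng.uniform(40, 90), "shoe_size": data_rng.uniform(38, 47)}
        for _ in range(50)]
observed = [2.0 * row["skill"] + data_rng.gauss(0, 1) for row in rows]

def toy_predict(data):
    return [2.0 * row["skill"] for row in data]

for name in ("skill", "shoe_size"):
    print(name, round(permutation_importance(toy_predict, rows, observed, name), 3))
```

Shuffling a variable that the model ignores leaves the predictions, and hence the RMSE, unchanged, so its importance is zero; shuffling a variable that the model relies on increases the RMSE.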
We may also want to take a look at the partial-dependence (PD) profiles discussed in Chapter \@ref(partialDependenceProfiles). Recall that they illustrate how the expected value of a model's predictions behaves as a function of an explanatory variable. To create the profiles, we apply function `model_profile()` from the `DALEX` package (see Section \@ref(PDPR)). We focus on variables `Reactions`, `BallControl`, and `Dribbling` that were important in the random forest model (see Figure \@ref(fig:featureImportancePlot)). We also consider `Age`, as it had some effect in the gradient boosting models. Subsequently, we can construct a plot of contrastive PD profiles (see Section \@ref(contrastivePDPs)) that is shown in Figure \@ref(fig:usecaseFIFApdPlot).
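The mechanics of a PD profile can be illustrated with a minimal plain-Python sketch. It is not the `model_profile()` implementation; the function `partial_dependence()`, the toy model, and the variables `x1` and `x2` are hypothetical. For each value on a grid, the selected variable is set to that value for all observations and the model's predictions are averaged.

```python
def partial_dependence(predict, rows, variable, grid):
    """Average prediction over `rows` with `variable` fixed at each grid value."""
    profile = []
    for value in grid:
        modified = [{**row, variable: value} for row in rows]
        predictions = predict(modified)
        profile.append(sum(predictions) / len(predictions))
    return profile

# Toy data and a hypothetical model with an interaction (not the FIFA models).
rows = [{"x1": 1.0, "x2": 0.0}, {"x1": 2.0, "x2": 1.0}, {"x1": 3.0, "x2": 2.0}]

def toy_predict(data):
    return [row["x1"] * row["x2"] + row["x1"] for row in data]

grid = [0.0, 1.0, 2.0]
pd_x1 = partial_dependence(toy_predict, rows, "x1", grid)
print(pd_x1)  # [0.0, 2.0, 4.0]
```

In this toy example the prediction equals `x1 * (x2 + 1)` and the mean of `x2 + 1` over the data is 2, so the profile for `x1` is the line `2 * x1`.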
```{r usecaseFIFApd, warning=FALSE, message=FALSE, echo=FALSE}
selected_variables <- c("Reactions", "BallControl", "Dribbling", "Age")
fifa19_pd_shallow <- model_profile(fifa_gbm_exp_shallow,
variables = selected_variables)$agr_profiles
fifa19_pd_deep <- model_profile(fifa_gbm_exp_deep,
variables = selected_variables)$agr_profiles
fifa19_pd_rf <- model_profile(fifa_rf_exp, variables = selected_variables)$agr_profiles
fifa19_pd_rm <- model_profile(fifa_rm_exp, variables = selected_variables)$agr_profiles
```
(ref:usecaseFIFApdPlotDesc) Contrastive partial-dependence profiles for the four models and selected explanatory variables for the FIFA 19 data.
```{r usecaseFIFApdPlot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=9, fig.height=8, fig.cap='(ref:usecaseFIFApdPlotDesc)', out.width = '90%', fig.align='center'}
plot(fifa19_pd_shallow, fifa19_pd_deep, fifa19_pd_rf, fifa19_pd_rm) +
scale_y_continuous("Predicted value in Euro", trans = "log10",
labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("Contrastive partial-dependence profiles for selected variables","") + theme_ema
```
Figure \@ref(fig:usecaseFIFApdPlot) indicates that the shape of the PD profiles for `Reactions`, `BallControl`, and `Dribbling` is, in general, similar for all the models and implies an increasing predicted player's value for an increasing (at least, after passing some threshold) value of the explanatory variable. However, for `Age`, the shape is different and suggests a decreasing player's value after the age of about 25 years. It is worth noting that the range of expected model's predictions is, in general, the smallest for the random forest model. Also, the three tree-based models tend to stabilize the predictions at the ends of the explanatory-variable ranges.
The most interesting difference between the conclusions drawn from Figure \@ref(fig:distFIFA19scatter) and those obtained from Figure \@ref(fig:usecaseFIFApdPlot) is observed for variable `Age`. In particular, Figure \@ref(fig:distFIFA19scatter) suggests that the relationship between player's age and value is non-monotonic, while Figure \@ref(fig:usecaseFIFApdPlot) suggests a non-increasing relationship. How can we explain this difference? A possible explanation is as follows. The youngest players have lower values, not because of their age, but because of their lower skills, which are correlated (as seen from the scatter-plot matrix in Figure \@ref(fig:distFIFA19scatter2)) with young age. The simple data exploration analysis, presented in the upper-left panel of Figure \@ref(fig:distFIFA19scatter), cannot separate the effects of age and skills. As a result, the analysis suggests a decrease in player's value for the youngest players. In models, however, the effect of age is estimated while adjusting for the effect of skills. After this adjustment, the effect takes the form of a non-increasing pattern, as shown by the PD profiles for `Age` in Figure \@ref(fig:usecaseFIFApdPlot).
This example indicates that *exploration of models may provide more insight than exploration of raw data*. In exploratory data analysis, the effect of variable `Age` was confounded by the effect of skill-related variables. The models estimate the effect of `Age` while adjusting for those variables, which removes the confounding.
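The mechanism can be illustrated with a small simulation. The sketch below is a toy example, not the FIFA data: the function `true_value()`, the simulated players, and all numerical constants are assumptions made purely for illustration. Skill grows with age for young players, while the hypothetical model lets the value depend positively on skill and negatively on age above 25 years.

```python
import random

random.seed(0)

def true_value(age, skill):
    # Hypothetical "model": value grows with skill and falls after the age of 25.
    return 3.0 * skill - 2.0 * max(age - 25.0, 0.0)

# Simulated players: skill grows with age until about 27 and then levels off.
players = []
for _ in range(5000):
    age = random.uniform(17, 40)
    skill = 40.0 + 3.0 * (min(age, 27.0) - 17.0) + random.gauss(0, 5)
    players.append((age, skill))

def raw_mean_value(lo, hi):
    # Exploratory view: average value among players whose age is in [lo, hi).
    vals = [true_value(a, s) for a, s in players if lo <= a < hi]
    return sum(vals) / len(vals)

def pd_age(age):
    # Model view: set the same age for everyone, keep each player's own skill.
    return sum(true_value(age, s) for _, s in players) / len(players)

raw = {band: raw_mean_value(*band) for band in [(17, 20), (25, 28), (37, 40)]}
pd_by_age = {age: pd_age(age) for age in (18, 26, 38)}

print(raw)        # lowest for the youngest band, highest for the middle band
print(pd_by_age)  # non-increasing in age
```

In this toy setting, the raw averages are non-monotonic in age, because the youngest players are also the least skilled, whereas the averaged model predictions for a fixed age do not increase with age.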
### Code snippets for R
In this section, we show R-code snippets for dataset-level exploration for the gradient boosting model. For other models, a similar syntax was used.
The `model_parts()` function from the `DALEX` package (see Section \@ref(featureImportanceR)) is used to calculate the permutation-based variable-importance measure. The generic `plot()` function is applied to graphically present the computed values of the measure. The `max_vars` argument is used to limit the number of presented variables to 20.
```{r modelLeveleModelsEx1, warning=FALSE, message=FALSE, eval=FALSE}
fifa_mp_gbm_deep <- model_parts(fifa_gbm_exp_deep)
plot(fifa_mp_gbm_deep, max_vars = 20,
bar_width = 4, show_boxplots = FALSE)
```
The `model_profile()` function from the `DALEX` package (see Section \@ref(PDPR)) is used to calculate PD profiles. The generic `plot()` function is used to graphically present the profiles for selected variables.
```{r modelLeveleModelsEx2, warning=FALSE, message=FALSE, eval=FALSE}
selected_variables <- c("Reactions", "BallControl", "Dribbling", "Age")
fifa19_pd_deep <- model_profile(fifa_gbm_exp_deep,
variables = selected_variables)
plot(fifa19_pd_deep)
```
### Code snippets for Python
In this section, we show Python-code snippets for dataset-level exploration for the gradient boosting model. For other models, a similar syntax was used.
The `model_parts()` method from the `dalex` library (see Section \@ref(featureImportancePython)) is used to calculate the permutation-based variable-importance measure. The `plot()` method is applied to graphically present the computed values of the measure.
```{python, eval=FALSE}
fifa_mp_gbm = fifa_gbm_exp.model_parts()
fifa_mp_gbm.plot(max_vars = 20)
```
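To make the idea behind the permutation-based measure more tangible, the sketch below computes it by hand for a toy problem. It is not the `dalex` implementation: the data, the `model()` function, and the helper `permutation_importance()` are hypothetical, and the importance is expressed here as the increase in RMSE after permuting a variable, which is only one of the possible conventions.

```python
import random

random.seed(1)

# Toy data: x1 drives the target strongly, x2 weakly, x3 not at all.
X = [[random.random() for _ in range(3)] for _ in range(500)]
y = [5.0 * x1 + 1.0 * x2 for x1, x2, _ in X]

def model(row):
    # Stand-in for the predict function of a fitted model (hypothetical).
    return 5.0 * row[0] + 1.0 * row[1]

def rmse(rows, target):
    return (sum((model(r) - t) ** 2 for r, t in zip(rows, target)) / len(rows)) ** 0.5

def permutation_importance(j, n_repeats=10):
    # Increase in RMSE after shuffling column j, averaged over repeats.
    base = rmse(X, y)
    losses = []
    for _ in range(n_repeats):
        col = [r[j] for r in X]
        random.shuffle(col)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
        losses.append(rmse(shuffled, y))
    return sum(losses) / n_repeats - base

importance = [permutation_importance(j) for j in range(3)]
print(importance)
```

Permuting a variable that the model does not use leaves the loss unchanged, so its importance is zero, while permuting the most influential variable increases the loss the most.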
The `model_profile()` method from the `dalex` library (see Section \@ref(PDPPython)) is used to calculate PD profiles. The `plot()` method is used to graphically present the computed profiles.
```{python, eval=FALSE}
fifa_mp_gbm = fifa_gbm_exp.model_profile()
fifa_mp_gbm.plot(variables = ['movement_reactions',
'skill_ball_control', 'skill_dribbling', 'age'])
```
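Recall that a PD profile is the average of CP profiles computed for individual observations. The sketch below shows this computation by hand. The `model()` function, the five data rows, and the grid of ages are assumptions made for illustration; they do not come from the FIFA data or from the `dalex` library.

```python
def model(age, skill):
    # Toy stand-in for a fitted model (hypothetical).
    return skill * 2.0 - max(age - 25, 0) * 1.5

data = [(19, 60), (24, 72), (28, 80), (33, 75), (36, 68)]  # (age, skill) rows
grid = [18, 22, 26, 30, 34, 38]

def cp_profile(row, grid):
    # Ceteris-paribus profile: vary age over the grid, keep this row's skill.
    _, skill = row
    return [model(a, skill) for a in grid]

def pd_profile(data, grid):
    # Partial-dependence profile: pointwise average of the CP profiles.
    profiles = [cp_profile(row, grid) for row in data]
    return [sum(vals) / len(vals) for vals in zip(*profiles)]

pd_age = pd_profile(data, grid)
print(pd_age)  # [142.0, 142.0, 140.5, 134.5, 128.5, 122.5]
```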
To calculate other types of profiles, just change the `type` argument.
```{python, eval=FALSE}
fifa_mp_gbm = fifa_gbm_exp.model_profile(type = 'accumulated')
fifa_mp_gbm.plot(variables = ['movement_reactions',
'skill_ball_control', 'skill_dribbling', 'age'])
```
## Instance-level explanations {#FIFAinstanceunderst}
After evaluation of the models at the dataset-level, we may want to focus on particular instances.
### Robert Lewandowski {#FIFALewy}
As a first example, we take a look at the value of *Robert Lewandowski*, for an obvious reason. Table \@ref(tab:RobertLewandowski) presents his characteristics, as included in the analyzed dataset. Robert Lewandowski is a striker.
Table: (\#tab:RobertLewandowski) Characteristics of Robert Lewandowski.
```{r RobertLewandowski, echo=FALSE}
tmp <- data.frame(variable = colnames(fifa19small_red["R. Lewandowski",]),
value = round(unlist(fifa19small_red["R. Lewandowski",])))
tmp4 <- cbind(tmp[1:10,],
tmp[11:20,],
tmp[21:30,],
tmp[31:40,])
kableExtra::kable(tmp4, format = "simple", row.names = FALSE)
#fifa19small_red["R. Lewandowski",]
```
First, we take a look at variable attributions, discussed in Chapter \@ref(breakDown). Recall that they decompose model's prediction into parts that can be attributed to different explanatory variables. The attributions can be presented in a break-down (BD) plot. For brevity, we only consider the random forest model. The resulting BD plot is shown in Figure \@ref(fig:usecaseFIFAbreakDownPlot).
```{r usecaseFIFAbreakDown, warning=FALSE, message=FALSE, echo=FALSE}
fifa_bd_rf <- variable_attribution(fifa_rf_exp,
new_observation = fifa19small_red["R. Lewandowski",])
```
(ref:usecaseFIFAbreakDowDesc) Break-down plot for Robert Lewandowski for the random forest model.
```{r usecaseFIFAbreakDownPlot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=4, fig.cap='(ref:usecaseFIFAbreakDowDesc)', out.width = '100%', fig.align='center'}
pl1 <- plot(fifa_bd_rf) +
scale_y_continuous("Predicted value in Euro",
labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("Break-down plot for Robert Lewandowski","") + theme_ema
pl1
```
<!----
fifa_bd_gbm <- variable_attribution(fifa_gbm_exp_shallow,
new_observation = fifa19small["R. Lewandowski",])
pl1 <- plot(fifa_bd_gbm) +
scale_y_continuous("Predicted value in Euro", labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("Break-down plot for Robert Lewandowski")
pl1 + pl2
--->
Figure \@ref(fig:usecaseFIFAbreakDownPlot) suggests that the explanatory variables with the largest effect are `Composure`, `Volleys`, `LongShots`, and `Stamina`. However, in Chapter \@ref(breakDown) it was mentioned that variable attributions may depend on the order of explanatory covariates that are used in calculations. Thus, in Chapter \@ref(shapley) we introduced Shapley values, based on the idea of averaging the attributions over many orderings. Figure \@ref(fig:usecaseFIFAshapPlot) presents the means of the Shapley values computed by using 25 random orderings for the random forest model.
```{r usecaseFIFAshap, warning=FALSE, message=FALSE, echo=FALSE}
set.seed(1990)
fifa_shap_rf <- variable_attribution(fifa_rf_exp,
new_observation = fifa19small_red["R. Lewandowski",], type = "shap")
```
(ref:usecaseFIFAshapPlotDesc) Shapley values for Robert Lewandowski for the random forest model.
```{r usecaseFIFAshapPlot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=4, fig.cap='(ref:usecaseFIFAshapPlotDesc)', out.width = '100%', fig.align='center'}
plot(fifa_shap_rf, show_boxplots = FALSE) +
scale_y_continuous("Estimated value in Euro",
labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("Shapley values for Robert Lewandowski","") + theme_ema
```
<!---
fifa_pg <- predict_parts(fifa_gbm_exp_shallow, new_observation = fifa19small["R. Lewandowski",],
type = "shap")
plot(fifa_pg, show_boxplots = FALSE) +
scale_y_continuous("Estimated value in Euro", labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("SHAP values plot for Robert Lewandowski (GBM model)")
--->
Figure \@ref(fig:usecaseFIFAshapPlot) indicates that the five explanatory variables with the largest Shapley values are `BallControl`, `Dribbling`, `Reactions`, `ShortPassing`, and `Positioning`. This makes sense, as Robert Lewandowski is a striker.
In Chapter \@ref(ceterisParibus), we introduced ceteris-paribus (CP) profiles. They capture the effect of a selected explanatory variable in terms of changes in a model's prediction induced by changes in the variable's values. Figure \@ref(fig:usecaseFIFAceterisParibusPlot) presents the profiles for variables `Age`, `Reactions`, `BallControl`, and `Dribbling` for the random forest model.
<!--The profiles can be obtained with the help of function `predict_profiles()` package (see Section \@ref(CPR)). We first have got to compute and store the values of the profiles for the selected variables. In the code below we use argument `variable_splits` to provide the values, at which the profiles are to be computed. -->
<!---
fifa_cp_shallow <- predict_profile(fifa_gbm_exp_shallow,
new_observation = fifa19small_red["R. Lewandowski",], variables = selected_variables,
variable_splits = selected_splits)
fifa_cp_deep <- predict_profile(fifa_gbm_exp_deep,
new_observation = fifa19small_red["R. Lewandowski",], variables = selected_variables,
variable_splits = selected_splits)
fifa_cp_rf <- predict_profile(fifa_rf_exp,
new_observation = fifa19small_red["R. Lewandowski",], variables = selected_variables,
variable_splits = selected_splits)
fifa_cp_rm <- predict_profile(fifa_rm_exp,
new_observation = fifa19small_red["R. Lewandowski",], variables = selected_variables,
variable_splits = selected_splits)
plot(fifa_cp_shallow, fifa_cp_deep, fifa_cp_rf, fifa_cp_rm, color = "_label_", variables = c("Age", "Reactions", "BallControl", "Dribbling")) +
scale_y_continuous("Estimated value in Euro", trans = "log10", labels = dollar_format(suffix = "€", prefix = ""))
--->
```{r usecaseFIFAceterisParibus, warning=FALSE, message=FALSE, echo=FALSE}
selected_splits <- list(Age = seq(15,45,0.1), Reactions = seq(20,100,0.1),
BallControl = seq(20,100,0.1), Dribbling = seq(20,100,0.1))
fifa_cp_rf <- individual_profile(fifa_rf_exp,
new_observation = fifa19small_red["R. Lewandowski",],
variables = selected_variables,
variable_splits = selected_splits)
```
<!--
Subsequently, we plot the profiles with the help of the `plot()`function.
-->
(ref:usecaseFIFAceterisParibusPlotDesc) Ceteris-paribus profiles for Robert Lewandowski for four selected variables and the random forest model.
```{r usecaseFIFAceterisParibusPlot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=6.5, fig.cap='(ref:usecaseFIFAceterisParibusPlotDesc)', out.width = '90%', fig.align='center'}
plot(fifa_cp_rf, #color = "_label_",
variables = c("Age", "Reactions", "BallControl", "Dribbling")) +
scale_y_continuous("Estimated value in Euro", trans = "log10",
labels = dollar_format(suffix = "€", prefix = "")) +
theme(legend.position = "none") + ggtitle("Ceteris-paribus profile", "") + theme_ema
```
Figure \@ref(fig:usecaseFIFAceterisParibusPlot) suggests that, among the four variables, `BallControl` and `Reactions` lead to the largest changes of predictions for this instance. For all four variables, the profiles flatten at the left- and right-hand-side edges. The predicted value of Robert Lewandowski reaches or is very close to the maximum for all four profiles. It is interesting to note that, for `Age`, the predicted value is located at the border of the age region at which the profile suggests a sharp drop in player's value.
As argued in Chapter \@ref(localDiagnostics), it is worthwhile to check how the model behaves for observations similar to the instance of interest. Toward this aim, we may want to compare the distribution of residuals for the "neighbors" of Robert Lewandowski with the overall distribution. Figure \@ref(fig:usecaseFIFAceterisParibusNeighboursPlot) presents the histogram of residuals for all data and for the 30 neighbors of Robert Lewandowski.
<!-- Function `individual_diagnostics()` from the `DALEX` package provides the necessary functionality. First, we use it to select 30 closest neighbours and conduct the necessary computations for them. -->
```{r usecaseFIFAceterisParibusNeighbours, warning=FALSE, message=FALSE, echo=FALSE}
id_rf <- individual_diagnostics(fifa_rf_exp, fifa19small_red["R. Lewandowski",],
neighbors = 30)
```
(ref:usecaseFIFAceterisParibusNeighboursPlotDesc) Distribution of residuals for the random forest model for all players and for 30 neighbors of Robert Lewandowski.
```{r usecaseFIFAceterisParibusNeighboursPlot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=4.5, fig.cap='(ref:usecaseFIFAceterisParibusNeighboursPlotDesc)', out.width = '90%', fig.align='center'}
plot(id_rf) + theme_ema
```
Clearly, the neighbors of Robert Lewandowski include some of the most expensive players. Therefore, as compared to the overall distribution, the distribution of residuals for the neighbors, presented in Figure \@ref(fig:usecaseFIFAceterisParibusNeighboursPlot), is skewed to the right, and its mean is larger than the overall mean. Thus, the model underestimates the actual value of the most expensive players. This was also noted based on the plot in the bottom-left panel of Figure \@ref(fig:modelPerformanceScatterplot).
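The comparison of residuals for the neighbors with the residuals for all observations can be sketched in a few lines. The example below is a toy illustration under stated assumptions: a single hypothetical variable `skill`, a model that misses a premium for the most skilled players, and neighbors selected by the absolute difference in `skill`. Residuals are defined as the actual value minus the predicted one, so positive residuals indicate underestimation.

```python
import random

random.seed(3)

# Toy data: the (hypothetical) model underestimates high-value observations.
rows = []
for _ in range(1000):
    skill = random.uniform(40, 95)
    actual = skill + max(skill - 85, 0) * 2.0 + random.gauss(0, 1)
    predicted = skill  # the model misses the extra premium above skill 85
    rows.append((skill, actual - predicted))

def mean(xs):
    return sum(xs) / len(xs)

target_skill = 94.0
# 30 nearest neighbors of the instance of interest, by distance in skill.
neighbors = sorted(rows, key=lambda r: abs(r[0] - target_skill))[:30]

mean_resid_all = mean([r for _, r in rows])
mean_resid_neighbors = mean([r for _, r in neighbors])
share_positive = mean([1.0 if r > 0 else 0.0 for _, r in neighbors])

print(mean_resid_all, mean_resid_neighbors, share_positive)
```

In this toy setting, the mean residual for the neighbors is clearly larger than the overall mean, and almost all residuals for the neighbors are positive.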
We can also look at the local-stability plot, i.e., the plot that includes CP profiles for the nearest neighbors and the corresponding residuals (see Chapter \@ref(localDiagnostics)). In Figure \@ref(fig:usecaseFIFAceterisParibusNeighboursAgeRFPlot), we present the plot for `Age`.
<!-- Thus, we use argument `variables = "Age"` in the call to function `individual_diagnostics()`. -->
```{r usecaseFIFAceterisParibusNeighboursAgeGBM, warning=FALSE, message=FALSE, echo=FALSE}
id_rf_age <- individual_diagnostics(fifa_rf_exp, fifa19small_red["R. Lewandowski",],
neighbors = 30, variables = "Age")
```
<!--Then we apply function `plot()` to obtain the desired plot.-->
(ref:usecaseFIFAceterisParibusNeighboursAgeRFPlotDesc) Local-stability plot for `Age` for 30 neighbors of Robert Lewandowski and the random forest model.
```{r usecaseFIFAceterisParibusNeighboursAgeRFPlot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=5, fig.cap='(ref:usecaseFIFAceterisParibusNeighboursAgeRFPlotDesc)', out.width = '90%', fig.align='center'}
plot(id_rf_age) +
scale_y_continuous("Estimated value in Euro", trans = "log10",
labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("Local-stability plot for Robert Lewandowski", "") + theme_ema
```
The CP profiles in Figure \@ref(fig:usecaseFIFAceterisParibusNeighboursAgeRFPlot) are almost parallel but span quite a wide range of the predicted player's values. Thus, one could conclude that the predictions for the most expensive players are not very stable. Also, the plot includes more positive residuals (indicated in the plot by green vertical intervals) than negative ones (indicated by red vertical intervals). This confirms the conclusion drawn from Figure \@ref(fig:usecaseFIFAceterisParibusNeighboursPlot) that the values of the most expensive players are underestimated by the model.
### Code snippets for R
In this section, we show R-code snippets for instance-level exploration for the gradient boosting model. For other models, a similar syntax was used.
The `predict_parts()` function from the `DALEX` package (see Chapters \@ref(breakDown)-\@ref(shapley)) is used to calculate variable attributions. Note that we apply the `type = "break_down"` argument to prepare BD plots. The generic `plot()` function is used to present the attributions graphically.
```{r instanceLeveleModelsEx1, warning=FALSE, message=FALSE, eval=FALSE}
fifa_bd_gbm <- predict_parts(fifa_gbm_exp,
new_observation = fifa["R. Lewandowski",],
type = "break_down")
plot(fifa_bd_gbm) +
scale_y_continuous("Predicted value in Euro",
labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("Break-down plot for Robert Lewandowski","")
```
Shapley values are computed by applying the `type = "shap"` argument.
```{r instanceLeveleModelsEx2, warning=FALSE, message=FALSE, eval=FALSE}
fifa_shap_gbm <- predict_parts(fifa_gbm_exp,
new_observation = fifa["R. Lewandowski",],
type = "shap")
plot(fifa_shap_gbm, show_boxplots = FALSE) +
scale_y_continuous("Estimated value in Euro",
labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("Shapley values for Robert Lewandowski","")
```
The `predict_profile()` function from the `DALEX` package (see Section \@ref(CPR)) is used to calculate the CP profiles. The generic `plot()` function is applied to graphically present the profiles.
```{r modelLeveleModelsEx3, warning=FALSE, message=FALSE, eval=FALSE}
selected_variables <- c("Reactions", "BallControl", "Dribbling", "Age")
fifa_cp_gbm <- predict_profile(fifa_gbm_exp,
new_observation = fifa["R. Lewandowski",],
variables = selected_variables)
plot(fifa_cp_gbm, variables = selected_variables)
```
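Finally, the `predict_diagnostics()` function (see Section \@ref(cPLocDiagR)) is used to construct local-stability plots. The `variables` argument indicates the explanatory variable for which the CP profiles of the neighbors are computed; when it is omitted, the function instead compares the distributions of residuals. The generic `plot()` function is used to present the resulting plot.
```{r modelLeveleModelsEx4, warning=FALSE, message=FALSE, eval=FALSE}
id_gbm <- predict_diagnostics(fifa_gbm_exp,
fifa["R. Lewandowski",],
neighbors = 30, variables = "Age")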
plot(id_gbm) +
scale_y_continuous("Estimated value in Euro", trans = "log10",
labels = dollar_format(suffix = "€", prefix = ""))
```
### Code snippets for Python
In this section, we show Python-code snippets for instance-level exploration for the gradient boosting model. For other models, a similar syntax was used.
First, we need to select the instance of interest. In this example, we will use *Cristiano Ronaldo*.
```{python, eval=FALSE}
cr7 = X.loc['Cristiano Ronaldo',]
```
The `predict_parts()` method from the `dalex` library (see Sections \@ref(BDPython) and \@ref(SHAPPythonCode)) can be used to calculate variable attributions. The `plot()` method with the `max_vars` argument is applied to graphically present the corresponding BD plot for up to 20 variables.
```{python, eval=FALSE}
fifa_pp_gbm = fifa_gbm_exp.predict_parts(cr7, type='break_down')
fifa_pp_gbm.plot(max_vars = 20)
```
To calculate Shapley values, the `predict_parts()` method should be applied with the `type='shap'` argument (see Section \@ref(SHAPPythonCode)).
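The relation between the two types of attributions can be sketched by hand. The example below is an illustration, not the `dalex` implementation: the `model()` function with an interaction, the random data, and the helpers `break_down()` and `shapley()` are hypothetical. Break-down attributions are obtained by fixing the values of the variables one by one, in a given order, and recording the change in the mean prediction. Shapley values are then approximated by averaging these attributions over random orderings.

```python
import random

random.seed(2)

def model(row):
    # Hypothetical model with an interaction, so attributions depend on order.
    a, b, c = row
    return 2.0 * a + b * c

data = [[random.random() for _ in range(3)] for _ in range(200)]
instance = [1.0, 1.0, 1.0]

def mean_prediction(fixed):
    # Mean prediction when variables in `fixed` are set to the instance's values.
    total = 0.0
    for row in data:
        patched = [instance[j] if j in fixed else row[j] for j in range(3)]
        total += model(patched)
    return total / len(data)

def break_down(order):
    # Attribution = change in the mean prediction when a variable is fixed in turn.
    contrib, fixed = {}, set()
    prev = mean_prediction(fixed)
    for j in order:
        fixed.add(j)
        cur = mean_prediction(fixed)
        contrib[j] = cur - prev
        prev = cur
    return contrib

bd_forward = break_down([0, 1, 2])
bd_reverse = break_down([2, 1, 0])

def shapley(n_orderings=25):
    # Average the break-down attributions over random orderings.
    sums = [0.0, 0.0, 0.0]
    for _ in range(n_orderings):
        order = random.sample(range(3), 3)
        for j, v in break_down(order).items():
            sums[j] += v
    return [s / n_orderings for s in sums]

shap = shapley()
print(bd_forward, bd_reverse, shap)
```

For every ordering, the attributions sum up to the difference between the prediction for the instance and the mean prediction. However, the attributions of the two interacting variables differ between orderings, which is the motivation for averaging.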
The `predict_profile()` method from the `dalex` library (see Section \@ref(CPPython)) allows calculation of the CP profiles. The `plot()` method with the `variables` argument plots the profiles for selected variables.
```{python, eval=FALSE}
fifa_mp_gbm = fifa_gbm_exp.predict_profile(cr7)
fifa_mp_gbm.plot(variables = ['movement_reactions',
'skill_ball_control', 'skill_dribbling', 'age'])
```
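A CP profile can also be summarized numerically, for instance, by the range of predictions it spans. The sketch below illustrates this for a toy case. The `model()` function, the instance, and the grids are assumptions made for illustration; they are unrelated to the FIFA models and to the `dalex` API.

```python
def model(x):
    # Hypothetical stand-in for the prediction function of a fitted model.
    age_penalty = 2.0 * max(x["age"] - 30, 0)
    return 1.5 * x["reactions"] + 0.5 * x["dribbling"] - age_penalty

instance = {"age": 29, "reactions": 90, "dribbling": 85}
grids = {
    "age": range(18, 41),
    "reactions": range(40, 101, 5),
    "dribbling": range(40, 101, 5),
}

def cp_profile(variable):
    # Vary one variable over its grid; all other values stay as in the instance.
    return [model({**instance, variable: v}) for v in grids[variable]]

profiles = {name: cp_profile(name) for name in grids}
# Range of each profile: a crude summary of how much the prediction can move.
spread = {name: max(p) - min(p) for name, p in profiles.items()}

print(spread)  # {'age': 20.0, 'reactions': 90.0, 'dribbling': 30.0}
```

In this toy example, changes in `reactions` lead to the largest changes in the prediction, and the instance's age of 29 lies just before the region in which the profile for `age` starts to drop.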
### CR7 {#FIFACR7}
As a second example, we present explanations for the random forest model's prediction for *Cristiano Ronaldo* (CR7). Table \@ref(tab:CR7) presents his characteristics, as included in the analyzed dataset. Note that Cristiano Ronaldo, like Robert Lewandowski, is a striker. It might thus be of interest to compare the characteristics contributing to the model's predictions for the two players.
Table: (\#tab:CR7) Characteristics of Cristiano Ronaldo.
```{r CR7, echo=FALSE}
tmp <- data.frame(variable = colnames(fifa19small_red["Cristiano Ronaldo",]),
value = round(unlist(fifa19small_red["Cristiano Ronaldo",])))
tmp4 <- cbind(tmp[1:10,],
tmp[11:20,],
tmp[21:30,],
tmp[31:40,])
kableExtra::kable(tmp4, format = "simple", row.names = FALSE)
#fifa19small_red["Cristiano Ronaldo",]
```
The BD plot for Cristiano Ronaldo is presented in Figure \@ref(fig:usecaseFIFAbreakDownCR7Plot). It suggests that the explanatory variables with the largest effect are `ShotPower`, `LongShots`, `Volleys`, and `Vision`.
(ref:usecaseFIFAbreakDownCR7PlotDesc) Break-down plot for Cristiano Ronaldo for the random forest model.
```{r usecaseFIFAbreakDownCR7Plot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=5, fig.cap='(ref:usecaseFIFAbreakDownCR7PlotDesc)', out.width = '100%', fig.align='center'}
fifa_bd_rf_cr <- variable_attribution(fifa_rf_exp,
new_observation = fifa19small_red["Cristiano Ronaldo",])
plot(fifa_bd_rf_cr) +
scale_y_continuous("Estimated value in Euro", labels=dollar_format(suffix="€",prefix="")) +
ggtitle("Break-down plot for Cristiano Ronaldo", "") + theme_ema
```
Figure \@ref(fig:usecaseFIFAshapCR7Plot) presents Shapley values for Cristiano Ronaldo. It indicates that the four explanatory variables with the largest values are `Reactions`, `Dribbling`, `BallControl`, and `ShortPassing`. These are also the top four variables for Robert Lewandowski, though in a different order. Interestingly, the plot for Cristiano Ronaldo includes variable `Age`, for which the Shapley value is negative. It suggests that CR7's age has a negative effect on the model's prediction.
(ref:usecaseFIFAshapCR7PlotDesc) Shapley values for Cristiano Ronaldo for the random forest model.
```{r usecaseFIFAshapCR7Plot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=5, fig.cap='(ref:usecaseFIFAshapCR7PlotDesc)', out.width = '100%', fig.align='center'}
set.seed(1965)
fifa_shap_rf_cr <- variable_attribution(fifa_rf_exp,
new_observation = fifa19small_red["Cristiano Ronaldo",],
type = "shap")
plot(fifa_shap_rf_cr, show_boxplots = FALSE) +
scale_y_continuous("Estimated value in Euro",
labels = dollar_format(suffix = "€", prefix = "")) +
ggtitle("Shapley values for Cristiano Ronaldo","") + theme_ema
```
Finally, Figure \@ref(fig:usecaseFIFAceterisParibusCR7Plot) presents CP profiles for `Age`, `Reactions`, `Dribbling`, and `BallControl`.
(ref:usecaseFIFAceterisParibusCR7PlotDesc) Ceteris-paribus profiles for Cristiano Ronaldo for four selected variables and the random forest model.
```{r usecaseFIFAceterisParibusCR7Plot, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=6.5, fig.cap='(ref:usecaseFIFAceterisParibusCR7PlotDesc)', out.width = '90%', fig.align='center'}
selected_splits <- list(Age = seq(15,45,0.1), Reactions = seq(20,100,0.1),
BallControl = seq(20,100,0.1), Dribbling = seq(20,100,0.1))
fifa_cp_rf <- individual_profile(fifa_rf_exp,
new_observation = fifa19small_red["Cristiano Ronaldo",],
variables = selected_variables,
variable_splits = selected_splits)
plot(fifa_cp_rf, variables = c("Age", "Reactions", "BallControl", "Dribbling")) +
scale_y_continuous("Estimated value in Euro", trans = "log10",
labels = dollar_format(suffix = "€", prefix = "")) + theme_ema +
ggtitle("Ceteris-paribus profile", "")
```
The profiles are similar to those presented in Figure \@ref(fig:usecaseFIFAceterisParibusPlot) for Robert Lewandowski. An interesting difference is that, for `Age`, the predicted value for Cristiano Ronaldo is located within the age region associated with a sharp drop in player's value. This is in accordance with the observation, made based on Figure \@ref(fig:usecaseFIFAshapCR7Plot), that CR7's age has a negative effect on the model's prediction.
### Wojciech Szczęsny {#FIFASzczesny}
One might be interested in the characteristics influencing the random forest model's predictions for players other than strikers. To address the question, we present explanations for *Wojciech Szczęsny*, a goalkeeper. Table \@ref(tab:WS) presents his characteristics, as included in the analyzed dataset.
Table: (\#tab:WS) Characteristics of Wojciech Szczęsny.
```{r WS, echo=FALSE}
tmp <- data.frame(variable = colnames(fifa19small_red["W. Szczęsny",]),
value = round(unlist(fifa19small_red["W. Szczęsny",])))
tmp4 <- cbind(tmp[1:10,],
tmp[11:20,],
tmp[21:30,],
tmp[31:40,])
kableExtra::kable(tmp4, format = "simple", row.names = FALSE)
#fifa19small_red["W. Szczęsny",]
```
Figure \@ref(fig:usecaseFIFAbreakDownWS) shows the BD plot. We can see that the most important contributions come from the explanatory variables related to goalkeeping skills like `GKPositioning`, `GKHandling`, and `GKReflexes`. Interestingly, field-player skills like `BallControl` or `Dribbling` have a negative effect.
(ref:usecaseFIFAbreakDownWSDesc) Break-down plot for Wojciech Szczęsny for the random forest model.
```{r usecaseFIFAbreakDownWS, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=5, fig.cap='(ref:usecaseFIFAbreakDownWSDesc)', out.width = '100%', fig.align='center'}
fifa_bd_rf_ws <- variable_attribution(fifa_rf_exp,
new_observation = fifa19small_red["W. Szczęsny",])
plot(fifa_bd_rf_ws) +
scale_y_continuous("Estimated value in Euro", labels=dollar_format(suffix="€", prefix="")) +
ggtitle("Break-down plot for Wojciech Szczęsny","") + theme_ema
```
Figure \@ref(fig:usecaseFIFAshapWS) presents Shapley values (over 25 random orderings of explanatory variables). The plot confirms that the most important contributions to the prediction for Wojciech Szczęsny are due to goalkeeping skills like `GKDiving`, `GKPositioning`, `GKReflexes`, and `GKHandling`. Interestingly, `Reactions` is also important, as was the case for Robert Lewandowski (see Figure \@ref(fig:usecaseFIFAshapPlot)) and Cristiano Ronaldo (see Figure \@ref(fig:usecaseFIFAshapCR7Plot)).
(ref:usecaseFIFAshapWSDesc) Shapley values for Wojciech Szczęsny for the random forest model.
```{r usecaseFIFAshapWS, warning=FALSE, message=FALSE, echo=FALSE, fig.width=8, fig.height=5, fig.cap='(ref:usecaseFIFAshapWSDesc)', out.width = '100%', fig.align='center'}
set.seed(1994)
fifa_shap_rf_ws <- variable_attribution(fifa_rf_exp,
new_observation = fifa19small_red["W. Szczęsny",],
type = "shap")
plot(fifa_shap_rf_ws, show_boxplots = FALSE) +
scale_y_continuous("Estimated value in Euro", labels=dollar_format(suffix="€", prefix="")) +
ggtitle("Shapley values for Wojciech Szczęsny","") + theme_ema
```
<!---
{r usecaseFIFAbreakDownWSPlot, warning=FALSE, message=FALSE, echo=TRUE, eval = FALSE, fig.width=12, fig.height=6, fig.cap='(ref:usecaseFIFAbreakDownWSPlotDesc)', out.width = '100%', fig.align='center'}
pl1 <- plot(fifa_bd_gbm) +
scale_y_continuous("Estimated value in Euro", labels=dollar_format(suffix="€",prefix="")) +
ggtitle("Break Down plot for Wojciech Szczęsny (GBM model)")
pl2 <- plot(fifa_bd_rf) +
scale_y_continuous("Estimated value in Euro", labels=dollar_format(suffix="€", prefix="")) +
ggtitle("Break Down plot for Wojciech Szczęsny (RF model)")
pl1 + pl2
--->
### Lionel Messi {#FIFAMessi}
This instance might be THE choice for some of the readers. However, we have decided to leave explanation of the models' predictions in this case as an exercise to the interested readers.
```{r warning=FALSE, message=FALSE, echo=FALSE, eval=FALSE}
# examples for DALEX blogpost
set.seed(1313)
library("ggplot2")
library("DALEX")
load("misc/fifa19small.rda")
rownames(fifa19small) <- fifa19small$Name
# log10 transformation
fifa19small <- fifa19small[fifa19small$Value.EUR > 1, ]
fifa19small$LogValue <- log10(fifa19small$Value.EUR)
fifa19small <- fifa19small[,-c(1, 2, 3, 4, 6)]
library("gbm")
fifa_gbm_shallow <- gbm(LogValue~., data = fifa19small, n.trees = 250, interaction.depth = 1, distribution = "gaussian")
fifa_gbm <- gbm(LogValue~.,
data = fifa19small,
n.trees = 250,
interaction.depth = 4,
distribution = "gaussian")
library("DALEX")
fifa_exp <- DALEX::explain(fifa_gbm,
data = fifa19small,
y = 10^fifa19small$LogValue,
predict_function = function(m,x) 10^predict(m, x, n.trees = 250),
label = "GBM deep")
cr17 <- fifa19small["Cristiano Ronaldo",]
predict(fifa_exp, cr17)
euro_format <- function(largest_with_cents = 100000) {
function(x) {
x <- round(x, 0.01)
if (max(x, na.rm = TRUE) < largest_with_cents &
!all(x == floor(x), na.rm = TRUE)) {
nsmall <- 2L
} else {
x <- round(x, 1)
nsmall <- 0L
}
str_c("€", format(x, nsmall = nsmall, trim = TRUE, big.mark = ",", scientific = FALSE, digits=1L))
}
}
library(dplyr)
library(scales)
predict_parts(fifa_exp, cr17) %>% plot()
predict_parts(fifa_exp, cr17) %>% plot() + ggtitle("Break Down for CR7") + scale_y_continuous("",labels = dollar_format(suffix = "€", prefix = ""), limits=c(0,55000000))
predict_parts(fifa_exp, cr17, type = "shap") %>% plot(show_boxplots = FALSE) -> a
a + ggtitle("Shapley for CR7") + scale_y_continuous("",labels = dollar_format(suffix = "€", prefix = ""), limits=c(-16000000,16000000))
predict_profile(fifa_exp, cr17) %>% plot(variables = c("Age", "BallControl")) + ggtitle("Ceteris Paribus Profiles for CR7") + scale_y_continuous("",labels = dollar_format(suffix = "€", prefix = ""))
plot(a, show_boxplots = FALSE) + coord_flip(xlim=c(31,41)) + ggtitle("Shapley for CR7") + theme_ema
predict_diagnostics(fifa_exp, cr17) %>% plot() + scale_x_continuous("",labels = dollar_format(suffix = "€", prefix = ""))
model_performance(fifa_exp) %>% plot()
model_parts(fifa_exp) %>% plot(show_boxplots = FALSE, max_vars = 10)
model_profile(fifa_exp) %>% plot(variables = c("Age", "BallControl"), geom = "profiles") + ggtitle("Partial Dependence Profiles for GBM model")
model_profile(fifa_exp) %>% DALEX:::plot.model_profile_profiles(variables = c("Age", "BallControl")) + scale_y_continuous("",labels = dollar_format(suffix = "€", prefix = ""))+ ggtitle("Partial Dependence Profiles for GBM model") + theme_ema
```