---
format:
html:
number-depth: 3
css: summary-format.css
---
# Unsupervised Learning
Unsupervised learning refers to a set of statistical tools intended for the setting in which we have only a set of features $X_1, X_2, \dots, X_p$ measured on $n$ observations. The goal is to discover interesting things about the measurements, such as:
- Finding an informative way to visualize and explore the data
- Imputing missing values
- Discovering subgroups among the variables or among the observations
As a result, the exercise tends to be **more subjective**.
## Libraries to use
```{r}
#| output: false
# Data manipulation
library(data.table)
library(recipes)
# Data visualization
library(ggplot2)
theme_set(theme_light())
# Model creation
library(h2o)
library(cluster)
library(tidymodels)
library(tidyclust)
# Extracting model information
library(broom)
library(factoextra)
library(dendextend)
library(NbClust)
```
## Principal Components Analysis (PCA)
### Purpose
As the **number of variables increases**, checking *two-dimensional scatterplots* **gets less insightful**, since each plot contains just a small fraction of the total information present in the data set.
For example, if we see correlations between features, it is easy to create more general categories, known as **latent variables**, as follows:
- Sandwich
- cheese - mayonnaise
- mayonnaise - bread
- bread - cheese
- bread - lettuce
- Soda
- pepsi - coke
- 7up - coke
- Vegetables
- spinach - broccoli
- peas - potatoes
- peas - carrots
At the end this process can help to:
- Reduce the number of features needed to describe the data
- Remove multicollinearity between features
### Mathematical Description
PCA finds a **low-dimensional representation** of a data set that **contains as much of the variation (information)** as possible. Each new dimension is a **linear combination** of the $p$ original variables, and the resulting values are known as **principal component scores**.
$$
Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p
$$
Where:
- $\phi_1 = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})^T$ represents the loading vector for the first principal component.
- $\sum_{j=1}^p \phi_{j1}^2 = 1$
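As a quick sanity check of this definition, the sketch below (using the built-in `USArrests` data, chosen only for illustration) confirms that the first score vector is just the standardized data multiplied by the first loading vector, and that the loadings are normalized to a sum of squares of one.
```{r}
# illustration on USArrests: 4 numeric columns, centered and scaled
x_demo <- scale(as.matrix(USArrests))
pca_demo <- prcomp(x_demo, center = FALSE, scale. = FALSE)
# Z_1 is the linear combination of the columns with weights phi_1
z1_manual <- x_demo %*% pca_demo$rotation[, 1]
all.equal(unname(z1_manual[, 1]), unname(pca_demo$x[, 1]))
# the loading vector is normalized
sum(pca_demo$rotation[, 1]^2)
```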
### Steps to apply
To perform a *principal components analysis* (PCA) we need to:
(@) Make any needed transformation to *tidy the data*.
(@) Remove or impute any *missing value*.
(@) Transform all variables to be numeric, using methods like *one-hot encoding*.
(@) **Center** and **scale** all the variables in $\mathbf{X}$ to have mean zero and standard deviation one. This step is important to ensure all variables are on the same scale, particularly if they were measured in different units or have outliers. To see why this step matters, let's calculate how different two people are based on their height and weight (a short code sketch follows the figures below).
![](img/95-scaling-importance-original.png){fig-align="center" width=90% height=90%}
In both cases we get the same distance, which doesn't make sense, so let's standardize the values with the following transformation.
$$
\text{height}_\text{scaled} = \frac{\text{height} - mean(\text{height})}{sd(\text{height})}
$$
![](img/96-scaling-importance-scaled.png){fig-align="center" width=90% height=90%}
> Returning a more intuitive result.
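A minimal sketch of that standardization in R, with made-up heights and weights (note the units differ on purpose): before scaling, the distance is driven almost entirely by the variable with the larger magnitude.
```{r}
# hypothetical people: height in meters, weight in kilograms
people <- data.frame(height_m  = c(1.60, 1.75, 1.90),
                     weight_kg = c(60, 75, 95))
dist(people)          # dominated by weight, which has the larger scale
dist(scale(people))   # after standardizing, both variables contribute comparably
```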
(@) The **loading vector** is determined by solving the following optimization problem using *eigen decomposition*. This identifies the **direction with the largest variance** in the feature space and reveals the **relative contributions of the original features** to the new PCs.
$$
\underset{\phi_{11}, \dots, \phi_{p1}}{\text{maximize}}
\left\{ \frac{1}{n} \sum_{i = 1}^n
\left( \sum_{j=1}^p \phi_{j1} x_{ij} \right)^2 \right\}
\text{ subject to }
\sum_{j=1}^p \phi_{j1}^2 = 1 .
$$
![](img/29-pca-data-example.png){fig-align="center" width=90% height=90%}
(@) Repeat the process until we have $\min(n-1, p)$ distinct principal components. Each new component must be orthogonal to all previously computed principal components, which ensures that it captures a **new direction of variance** and is **uncorrelated** with the previous ones, as the short check below confirms.
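A quick check of that orthogonality (again on the built-in `USArrests` data, purely for illustration): the loading vectors returned by `prcomp()` are orthonormal and the resulting scores are uncorrelated.
```{r}
pca_check <- prcomp(USArrests, scale. = TRUE)
# loading vectors are orthonormal: t(rotation) %*% rotation is the identity
zapsmall(crossprod(pca_check$rotation))
# and the principal component scores are uncorrelated
zapsmall(cor(pca_check$x))
```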
### R implementation
1. Getting the data
```{r}
my_basket <- fread("https://koalaverse.github.io/homlr/data/my_basket.csv")
```
2. Apply the PCA function
::: {.panel-tabset group="library"}
#### stats
```{r}
stats_pca <- prcomp(
# Numeric data.frame o matrix
my_basket,
  # Set standard deviation to 1 before applying PCA;
  # this is really useful when the sd is very different between columns
  scale = TRUE,
# Set mean to 0 before applying PCA
# it's better to keep it TRUE
center = TRUE
)
```
After performing a PCA we can inspect the following elements:
```{r}
names(stats_pca)
```
- `center`: the column means used to center the data, or FALSE if the data weren't centered.
- `scale`: the column standard deviations used to scale the data, or FALSE if the data weren't scaled.
- `rotation`: the directions of the principal component vectors in terms of the original features/variables. This information allows you to define new data in terms of the original principal components.
- `x`: the value of each observation in the original dataset projected to the principal components.
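For example, `rotation`, `center`, and `scale` are what `predict()` uses to express new observations in terms of the fitted components; a small sketch, reusing the first three rows of `my_basket` as stand-in "new" data:
```{r}
# project "new" observations onto the fitted principal components
new_scores <- predict(stats_pca, newdata = my_basket[1:3])
# equivalently: standardize with the stored center/scale, then multiply by rotation
manual_scores <- scale(as.matrix(my_basket[1:3]),
                       center = stats_pca$center,
                       scale  = stats_pca$scale) %*% stats_pca$rotation
all.equal(unname(new_scores), unname(manual_scores))
```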
#### h2o
*Java is a prerequisite for H2O*.
``` {r}
# turn off progress bars for brevity
h2o.no_progress()
# connect to H2O instance with 5 gigabytes
h2o.init(max_mem_size = "5g")
# convert data to h2o object
my_basket.h2o <- as.h2o(my_basket)
```
- `pca_method`: When your data contains mostly numeric variables use **“GramSVD”**, but if the data contain many categorical variables (or just a few categorical variables with high cardinality) we recommend using **“GLRM”**.
- `k`: Integer specifying how many PCs to compute. Use `ncol(data)` to keep them all.
- `transform`: Character string specifying how (if at all) your data should be standardized.
- `impute_missing`: Logical specifying whether or not to impute missing values with the **corresponding column mean**.
- `max_runtime_secs`: Number specifying the max run time (in seconds) to limit the runtime for model training.
``` {r}
h2o_pca <- h2o.prcomp(
training_frame = my_basket.h2o,
pca_method = "GramSVD",
k = ncol(my_basket.h2o),
transform = "STANDARDIZE",
impute_missing = TRUE,
max_runtime_secs = 1000
)
h2o.shutdown(prompt = FALSE)
```
#### recipes
```{r}
pca_rec <- recipe(~., data = my_basket) |>
step_normalize(all_numeric()) |>
step_pca(all_numeric(), id = "pca") |>
prep()
```
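The prepped recipe stores the scores for the training data, which we can pull back out with `bake()`; here we just peek at the first rows:
```{r}
# extract the PC scores computed for the training data
bake(pca_rec, new_data = NULL) |>
  head(3)
```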
:::
### Extract the variance explained by each component
::: {.panel-tabset group="library"}
#### stats
```{r}
stats_pca_variance <-
tidy(stats_pca,
matrix = "eigenvalues")
setDT(stats_pca_variance)
stats_pca_variance[1:5]
```
#### h2o
``` {r}
h2o_pca@model$importance |>
t() |>
as.data.table(keep.rownames = "component") |>
head(5L)
```
#### recipes
```{r}
tidy(pca_rec,
id = "pca",
type = "variance") |>
filter(component <= 5) |>
pivot_wider(id_cols = component,
names_from = terms,
values_from = value)
```
:::
### Select the components to explore
#### Variance criterion
As the data were standardized, the **sum of the variance (eigenvalues) of all the components is equal to the number of variables** entered into the PCA.
```{r}
stats_pca_variance[, .(total_variance = sum(std.dev^2))]
```
A variance of 1 means that the principal component would explain about one variable’s worth of the variability. In that sense, we would only be interested in selecting **components with variance 1 or greater**.
```{r}
stats_pca_variance[std.dev >= 1]
```
#### Scree plot criterion
The scree plot criterion looks for the “elbow” in the curve and **selects all components just before the line flattens out**, which looks like **8** in our example.
::: {.panel-tabset group="visualization"}
##### ggplot2
```{r}
stats_pca_variance |>
ggplot(aes(PC, percent, group = 1, label = PC)) +
geom_point() +
geom_line() +
geom_text(nudge_y = -.002,
check_overlap = TRUE)
```
##### factoextra
```{r}
fviz_eig(stats_pca,
geom = "line",
ncp = 30,
ggtheme = theme_light())+
geom_text(aes(label = dim),
nudge_y = -.2,
check_overlap = TRUE)
```
:::
#### Proportion of variance explained criterion
Depending on the use case, the investigator might want to explain a particular proportion of variability. For example, to explain at least 75% of total variability we need to select the first 27 components.
```{r}
stats_pca_variance[cumulative >= 0.75][1L]
```
#### Conclusion
The frank answer is that **there is no one best method for determining how many components to use**. If we were merely trying to profile customers, we would probably use 8 or 10; if we were performing dimension reduction to feed into a downstream predictive model, we would likely retain 26 or more based on cross-validation.
### Interpret the results
- PC1 can be interpreted as the **Unhealthy Lifestyle** component, as the higher weights are associated with less healthy behaviors, such as alcohol (bulmers, red.wine, fosters, kronenbourg), sweets (mars, twix, kitkat), tobacco (cigarettes), and potentially gambling (lottery).
::: {.panel-tabset group="visualization"}
#### ggplot2
```{r }
stats_pca_loadings <-
tidy(stats_pca, matrix = "loadings")
setDT(stats_pca_loadings)
stats_pca_loadings[PC == 1
][order(-value)
][, rbind(.SD[1:10],
.SD[(.N-9L):.N])] |>
ggplot(aes(value, reorder(column, value))) +
geom_point()+
geom_vline(xintercept = 0,
linetype = 2,
linewidth = 1)+
labs(y = "Original Columns",
x = "Unhealthy / Entertainment Lifestyle")
```
#### factoextra
```{r}
fviz_contrib(stats_pca,
choice = "var",
axes = 1,
sort.val = "asc",
top = 25)+
coord_flip()
```
:::
- PC2 can be interpreted as a **Dine-in Food Choices** component, as the associated items are typically part of a main meal that one might consume for lunch or dinner.
::: {.panel-tabset group="visualization"}
#### ggplot2
```{r}
stats_pca_loadings[PC == 2
][order(-value)
][, rbind(.SD[1:10],
.SD[(.N-9L):.N])] |>
ggplot(aes(value, reorder(column, value))) +
geom_point()+
geom_vline(xintercept = 0,
linetype = 2,
linewidth = 1)+
labs(y = "Original Columns",
x = "Dine-in Food Choices")
```
#### factoextra
```{r}
fviz_contrib(stats_pca,
choice = "var",
axes = 2,
sort.val = "asc",
top = 25)+
coord_flip()
```
:::
- Find correlated features. As this can be hard when we have many features, let's use the `iris` data set.
```{r}
iris_pca <- prcomp(iris[-5])
```
::: {.panel-tabset group="visualization"}
#### ggplot2
```{r}
iris_pca$rotation |>
as.data.table(keep.rownames = "column") |>
ggplot(aes(`PC1`, `PC2`, label = column)) +
geom_vline(xintercept = 0,
linetype = 2,
linewidth = 0.3)+
geom_hline(yintercept = 0,
linetype = 2,
linewidth = 0.3)+
geom_text()+
geom_segment(aes(xend = `PC1`,
yend = `PC2`),
x = 0,
y = 0,
arrow = arrow(length = unit(0.15, "inches")))+
labs(title = "Component Loadings",
x = "PC 1",
y = "PC 2")
```
#### factoextra
```{r}
fviz_pca_var(iris_pca,
check_overlap = TRUE)+
labs(title = "Component Loadings",
x = "PC 1",
y = "PC 2")
```
:::
- Find out how much each feature contributes to the new components.
::: {.panel-tabset group="visualization"}
#### ggplot2
```{r}
stats_pca_loadings[PC <= 2L
][, contrib := sqrt(sum(value^2)),
by = "column"] |>
dcast(column + contrib ~ PC,
value.var = "value") |>
ggplot(aes(`1`, `2`, label = column)) +
geom_text(aes(alpha = contrib),
color = "navyblue",
check_overlap = TRUE)+
geom_vline(xintercept = 0,
linetype = 2,
linewidth = 0.3)+
geom_hline(yintercept = 0,
linetype = 2,
linewidth = 0.3)+
labs(title = "Component Loadings",
x = paste0("Unhealthy Lifestyle (",
scales::percent(stats_pca_variance$percent[1L],
accuracy = 0.01),")"),
y = paste0("Dine-in Food Choices (",
scales::percent(stats_pca_variance$percent[2L],
accuracy = 0.01),")"))+
theme(legend.position = "none")
```
#### factoextra
```{r}
fviz_pca_var(stats_pca,
geom = "text",
col.var = "contrib",
check_overlap = TRUE)+
scale_color_gradient(low = NA, high = "navyblue")+
coord_cartesian(xlim = c(-0.5,0.5),
ylim = c(-0.5,0.5))+
theme(legend.position = "none")
```
:::
## Clustering
It refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. Some applications could be to:
- Find a few unknown **subtypes of breast cancer**.
- Perform **market segmentation** by identify **subgroups of people** who might be more likely to purchase a particular product.
### K-means clustering
In *K-means clustering*, we seek to partition the observations into a **pre-specified number of non-overlapping clusters** $K$.
![](img/80-k-mean-clustering-example.png){fig-align="center"}
For this method, the main goal is to classify observations within clusters with **high intra-class similarity** *(low within-cluster variation)*, but with **low inter-class similarity**.
#### Mathematical Description
Let $C_1, \dots, C_K$ denote sets containing the **indices of the observations** in each cluster, where:
- Each observation belongs to at least one of the $K$ clusters. $C_1 \cup C_2 \cup \dots \cup C_K = \{1, \dots,n\}$
- No observation belongs to more than one cluster. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$.
$$
\underset{C_1, \dots, C_K}{\text{minimize}}
\left\{ \sum_{k=1}^K W(C_k) \right\}
$$
$W(C_k)$ represents the amount by which the observations within a cluster differ from each other. There are many possible ways to define this concept, but the most common choice involves the **squared Euclidean distance**, which is *sensitive to outliers* and works better with **Gaussian-distributed** features (see the sketch after the formula).
$$
W(C_k) = \frac{1}{| C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2
$$
Where:
- $|C_k|$: Denotes the number of observations in the $k$th cluster
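A short sketch of how this connects to what `kmeans()` reports (using the scaled `iris` numeric columns purely as a quick illustration): the `withinss` element is the sum of squared distances from each observation to its cluster mean, and the pairwise definition of $W(C_k)$ above works out to exactly twice that quantity.
```{r}
set.seed(123)
x_iris <- scale(iris[-5])
km_demo <- kmeans(x_iris, centers = 3, nstart = 20)
# squared distances from each observation to its own cluster mean
manual_wss <- sapply(1:3, function(k) {
  xk <- x_iris[km_demo$cluster == k, , drop = FALSE]
  sum(scale(xk, center = colMeans(xk), scale = FALSE)^2)
})
all.equal(km_demo$withinss, manual_wss)
```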
##### Distance alternatives
Some alternatives to the **Euclidean distance** that are more robust to outliers and non-normal distributions are:
- Manhattan distance
- Minkowski distance
- Gower distance
If we want to calculate the **similarity** for binary variables or categorical variables after applying one-hot encoding, we can use the **Jaccard Index** to calculate the distance.
$$
\begin{split}
J(A,B) & = \frac{\overbrace{|A \cap B|}^\text{elements common to A and B}}{\underbrace{|A \cup B|}_\text{all elements in A or B}} \\
\text{Distance} & = 1 - J(A,B)
\end{split}
$$
We can find the distance related to the *Jaccard Index* with the following function in `R`.
```r
dist(x, method = "binary")
```
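For instance, a tiny worked example with two hypothetical binary vectors: of the four items present in at least one of the two observations, two are shared, so the Jaccard index is $2/4$ and the distance is $0.5$.
```{r}
x_bin <- rbind(a = c(1, 1, 0, 0, 1),
               b = c(1, 0, 0, 1, 1))
# Jaccard index = 2 shared items / 4 items present in either observation
dist(x_bin, method = "binary")
```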
If you are analyzing unscaled data where observations may have large differences in magnitude but similar behavior, then a **correlation-based distance** (see the sketch after the figure) is preferred, such as:
- $1 - \text{Pearson correlation}$
- $1 - \text{Spearman correlation}$
- $1 - \text{Kendall Tau correlation}$
![](img/84-k-mean-clustering-scaling-problem.png){fig-align="center"}
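A minimal sketch, with a made-up matrix, of how a correlation-based distance treats observations with the same profile but different magnitudes as identical:
```{r}
x_cor <- rbind(a = c(1, 2, 3, 4),
               b = c(10, 20, 30, 40),   # same shape as `a`, ten times larger
               c = c(4, 3, 2, 1))       # reversed profile
# 1 - Pearson correlation between observations (rows)
as.dist(1 - cor(t(x_cor), method = "pearson"))
```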
#### Approximation algorithm
Solving this problem exactly would be very difficult, since there are almost $K^n$ ways to partition $n$ observations into $K$ clusters, but we can use a very simple algorithm to find a **local optimum**.
![](img/81-K-Means-Clustering-Algorithm.png){fig-align="center"}
![](img/82-K-Means-Clustering-Algorithm-Example.png){fig-align="center"}
Since the results obtained will depend on the **initial (random) cluster assignment** of each observation it is important to **run the algorithm multiple times** (10-20) from different random initial configurations *(random starts)*. Then one selects the solution with the **smallest objective**.
![](img/83-K-Means-Clustering-Algorithm-Example-Selection.png){fig-align="center"}
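A quick sketch of why multiple random starts matter (the scaled `iris` columns are used here as a small stand-in data set): with `nstart` greater than one, `kmeans()` keeps the solution with the smallest total within-cluster sum of squares.
```{r}
x_iris <- scale(iris[-5])
set.seed(42)
one_start   <- kmeans(x_iris, centers = 3, nstart = 1)
many_starts <- kmeans(x_iris, centers = 3, nstart = 20)
c(one_start = one_start$tot.withinss, many_starts = many_starts$tot.withinss)
```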
::: {.callout-tip}
##### Validation tip
To validate that we are modeling signal rather than noise the algorithm should produce **similar clusters in each iteration**.
:::
#### Coding example
To perform k-means clustering on mixed data we need to:
- Convert any ordinal categorical variables to numeric
- Convert nominal categorical variables to one-hot encode
- Scale all variables
```{r}
ames_scale <- AmesHousing::make_ames() |>
# select numeric columns
select_if(is.numeric) |>
# remove target column
select(-Sale_Price) |>
# coerce to double type
mutate_all(as.double) |>
# center & scale the resulting columns
scale()
```
- Compute the distances between the rows of a data matrix
```{r}
# Dissimilarity matrix
ames_dist <- dist(
ames_scale,
method = "euclidean"
)
```
- Perform k-means clustering after setting a seed
```{r}
# For reproducibility
set.seed(123)
ames_kmeans <- kmeans(
ames_dist,
# Number of groups
centers = 10,
  # Number of models to create;
  # then it selects the best one
nstart = 10,
# Max number of iterations for each model
iter.max = 10
)
```
- Measure the model's quality (*total within-cluster sum of squares*), which represents the **sum of all squared distances from each observation to its cluster center** in the model.
```{r}
ames_kmeans$tot.withinss |> comma()
```
### Hierarchical clustering
- It doesn't require defining the number of clusters in advance.
- It returns an attractive tree-based representation (**dendrogram**).
> It assumes that clusters are nested; when that isn't true, k-means clustering could yield better results.
#### Understanding dendrograms
In general we can say that:
- Each **leaf** represents an observation
- Groups of observations that fuse **lower** in the tree are more **similar** to each other
- Groups of observations that fuse near the **top** of the tree are more **different** from each other
- The **height of the cut** controls the **number of clusters** obtained.
![](img/85-hierarchical-clustering-clusters.png){fig-align="center"}
In the next example:
- {1,6} and {5,7} are close observations
![](img/86-hierarchical-clustering-simularity-example1.png){fig-align="center"}
- Observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7, as they all **fuse with observation 9 at the same height** of the tree.
![](img/87-hierarchical-clustering-9-similarity.png){fig-align="center"}
#### Hierarchical Clustering Types
![AGNES (bottom-up) versus DIANA (top-down) clustering](img/93-hierarchical-clustering-types.png){fig-align="center"}
##### Agglomerative Clustering (AGNES, Bottom-up)
1. Defining a **dissimilarity measure** between
each pair of observations, like **Euclidean distance** and **correlation-based distance**.
2. Defining each of the $n$ observations as a *cluster*.
3. **Fusing the most similar 2 clusters** and repeating the process until all the observations belong to **one single cluster**.
- Step 1
![](img/88-hierarchical-clustering-step1.png){fig-align="center"}
- Step 2
![](img/89-hierarchical-clustering-step2.png){fig-align="center"}
- Step 3
![](img/90-hierarchical-clustering-step3.png){fig-align="center"}
> It is good at identifying **small** clusters.
##### Divisive Clustering (DIANA, top-down)
1. Defining a **dissimilarity measure** between
each pair of observations, like **Euclidean distance** and **correlation-based distance**.
2. Defining the root, in which all observations are included in a single cluster.
3. The current cluster is split into two clusters that are considered most heterogeneous. The process is iterated until all observations are in their own cluster.
> It is good at identifying **large** clusters.
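Both variants are available in the `cluster` package; a brief sketch on a small, arbitrarily chosen subset of the scaled Ames data, where `agnes()` reports an agglomerative coefficient and `diana()` a divisive coefficient (values closer to 1 indicate stronger clustering structure):
```{r}
ames_small <- ames_scale[1:100, ]
agnes_fit <- agnes(ames_small, method = "ward")   # bottom-up (AGNES)
diana_fit <- diana(ames_small)                    # top-down (DIANA)
c(agglomerative_coef = agnes_fit$ac, divisive_coef = diana_fit$dc)
```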
#### Linkage
It measures the dissimilarity between two clusters and determines whether we will get a **balanced tree**, where each cluster is assigned a *similar number of observations*, or an **unbalanced tree**, which is useful for finding *outliers*.
- AGNES clustering
- **Complete (maximal intercluster dissimilarity)**: Record the **largest dissimilarity** between cluster $A$ and $B$. It tends to produce more **compact clusters** and **balanced trees**.
- **Ward’s minimum variance**: Minimizes the total within-cluster variance. At each step the pair of clusters with the smallest between-cluster distance are merged. Tends to **produce more compact clusters**.
- DIANA clustering
- **Average (mean intercluster dissimilarity)**: Record the **average dissimilarity** between clusters $A$ and $B$. It can **vary** in the compactness of the clusters it creates, but most of the time it produces **balanced trees**.
- **Single (minimal intercluster dissimilarity)**: Record the **smallest dissimilarity** between cluster $A$ and $B$. It tends to produce more **extended clusters** and **unbalanced trees**.
- **Centroid**: Computes the dissimilarity between the centroid for cluster $A$ (a mean vector of length $p$, one element for each variable) and the centroid for cluster $B$. It is often used in genomics, but *inversions* can lead to **difficulties** in visualizing and interpreting the dendrogram.
![](img/92-hierarchical-clustering-linkage-examples.png){fig-align="center"}
#### Coding example
Once we have our distance matrix we just need to define a *linkage method*. The default is the **complete** one.
```{r}
ames_hclust <- hclust(
ames_dist,
method = "ward.D"
)
```
##### Defining linkage method
To define the best `method` we can use the `coef.hclust` function from the `cluster` package to extract the **agglomeration coefficients** from the result of a hierarchical cluster analysis, which indicate the **cost of merging different clusters at each stage** of the clustering process. As a result, the higher this value is, the more dissimilar the clusters being merged are.
```{r}
cluster_methods <- c(
"average",
"single",
"complete",
"ward.D"
)
setattr(cluster_methods,
"names",
cluster_methods)
sapply(cluster_methods,
\(x) fastcluster::hclust(ames_dist, method = x) |>
coef.hclust() ) |>
sort(decreasing = TRUE)
```
As we were expecting, the best result was obtained with the "ward.D" linkage, as the default R `hclust` performs a bottom-up (agglomerative) algorithm.
##### Plotting Dendrogram
This dendrogram takes too long to plot, so the chunks below are not evaluated.
::: {.panel-tabset}
###### stats
```{r}
#| eval: false
plot(ames_hclust)
abline(h = 800, col = "red")
```
###### dendextend
```{r}
#| eval: false
as.dendrogram(ames_hclust) |>
color_branches(h = 800) |>
plot()
```
###### factoextra
```{r}
#| eval: false
fviz_dend(ames_hclust, k = 2)
```
![](img/94-ames-dendrogram.png){fig-align="center"}
:::
##### Getting clusters
To get the clusters we have 2 alternatives:
- Defining the number of clusters to export.
```{r}
cutree(ames_hclust, k = 2) |> table() |> prop.table()
```
- Defining the height where the tree should be cut, which returns groups where the **members of the created clusters** have a **Euclidean distance** amongst each other of *no more than our cut height*, if we are using **complete linkage**.
```{r}
cutree(ames_hclust, h = 400) |> table() |> prop.table()
```
### Partitioning around medoids (PAM)
It has the same algorithmic steps as k-means but uses **medoids** (actual observations) rather than the mean to determine each cluster center, making it more robust to outliers.
As your data becomes more sparse, the performance of k-means and hierarchical clustering becomes *slow* and *ineffective*. An alternative is to use the **Gower distance**, which applies a distance calculation that works well for each data type:
- **quantitative (interval)**: range-normalized Manhattan distance.
- **ordinal**: variable is first ranked, then Manhattan distance is used with a special adjustment for ties.
- **nominal**: variables with $k$ categories are first converted into $k$ binary columns (i.e., one-hot encoded) and then the **Dice coefficient** is used. To compute the dice metric for two observations $(X,Y)$ the algorithm looks across all one-hot encoded categorical variables and scores them as:
- **a** — number of dummies 1 for both observations
- **b** — number of dummies 1 for $X$ and 0 for $Y$
- **c** — number of dummies 0 for $X$ and 1 for $Y$
- **d** — number of dummies 0 for both
and then uses the following formula:
$$
D = \frac{2a}{2a + b +c}
$$
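A tiny worked example of that scoring, with two hypothetical one-hot encoded observations:
```{r}
X <- c(1, 0, 1, 1, 0)
Y <- c(1, 1, 0, 1, 0)
a_both   <- sum(X == 1 & Y == 1)  # dummies that are 1 for both observations
b_only_x <- sum(X == 1 & Y == 0)  # 1 for X and 0 for Y
c_only_y <- sum(X == 0 & Y == 1)  # 0 for X and 1 for Y
2 * a_both / (2 * a_both + b_only_x + c_only_y)  # Dice coefficient
```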
```{r}
# Original data minus Sale_Price
ames_full <-
AmesHousing::make_ames() |>
subset(select = -Sale_Price)
# Compute Gower distance for original data
gower_dst <- daisy(ames_full, metric = "gower")
# You can supply the Gower distance matrix to several clustering algos
pam_gower <- pam(x = gower_dst, k = 8, diss = TRUE)
```
### Clustering large applications (CLARA)
It uses the same algorithmic process as PAM; however, instead of finding the medoids for the entire data set, it considers a small sample and applies k-means or PAM to it.
```{r}
clara(my_basket, k = 10)
```
### Selecting number of clusters
As $k$ increases, so does the **homogeneity** between observations in each cluster, but so does the risk of overfitting the model. It is important to know that **there is not a definitive answer to this question**. Some possibilities are:
- Defining $k$ based on **domain knowledge** or resource **limitations**. For example, you may want to divide customers into 4 groups because you only have 4 employees to execute the plan.
- **Optimizing a criterion**
- The ***elbow method*** tries to minimize the total intra-cluster variation (within-cluster sum of squares, WSS) by selecting the number of clusters so that ***adding another cluster doesn't improve the total WSS much*** (see [Determining the Optimal Number of Clusters](https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/)).
::: {.panel-tabset group="algorithm"}
#### HC
```{r}
#| eval: false
fviz_nbclust(ames_scale,
FUNcluster = hcut,
method = "wss",
hc_method = "ward.D") +
labs(subtitle = "Elbow method")
```
![](img/103-ames-hcut-Elbow.png){fig-align="center" width=90% height=90%}
#### K-means
```{r}
#| eval: false
fviz_nbclust(ames_scale,
FUNcluster = kmeans,
method = "wss") +
labs(subtitle = "Elbow method")
```
![](img/104-ames-kmeans-Elbow.png){fig-align="center" width=90% height=90%}
:::
- The ***silhouette method*** allows you to calculate how similar each observation is to the cluster it is assigned to, relative to the other clusters. The optimal number of clusters $k$ is the one that ***maximizes the average silhouette*** over a range of possible values of $k$ (a short sketch follows this list).
  - Values close to **1** suggest that the observation is well matched to the assigned cluster.
  - Values close to **0** suggest that the observation is borderline matched between two clusters.
  - Values close to **-1** suggest that the observation may be assigned to the wrong cluster.
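As a quick sketch, the silhouette widths for a single clustering solution can be computed directly with `cluster::silhouette()` and the distance matrix we already built (the choice of two clusters here is arbitrary):
```{r}
set.seed(123)
km_ames <- kmeans(ames_scale, centers = 2, nstart = 10)
sil <- silhouette(km_ames$cluster, ames_dist)
summary(sil)$avg.width   # average silhouette width for this solution
```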
::: {.panel-tabset group="algorithm"}
#### HC
```{r}
#| eval: false
fviz_nbclust(ames_scale,
FUNcluster = hcut,
method = "silhouette",
hc_method = "ward.D") +
labs(subtitle = "Silhouette method")
```
![](img/105-ames-hcut-silhouette.png){fig-align="center" width=90% height=90%}
#### K-means
```{r}
#| eval: false
fviz_nbclust(ames_scale,
FUNcluster = kmeans,
method = "silhouette") +
labs(subtitle = "Silhouette method")
```
![](img/106-ames-kmeans-silhouette.png){fig-align="center" width=90% height=90%}
:::
- **Gap statistic**: Compares the total intra-cluster variation for different values of $k$ with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters will be the value that maximizes the gap statistic, so that the clustering structure is as **far away from a random uniform distribution of points** as possible (see the sketch after the formula).
$$
\text{Gap}(k) = \frac{1}{B} \sum_{b=1}^B \log(W^*_{kb}) - \log(W_k)
$$
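The same statistic can be computed directly with `cluster::clusGap()`; a brief sketch on a small built-in data set to keep the bootstrap quick, where `B` is the number of reference data sets:
```{r}
set.seed(123)
gap_demo <- clusGap(scale(iris[-5]),
                    FUNcluster = kmeans, nstart = 20,
                    K.max = 8, B = 50)
head(gap_demo$Tab)   # columns: logW, E.logW, gap, SE.sim
```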
::: {.panel-tabset group="algorithm"}
#### HC
```{r}
#| eval: false
set.seed(123)
fviz_nbclust(ames_scale,
FUNcluster = hcut,
method = "gap_stat",
hc_method = "ward.D",
nboot = 50,
verbose = FALSE) +
labs(subtitle = "Gap statistic method")
```
![](img/107-ames-hcut-gap.png){fig-align="center" width=90% height=90%}
#### K-means
```{r}
#| eval: false
set.seed(123)
fviz_nbclust(ames_scale,
FUNcluster = kmeans,
method = "gap_stat",
nboot = 50,
nstart = 25,
verbose = FALSE)+
labs(subtitle = "Gap statistic method")
```
![](img/108-ames-kmeans-gap.png){fig-align="center" width=90% height=90%}
:::
- As there are more than 30 indices and methods, we can compute all of them (e.g., with the `NbClust` package) in order to decide the best number of clusters using the **majority rule**.
## Exploratory Factor Analysis
When running an EFA we get two important results:
- **Variables' factor loadings**: Quantify the relationship (correlation) between each factor and each feature.
- **Observations' factor scores**: Estimate how much of each factor each observation presents.
### Coding Example
1. Loading libraries to use
```{r}
library(ade4)
library(GPArotation)
library(psych)
data(olympic)
OlympicDt <- as.data.table(olympic$tab)
extract_fa_summary <- function(x){
as.data.table(
    # The [] is a really good trick
    # to transform the object class from loadings to matrix
x$loadings[],
keep.rownames = "col_names"
)[, `:=`(complexity = x$complexity,
communalities = x$communalities,
uniquenesses = x$uniquenesses)][]
}
```
2. Run the default EFA
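A minimal sketch of this step, assuming the `psych::fa()` defaults (one factor, minimum-residual extraction):
```{r}
# assumption: the "default EFA" here is psych::fa() with its default arguments
olympic_efa <- fa(OlympicDt)
olympic_efa$loadings
```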