---
format:
html:
number-depth: 3
css: summary-format.css
---
# Unsupervised Learning
Unsupervised learning refers to a set of statistical tools intended for the setting in which we have only a set of features $X_1, X_2, \dots, X_p$ measured on $n$ observations. The goal is to discover interesting things about the measurements, such as:
- Finding an informative way to visualize and explore the data
- Imputing missing values
- Discovering subgroups among the variables or among the observations
As a result, the exercise tends to be **more subjective**.
## Libraries to use
```{r}
#| output: false
# Data manipulation
library(data.table)
library(recipes)
# Data visualization
library(ggplot2)
theme_set(theme_light())
# Model creation
library(h2o)
library(cluster)
library(tidymodels)
library(tidyclust)
# Extracting model information
library(broom)
library(factoextra)
library(dendextend)
library(NbClust)
```
## Principal Components Analysis (PCA)
### Purpose
As the **number of variables increases**, checking *two-dimensional scatterplots* **gets less insightful**, since each plot contains just a small fraction of the total information present in the data set.
For example, if we see correlations between features, it is easy to create more general categories, known as **latent variables**, as follows:
- Sandwich
- cheese - mayonnaise
- mayonnaise - bread
- bread - cheese
- bread - lettuce
- Soda
- pepsi - coke
- 7up - coke
- Vegetables
- spinach - broccoli
- peas - potatoes
- peas - carrots
At the end this process can help to:
- Reduce the number of features needed to describe the data
- Remove multicollinearity between features
### Mathematical Description
PCA finds a **low-dimensional representation** of a data set that **contains as much of the variation (information)** as possible. Each new dimension is a **linear combination** of the $p$ original variables, and the resulting values are known as **principal component scores**.
$$
Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p
$$
Where:
- $\phi_1 = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})^T$ represents the loading vector for the first principal component.
- $\sum_{j=1}^p \phi_{j1}^2 = 1$
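As a quick sanity check of this definition, the sketch below (using the built-in `USArrests` data, chosen only for illustration) confirms that the first score vector is just the standardized data multiplied by the first loading vector, and that the loadings are normalized to a sum of squares of one.
```{r}
# illustration on USArrests: 4 numeric columns, centered and scaled
x_demo <- scale(as.matrix(USArrests))
pca_demo <- prcomp(x_demo, center = FALSE, scale. = FALSE)
# Z_1 is the linear combination of the columns with weights phi_1
z1_manual <- x_demo %*% pca_demo$rotation[, 1]
all.equal(unname(z1_manual[, 1]), unname(pca_demo$x[, 1]))
# the loading vector is normalized
sum(pca_demo$rotation[, 1]^2)
```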
### Steps to apply
To perform a *principal components analysis* (PCA) we need to:
(@) Make any needed transformation to *tidy the data*.
(@) Remove or impute any *missing value*.
(@) Transform all variables to be numeric, using methods like *one-hot encoding*.
(@) **Center** and **scale** all the variables in $\mathbf{X}$ to have mean zero and standard deviation one. This step is important to ensure all variables are on the same scale, particularly if they were measured in different units or have outliers. To see why this step matters, let's calculate how different two people are based on their height and weight (a short code sketch follows the figures below).
![](img/95-scaling-importance-original.png){fig-align="center" width=90% height=90%}
In both cases we get the same distance, which doesn't make sense, so let's standardize the values with the following transformation.
$$
\text{height}_\text{scaled} = \frac{\text{height} - mean(\text{height})}{sd(\text{height})}
$$
![](img/96-scaling-importance-scaled.png){fig-align="center" width=90% height=90%}
> Returning a more intuitive result.
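A minimal sketch of that standardization in R, with made-up heights and weights (note the units differ on purpose): before scaling, the distance is driven almost entirely by the variable with the larger magnitude.
```{r}
# hypothetical people: height in meters, weight in kilograms
people <- data.frame(height_m  = c(1.60, 1.75, 1.90),
                     weight_kg = c(60, 75, 95))
dist(people)          # dominated by weight, which has the larger scale
dist(scale(people))   # after standardizing, both variables contribute comparably
```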
(@) The **loading vector** is determined by solving the following optimization problem using *eigen decomposition*. This identifies the **direction with the largest variance** in the feature space and reveals the **relative contributions of the original features** to the new PCs.
$$
\underset{\phi_{11}, \dots, \phi_{p1}}{\text{maximize}}
\left\{ \frac{1}{n} \sum_{i = 1}^n
\left( \sum_{j=1}^p \phi_{j1} x_{ij} \right)^2 \right\}
\text{ subject to }
\sum_{j=1}^p \phi_{j1}^2 = 1 .
$$
![](img/29-pca-data-example.png){fig-align="center" width=90% height=90%}
(@) Repeat the process until we have $\min(n-1, p)$ distinct principal components. Each new component must be orthogonal to all previously computed principal components, which ensures that it captures a **new direction of variance** and is **uncorrelated** with the previous ones, as the short check below confirms.
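A quick check of that orthogonality (again on the built-in `USArrests` data, purely for illustration): the loading vectors returned by `prcomp()` are orthonormal and the resulting scores are uncorrelated.
```{r}
pca_check <- prcomp(USArrests, scale. = TRUE)
# loading vectors are orthonormal: t(rotation) %*% rotation is the identity
zapsmall(crossprod(pca_check$rotation))
# and the principal component scores are uncorrelated
zapsmall(cor(pca_check$x))
```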
### R implementation
1. Getting the data
```{r}
my_basket <- fread("https://koalaverse.github.io/homlr/data/my_basket.csv")
```
2. Apply the PCA function
::: {.panel-tabset group="library"}
#### stats
```{r}
stats_pca <- prcomp(
# Numeric data.frame o matrix
my_basket,
  # Set standard deviation to 1 before applying PCA;
  # this is really useful when the sd is very different between columns
  scale = TRUE,
# Set mean to 0 before applying PCA
# it's better to keep it TRUE
center = TRUE
)
```
After performing a PCA we can inspect the following elements:
```{r}
names(stats_pca)
```
- `center`: the column means used to center the data, or FALSE if the data weren't centered.
- `scale`: the column standard deviations used to scale the data, or FALSE if the data weren't scaled.
- `rotation`: the directions of the principal component vectors in terms of the original features/variables. This information allows you to define new data in terms of the original principal components.
- `x`: the value of each observation in the original dataset projected to the principal components.
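For example, `rotation`, `center`, and `scale` are what `predict()` uses to express new observations in terms of the fitted components; a small sketch, reusing the first three rows of `my_basket` as stand-in "new" data:
```{r}
# project "new" observations onto the fitted principal components
new_scores <- predict(stats_pca, newdata = my_basket[1:3])
# equivalently: standardize with the stored center/scale, then multiply by rotation
manual_scores <- scale(as.matrix(my_basket[1:3]),
                       center = stats_pca$center,
                       scale  = stats_pca$scale) %*% stats_pca$rotation
all.equal(unname(new_scores), unname(manual_scores))
```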
#### h2o
*Java is a prerequisite for H2O*.
``` {r}
# turn off progress bars for brevity
h2o.no_progress()
# connect to H2O instance with 5 gigabytes
h2o.init(max_mem_size = "5g")
# convert data to h2o object
my_basket.h2o <- as.h2o(my_basket)
```
- `pca_method`: When your data contains mostly numeric variables use **“GramSVD”**, but if the data contain many categorical variables (or just a few categorical variables with high cardinality) we recommend using **“GLRM”**.
- `k`: Integer specifying how many PCs to compute. Use `ncol(data)` to keep them all.
- `transform`: Character string specifying how (if at all) your data should be standardized.
- `impute_missing`: Logical specifying whether or not to impute missing values with the **corresponding column mean**.
- `max_runtime_secs`: Number specifying the max run time (in seconds) to limit the runtime for model training.
``` {r}
h2o_pca <- h2o.prcomp(
training_frame = my_basket.h2o,
pca_method = "GramSVD",
k = ncol(my_basket.h2o),
transform = "STANDARDIZE",
impute_missing = TRUE,
max_runtime_secs = 1000
)
h2o.shutdown(prompt = FALSE)
```
#### recipes
```{r}
pca_rec <- recipe(~., data = my_basket) |>
step_normalize(all_numeric()) |>
step_pca(all_numeric(), id = "pca") |>
prep()
```
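The prepped recipe stores the scores for the training data, which we can pull back out with `bake()`; here we just peek at the first rows:
```{r}
# extract the PC scores computed for the training data
bake(pca_rec, new_data = NULL) |>
  head(3)
```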
:::
### Extract the variance explained by each component
::: {.panel-tabset group="library"}
#### stats
```{r}
stats_pca_variance <-
tidy(stats_pca,
matrix = "eigenvalues")
setDT(stats_pca_variance)
stats_pca_variance[1:5]
```
#### h2o
``` {r}
h2o_pca@model$importance |>
t() |>
as.data.table(keep.rownames = "component") |>
head(5L)
```
#### recipes
```{r}
tidy(pca_rec,
id = "pca",
type = "variance") |>
filter(component <= 5) |>
pivot_wider(id_cols = component,
names_from = terms,
values_from = value)
```
:::
### Select the components to explore
#### Variance criterion
As the data were standardized, the **sum of the variance (eigenvalues) of all the components is equal to the number of variables** entered into the PCA.
```{r}
stats_pca_variance[, .(total_variance = sum(std.dev^2))]
```
A variance of 1 means that the principal component would explain about one variable’s worth of the variability. In that sense, we would only be interested in selecting **components with variance 1 or greater**.
```{r}
stats_pca_variance[std.dev >= 1]
```
#### Scree plot criterion
The scree plot criterion looks for the “elbow” in the curve and **selects all components just before the line flattens out**, which looks like **8** in our example.
::: {.panel-tabset group="visualization"}
##### ggplot2
```{r}
stats_pca_variance |>
ggplot(aes(PC, percent, group = 1, label = PC)) +
geom_point() +
geom_line() +
geom_text(nudge_y = -.002,
check_overlap = TRUE)
```
##### factoextra
```{r}
fviz_eig(stats_pca,
geom = "line",
ncp = 30,
ggtheme = theme_light())+
geom_text(aes(label = dim),
nudge_y = -.2,
check_overlap = TRUE)
```
:::
#### Proportion of variance explained criterion
Depending on the use case, the investigator might want to explain a particular proportion of variability. For example, to explain at least 75% of total variability we need to select the first 27 components.
```{r}
stats_pca_variance[cumulative >= 0.75][1L]
```
#### Conclusion
The frank answer is that **there is no one best method for determining how many components to use**. If we were merely trying to profile customers, we would probably use 8 or 10; if we were performing dimension reduction to feed into a downstream predictive model, we would likely retain 26 or more based on cross-validation.
### Interpret the results
- PC1 can be interpreted as the **Unhealthy Lifestyle** component, as the higher weights are associated with less healthy behaviors, such as alcohol (bulmers, red.wine, fosters, kronenbourg), sweets (mars, twix, kitkat), tobacco (cigarettes), and potentially gambling (lottery).
::: {.panel-tabset group="visualization"}
#### ggplot2
```{r }
stats_pca_loadings <-
tidy(stats_pca, matrix = "loadings")
setDT(stats_pca_loadings)
stats_pca_loadings[PC == 1
][order(-value)
][, rbind(.SD[1:10],
.SD[(.N-9L):.N])] |>
ggplot(aes(value, reorder(column, value))) +
geom_point()+
geom_vline(xintercept = 0,
linetype = 2,
linewidth = 1)+
labs(y = "Original Columns",
x = "Unhealthy / Entertainment Lifestyle")
```
#### factoextra
```{r}
fviz_contrib(stats_pca,
choice = "var",
axes = 1,
sort.val = "asc",
top = 25)+
coord_flip()
```
:::
- PC2 can be interpreted as a **Dine-in Food Choices** component, as the associated items are typically part of a main meal that one might consume for lunch or dinner.
::: {.panel-tabset group="visualization"}
#### ggplot2
```{r}
stats_pca_loadings[PC == 2
][order(-value)
][, rbind(.SD[1:10],
.SD[(.N-9L):.N])] |>
ggplot(aes(value, reorder(column, value))) +
geom_point()+
geom_vline(xintercept = 0,
linetype = 2,
linewidth = 1)+
labs(y = "Original Columns",
x = "Dine-in Food Choices")
```
#### factoextra
```{r}
fviz_contrib(stats_pca,
choice = "var",
axes = 2,
sort.val = "asc",
top = 25)+
coord_flip()
```
:::
- Find correlated features. As this can be hard when we have many features, let's use the `iris` data set.
```{r}
iris_pca <- prcomp(iris[-5])
```
::: {.panel-tabset group="visualization"}
#### ggplot2
```{r}
iris_pca$rotation |>
as.data.table(keep.rownames = "column") |>
ggplot(aes(`PC1`, `PC2`, label = column)) +
geom_vline(xintercept = 0,
linetype = 2,
linewidth = 0.3)+
geom_hline(yintercept = 0,
linetype = 2,
linewidth = 0.3)+
geom_text()+
geom_segment(aes(xend = `PC1`,
yend = `PC2`),
x = 0,
y = 0,
arrow = arrow(length = unit(0.15, "inches")))+
labs(title = "Component Loadings",
x = "PC 1",
y = "PC 2")
```
#### factoextra
```{r}
fviz_pca_var(iris_pca,
check_overlap = TRUE)+
labs(title = "Component Loadings",
x = "PC 1",
y = "PC 2")
```
:::
- Find out how much each feature contributes to the new components.
::: {.panel-tabset group="visualization"}
#### ggplot2
```{r}
stats_pca_loadings[PC <= 2L
][, contrib := sqrt(sum(value^2)),
by = "column"] |>
dcast(column + contrib ~ PC,
value.var = "value") |>
ggplot(aes(`1`, `2`, label = column)) +
geom_text(aes(alpha = contrib),
color = "navyblue",
check_overlap = TRUE)+
geom_vline(xintercept = 0,
linetype = 2,
linewidth = 0.3)+
geom_hline(yintercept = 0,
linetype = 2,
linewidth = 0.3)+
labs(title = "Component Loadings",
x = paste0("Unhealthy Lifestyle (",
scales::percent(stats_pca_variance$percent[1L],
accuracy = 0.01),")"),
y = paste0("Dine-in Food Choices (",
scales::percent(stats_pca_variance$percent[2L],
accuracy = 0.01),")"))+
theme(legend.position = "none")
```
#### factoextra
```{r}
fviz_pca_var(stats_pca,
geom = "text",
col.var = "contrib",
check_overlap = TRUE)+
scale_color_gradient(low = NA, high = "navyblue")+
coord_cartesian(xlim = c(-0.5,0.5),
ylim = c(-0.5,0.5))+
theme(legend.position = "none")
```
:::
## Clustering
It refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. Some applications could be to:
- Find a few unknown **subtypes of breast cancer**.
- Perform **market segmentation** by identify **subgroups of people** who might be more likely to purchase a particular product.
### K-means clustering
In *K-means clustering*, we seek to partition the observations into a **pre-specified number of non-overlapping clusters** $K$.
![](img/80-k-mean-clustering-example.png){fig-align="center"}
For this method, the main goal is to classify observations within clusters with **high intra-class similarity** *(low within-cluster variation)*, but with **low inter-class similarity**.
#### Mathematical Description
Let $C_1, \dots, C_K$ denote sets containing the **indices of the observations** in each cluster, where:
- Each observation belongs to at least one of the $K$ clusters. $C_1 \cup C_2 \cup \dots \cup C_K = \{1, \dots,n\}$
- No observation belongs to more than one cluster. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$.
$$
\underset{C_1, \dots, C_K}{\text{minimize}}
\left\{ \sum_{k=1}^K W(C_k) \right\}
$$
$W(C_k)$ represents the amount by which the observations within a cluster differ from each other. There are many possible ways to define this concept, but the most common choice involves the **squared Euclidean distance**, which is *sensitive to outliers* and works better with **Gaussian-distributed** features (see the sketch after the formula).
$$
W(C_k) = \frac{1}{| C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2
$$
Where:
- $|C_k|$: Denotes the number of observations in the $k$th cluster
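A short sketch of how this connects to what `kmeans()` reports (using the scaled `iris` numeric columns purely as a quick illustration): the `withinss` element is the sum of squared distances from each observation to its cluster mean, and the pairwise definition of $W(C_k)$ above works out to exactly twice that quantity.
```{r}
set.seed(123)
x_iris <- scale(iris[-5])
km_demo <- kmeans(x_iris, centers = 3, nstart = 20)
# squared distances from each observation to its own cluster mean
manual_wss <- sapply(1:3, function(k) {
  xk <- x_iris[km_demo$cluster == k, , drop = FALSE]
  sum(scale(xk, center = colMeans(xk), scale = FALSE)^2)
})
all.equal(km_demo$withinss, manual_wss)
```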
##### Distance alternatives
Some alternatives to the **Euclidean distance** that are more robust to outliers and non-normal distributions are:
- Manhattan distance
- Minkowski distance
- Gower distance
If we want to calculate the **similarity** for binary variables or categorical variables after applying one-hot encoding, we can use the **Jaccard Index** to calculate the distance.
$$
\begin{split}
J(A,B) & = \frac{\overbrace{|A \cap B|}^\text{elements common to A and B}}{\underbrace{|A \cup B|}_\text{all elements in A or B}} \\
\text{Distance} & = 1 - J(A,B)
\end{split}
$$
We can find the distance related to the *Jaccard Index* with the following function in `R`.
```r
dist(x, method = "binary")
```
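For instance, a tiny worked example with two hypothetical binary vectors: of the four items present in at least one of the two observations, two are shared, so the Jaccard index is $2/4$ and the distance is $0.5$.
```{r}
x_bin <- rbind(a = c(1, 1, 0, 0, 1),
               b = c(1, 0, 0, 1, 1))
# Jaccard index = 2 shared items / 4 items present in either observation
dist(x_bin, method = "binary")
```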
If you are analyzing unscaled data where observations may have large differences in magnitude but similar behavior, then a **correlation-based distance** (see the sketch after the figure) is preferred, such as:
- $1 - \text{Pearson correlation}$
- $1 - \text{Spearman correlation}$
- $1 - \text{Kendall Tau correlation}$
![](img/84-k-mean-clustering-scaling-problem.png){fig-align="center"}
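A minimal sketch, with a made-up matrix, of how a correlation-based distance treats observations with the same profile but different magnitudes as identical:
```{r}
x_cor <- rbind(a = c(1, 2, 3, 4),
               b = c(10, 20, 30, 40),   # same shape as `a`, ten times larger
               c = c(4, 3, 2, 1))       # reversed profile
# 1 - Pearson correlation between observations (rows)
as.dist(1 - cor(t(x_cor), method = "pearson"))
```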
#### Approximation algorithm
Solving this problem exactly would be very difficult, since there are almost $K^n$ ways to partition $n$ observations into $K$ clusters, but we can use a very simple algorithm to find a **local optimum**.
![](img/81-K-Means-Clustering-Algorithm.png){fig-align="center"}
![](img/82-K-Means-Clustering-Algorithm-Example.png){fig-align="center"}
Since the results obtained will depend on the **initial (random) cluster assignment** of each observation it is important to **run the algorithm multiple times** (10-20) from different random initial configurations *(random starts)*. Then one selects the solution with the **smallest objective**.
![](img/83-K-Means-Clustering-Algorithm-Example-Selection.png){fig-align="center"}
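A quick sketch of why multiple random starts matter (the scaled `iris` columns are used here as a small stand-in data set): with `nstart` greater than one, `kmeans()` keeps the solution with the smallest total within-cluster sum of squares.
```{r}
x_iris <- scale(iris[-5])
set.seed(42)
one_start   <- kmeans(x_iris, centers = 3, nstart = 1)
many_starts <- kmeans(x_iris, centers = 3, nstart = 20)
c(one_start = one_start$tot.withinss, many_starts = many_starts$tot.withinss)
```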
::: {.callout-tip}
##### Validation tip
To validate that we are modeling signal rather than noise the algorithm should produce **similar clusters in each iteration**.
:::
#### Coding example
To perform k-means clustering on mixed data we need to:
- Convert any ordinal categorical variables to numeric
- Convert nominal categorical variables to one-hot encode
- Scale all variables
```{r}
ames_scale <- AmesHousing::make_ames() |>
# select numeric columns
select_if(is.numeric) |>
# remove target column
select(-Sale_Price) |>
# coerce to double type
mutate_all(as.double) |>
# center & scale the resulting columns
scale()
```
- Compute the distances between the rows of a data matrix
```{r}
# Dissimilarity matrix
ames_dist <- dist(
ames_scale,
method = "euclidean"
)
```
- Perform k-means clustering after setting a seed
```{r}
# For reproducibility
set.seed(123)
ames_kmeans <- kmeans(
ames_dist,
# Number of groups
centers = 10,
  # Number of models to create;
  # then it selects the best one
nstart = 10,
# Max number of iterations for each model
iter.max = 10
)
```
- Measure the model's quality (*total within-cluster sum of squares*), which represents the **sum of all squared distances from each observation to its cluster center** in the model.
```{r}
ames_kmeans$tot.withinss |> comma()
```
### Hierarchical clustering
- It doesn't require defining the number of clusters in advance.
- It returns an attractive tree-based representation (**dendrogram**).
> It assumes that clusters are nested; when that isn't true, k-means clustering could yield better results.
#### Understanding dendrograms
In general we can say that:
- Each **leaf** represents an observation
- Groups of observations that fuse **lower** in the tree are more **similar** to each other
- Groups of observations that fuse near the **top** of the tree are more **different** from each other
- The **height of the cut** controls the **number of clusters** obtained.
![](img/85-hierarchical-clustering-clusters.png){fig-align="center"}
In the next example:
- {1,6} and {5,7} are close observations
![](img/86-hierarchical-clustering-simularity-example1.png){fig-align="center"}
- Observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7, as they all **fuse with observation 9 at the same height** of the tree.
![](img/87-hierarchical-clustering-9-similarity.png){fig-align="center"}
#### Hierarchical Clustering Types
![AGNES (bottom-up) versus DIANA (top-down) clustering](img/93-hierarchical-clustering-types.png){fig-align="center"}
##### Agglomerative Clustering (AGNES, Bottom-up)
1. Defining a **dissimilarity measure** between
each pair of observations, like **Euclidean distance** and **correlation-based distance**.
2. Defining each of the $n$ observations as a *cluster*.
3. **Fusing the most similar 2 clusters** and repeating the process until all the observations belong to **one single cluster**.
- Step 1
![](img/88-hierarchical-clustering-step1.png){fig-align="center"}
- Step 2
![](img/89-hierarchical-clustering-step2.png){fig-align="center"}
- Step 3
![](img/90-hierarchical-clustering-step3.png){fig-align="center"}
> It is good at identifying **small** clusters.
##### Divisive Clustering (DIANA, top-down)
1. Defining a **dissimilarity measure** between
each pair of observations, like **Euclidean distance** and **correlation-based distance**.
2. Defining the root, in which all observations are included in a single cluster.
3. The current cluster is split into two clusters that are considered most heterogeneous. The process is iterated until all observations are in their own cluster.
> It is good at identifying **large** clusters.
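Both variants are available in the `cluster` package; a brief sketch on a small, arbitrarily chosen subset of the scaled Ames data, where `agnes()` reports an agglomerative coefficient and `diana()` a divisive coefficient (values closer to 1 indicate stronger clustering structure):
```{r}
ames_small <- ames_scale[1:100, ]
agnes_fit <- agnes(ames_small, method = "ward")   # bottom-up (AGNES)
diana_fit <- diana(ames_small)                    # top-down (DIANA)
c(agglomerative_coef = agnes_fit$ac, divisive_coef = diana_fit$dc)
```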
#### Linkage
It measures the dissimilarity between two clusters and determines whether we will get a **balanced tree**, where each cluster is assigned a *similar number of observations*, or an **unbalanced tree**, which is useful for finding *outliers*.
- AGNES clustering
- **Complete (maximal intercluster dissimilarity)**: Record the **largest dissimilarity** between cluster $A$ and $B$. It tends to produce more **compact clusters** and **balanced trees**.
- **Ward’s minimum variance**: Minimizes the total within-cluster variance. At each step the pair of clusters with the smallest between-cluster distance are merged. Tends to **produce more compact clusters**.
- DIANA clustering
- **Average (mean intercluster dissimilarity)**: Record the **average dissimilarity** between clusters $A$ and $B$. It can **vary** in the compactness of the clusters it creates, but most of the time it produces **balanced trees**.
- **Single (minimal intercluster dissimilarity)**: Record the **smallest dissimilarity** between cluster $A$ and $B$. It tends to produce more **extended clusters** and **unbalanced trees**.
- **Centroid**: Computes the dissimilarity between the centroid for cluster $A$ (a mean vector of length $p$, one element for each variable) and the centroid for cluster $B$. It is often used in genomics, but *inversions* can lead to **difficulties** in visualizing and interpreting the dendrogram.
![](img/92-hierarchical-clustering-linkage-examples.png){fig-align="center"}
#### Coding example
Once we have our distance matrix we just need to define a *linkage method*. The default is the **complete** one.
```{r}
ames_hclust <- hclust(
ames_dist,
method = "ward.D"
)
```
##### Defining linkage method
To define the best `method` we can use the `coef.hclust` function from the `cluster` package to extract the **agglomeration coefficients** from the result of a hierarchical cluster analysis, which indicate the **cost of merging different clusters at each stage** of the clustering process. As a result, the higher this value is, the more dissimilar the clusters being merged are.
```{r}
cluster_methods <- c(
"average",
"single",
"complete",
"ward.D"
)
setattr(cluster_methods,
"names",
cluster_methods)
sapply(cluster_methods,
\(x) fastcluster::hclust(ames_dist, method = x) |>
coef.hclust() ) |>
sort(decreasing = TRUE)
```
As we were expecting, the best result was obtained with the "ward.D" linkage, as the default R `hclust` performs a bottom-up (agglomerative) algorithm.
##### Plotting Dendrogram
This dendrogram takes too long to plot, so the chunks below are not evaluated.
::: {.panel-tabset}
###### stats
```{r}
#| eval: false
plot(ames_hclust)
abline(h = 800, col = "red")
```
###### dendextend
```{r}
#| eval: false
as.dendrogram(ames_hclust) |>
color_branches(h = 800) |>
plot()
```
###### factoextra
```{r}
#| eval: false
fviz_dend(ames_hclust, k = 2)
```
![](img/94-ames-dendrogram.png){fig-align="center"}
:::
##### Getting clusters
To get the clusters we have 2 alternatives:
- Defining the number of clusters to export.
```{r}
cutree(ames_hclust, k = 2) |> table() |> prop.table()
```
- Defining the height where the tree should be cut, which returns groups where the **members of the created clusters** have a **Euclidean distance** amongst each other of *no more than our cut height*, if we are using **complete linkage**.
```{r}
cutree(ames_hclust, h = 400) |> table() |> prop.table()
```
### Partitioning around medoids (PAM)
It has the same algorithmic steps as k-means but uses **medoids** (actual observations) rather than the mean to determine each cluster center, making it more robust to outliers.
As your data becomes more sparse, the performance of k-means and hierarchical clustering becomes *slow* and *ineffective*. An alternative is to use the **Gower distance**, which applies a distance calculation that works well for each data type:
- **quantitative (interval)**: range-normalized Manhattan distance.
- **ordinal**: variable is first ranked, then Manhattan distance is used with a special adjustment for ties.
- **nominal**: variables with $k$ categories are first converted into $k$ binary columns (i.e., one-hot encoded) and then the **Dice coefficient** is used. To compute the dice metric for two observations $(X,Y)$ the algorithm looks across all one-hot encoded categorical variables and scores them as:
- **a** — number of dummies 1 for both observations
- **b** — number of dummies 1 for $X$ and 0 for $Y$
- **c** — number of dummies 0 for $X$ and 1 for $Y$
- **d** — number of dummies 0 for both
and then uses the following formula:
$$
D = \frac{2a}{2a + b +c}
$$
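A tiny worked example of that scoring, with two hypothetical one-hot encoded observations:
```{r}
X <- c(1, 0, 1, 1, 0)
Y <- c(1, 1, 0, 1, 0)
a_both   <- sum(X == 1 & Y == 1)  # dummies that are 1 for both observations
b_only_x <- sum(X == 1 & Y == 0)  # 1 for X and 0 for Y
c_only_y <- sum(X == 0 & Y == 1)  # 0 for X and 1 for Y
2 * a_both / (2 * a_both + b_only_x + c_only_y)  # Dice coefficient
```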
```{r}
# Original data minus Sale_Price
ames_full <-
AmesHousing::make_ames() |>
subset(select = -Sale_Price)
# Compute Gower distance for original data
gower_dst <- daisy(ames_full, metric = "gower")
# You can supply the Gower distance matrix to several clustering algos
pam_gower <- pam(x = gower_dst, k = 8, diss = TRUE)
```
### Clustering large applications (CLARA)
It uses the same algorithmic process as PAM; however, instead of finding the medoids for the entire data set, it considers a small sample and applies k-means or PAM to it.
```{r}
clara(my_basket, k = 10)
```
### Selecting number of clusters
As $k$ increases, so does the **homogeneity** between observations in each cluster, but so does the risk of overfitting the model. It is important to know that **there is not a definitive answer to this question**. Some possibilities are:
- Defining $k$ based on **domain knowledge** or resource **limitations**. For example, you may want to divide customers into 4 groups because you only have 4 employees to execute the plan.
- **Optimizing a criterion**
- The ***elbow method*** tries to minimize the total intra-cluster variation (within-cluster sum of squares, WSS) by selecting the number of clusters so that ***adding another cluster doesn't improve the total WSS much*** (see [Determining the Optimal Number of Clusters](https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/)).
::: {.panel-tabset group="algorithm"}
#### HC
```{r}
#| eval: false
fviz_nbclust(ames_scale,
FUNcluster = hcut,
method = "wss",
hc_method = "ward.D") +
labs(subtitle = "Elbow method")
```
![](img/103-ames-hcut-Elbow.png){fig-align="center" width=90% height=90%}
#### K-means
```{r}
#| eval: false
fviz_nbclust(ames_scale,
FUNcluster = kmeans,
method = "wss") +
labs(subtitle = "Elbow method")
```
![](img/104-ames-kmeans-Elbow.png){fig-align="center" width=90% height=90%}
:::
- The ***silhouette method*** allows you to calculate how similar each observation is to the cluster it is assigned to, relative to the other clusters. The optimal number of clusters $k$ is the one that ***maximizes the average silhouette*** over a range of possible values of $k$ (a short sketch follows this list).
  - Values close to **1** suggest that the observation is well matched to the assigned cluster.
  - Values close to **0** suggest that the observation is borderline matched between two clusters.
  - Values close to **-1** suggest that the observation may be assigned to the wrong cluster.
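As a quick sketch, the silhouette widths for a single clustering solution can be computed directly with `cluster::silhouette()` and the distance matrix we already built (the choice of two clusters here is arbitrary):
```{r}
set.seed(123)
km_ames <- kmeans(ames_scale, centers = 2, nstart = 10)
sil <- silhouette(km_ames$cluster, ames_dist)
summary(sil)$avg.width   # average silhouette width for this solution
```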
::: {.panel-tabset group="algorithm"}
#### HC
```{r}
#| eval: false
fviz_nbclust(ames_scale,
FUNcluster = hcut,
method = "silhouette",
hc_method = "ward.D") +
labs(subtitle = "Silhouette method")
```
![](img/105-ames-hcut-silhouette.png){fig-align="center" width=90% height=90%}
#### K-means
```{r}
#| eval: false
fviz_nbclust(ames_scale,
FUNcluster = kmeans,
method = "silhouette") +
labs(subtitle = "Silhouette method")
```
![](img/106-ames-kmeans-silhouette.png){fig-align="center" width=90% height=90%}
:::
- **Gap statistic**: Compares the total intra-cluster variation for different values of $k$ with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters will be the value that maximizes the gap statistic, so that the clustering structure is as **far away from a random uniform distribution of points** as possible (see the sketch after the formula).
$$
\text{Gap}(k) = \frac{1}{B} \sum_{b=1}^B \log(W^*_{kb}) - \log(W_k)
$$
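The same statistic can be computed directly with `cluster::clusGap()`; a brief sketch on a small built-in data set to keep the bootstrap quick, where `B` is the number of reference data sets:
```{r}
set.seed(123)
gap_demo <- clusGap(scale(iris[-5]),
                    FUNcluster = kmeans, nstart = 20,
                    K.max = 8, B = 50)
head(gap_demo$Tab)   # columns: logW, E.logW, gap, SE.sim
```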
::: {.panel-tabset group="algorithm"}
#### HC
```{r}
#| eval: false
set.seed(123)
fviz_nbclust(ames_scale,
FUNcluster = hcut,
method = "gap_stat",
hc_method = "ward.D",
nboot = 50,
verbose = FALSE) +
labs(subtitle = "Gap statistic method")
```
![](img/107-ames-hcut-gap.png){fig-align="center" width=90% height=90%}
#### K-means
```{r}
#| eval: false
set.seed(123)
fviz_nbclust(ames_scale,
FUNcluster = kmeans,
method = "gap_stat",
nboot = 50,
nstart = 25,
verbose = FALSE)+
labs(subtitle = "Gap statistic method")
```
![](img/108-ames-kmeans-gap.png){fig-align="center" width=90% height=90%}
:::
- As there are more than 30 indices and methods, we can compute all of them (e.g., with the `NbClust` package) in order to decide the best number of clusters using the **majority rule**.
## Exploratory Factor Analysis
When running an EFA we get two important results:
- **Variables' factor loadings**: Quantify the relationship (correlation) between each factor and each feature.
- **Observations' factor scores**: Estimate how much of each factor each observation presents.
### Coding Example
1. Loading libraries to use
```{r}
library(ade4)
library(GPArotation)
library(psych)
data(olympic)
OlympicDt <- as.data.table(olympic$tab)
extract_fa_summary <- function(x){
as.data.table(
    # The [] is a really good trick
    # to transform the object class from loadings to matrix
x$loadings[],
keep.rownames = "col_names"
)[, `:=`(complexity = x$complexity,
communalities = x$communalities,
uniquenesses = x$uniquenesses)][]
}
```
2. Run the default EFA
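A minimal sketch of this step, assuming the `psych::fa()` defaults (one factor, minimum-residual extraction):
```{r}
# assumption: the "default EFA" here is psych::fa() with its default arguments
olympic_efa <- fa(OlympicDt)
olympic_efa$loadings
```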