BACON is an R package for landmark-based Bayesian clustering of closed polygonal chains, relying on intrinsic shape features (i.e. proportions of interior angles and side lengths). The algorithm accepts the coordinates of closed polygonal chains, extracts these intrinsic shape features, and clusters the shapes using a Gibbs sampler.
Install the package by one of the following methods:
# Install from The Comprehensive R Archive Network (CRAN)
install.packages("BACON")
# Install from GitHub
if (!require("devtools")) install.packages("devtools")
devtools::install_github("kevinwjin/BACON")
To simulate shape data for clustering, begin by adding the relevant functions to your R environment. The following file contains all functions necessary for shape data simulation:
source("~/BACON/code/data_simulation/functions.R")
Next, generate some simulated shape data. Our data simulation code generates the Cartesian coordinates of the vertices of a shape. As an example, we will simulate 1000 20-gons belonging to 10 evenly-spaced clusters. Each cluster will have 100 20-gons within it:
x <- 1000 # 1000 shapes total
z <- 10   # 10 clusters
n <- 100  # 100 shapes per cluster
k <- 20   # Each shape is a 20-gon
dataset <- simulate_shapes(x = x, z = z, n = n, k = k,
                           jitter_factor = 0.01) # Each cluster has 0.01 jitter
As a sanity check, you may visualize the simulated shapes by plotting the coordinates of each shape in the dataset:
for (i in seq_along(dataset)) {
  for (shape in dataset[[i]]) {
    plot(shape, type = "l")
  }
}
We cannot cluster the raw coordinate data directly, as it is sensitive to geometric transformations. Instead, we convert the raw coordinates to normalized compositional data by extracting the intrinsic, transformation-invariant shape features (interior angles and side lengths, each normalized to sum to 1) that our model can cluster. To do this, call get_interior_angles() and get_side_lengths(), and store the resulting interior angle and side length proportions in new matrices:
angles <- matrix(nrow = x, ncol = k, byrow = TRUE)
side_lengths <- matrix(nrow = x, ncol = k, byrow = TRUE)
counter <- 1
for (i in seq_along(dataset)) {
  for (j in dataset[[i]]) {
    angles[counter, ] <- get_interior_angles(j)
    side_lengths[counter, ] <- get_side_lengths(j)
    counter <- counter + 1
  }
}
# Clean up variables
rm(i, j)
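To illustrate why these proportions are unaffected by rotation, scaling, and translation, here is a minimal sketch. It is not BACON's own implementation: `side_length_props()` and `interior_angle_props()` are hypothetical stand-ins for get_side_lengths() and get_interior_angles(), and they assume a convex polygon supplied as a k x 2 matrix of vertices in order, without repeating the first vertex.

```r
# Hypothetical helper: side lengths of a closed polygon, scaled to sum to 1.
# `coords` is assumed to be a k x 2 matrix of vertices listed in order.
side_length_props <- function(coords) {
  nxt <- coords[c(2:nrow(coords), 1), , drop = FALSE] # Next vertex, wrapping around
  len <- sqrt(rowSums((nxt - coords)^2))
  len / sum(len)
}

# Hypothetical helper: interior angles of a closed polygon, scaled to sum to 1.
# Assumes convexity, so each interior angle is the angle between the two
# edges that meet at a vertex.
interior_angle_props <- function(coords) {
  k <- nrow(coords)
  u <- coords[c(k, 1:(k - 1)), , drop = FALSE] - coords # Edge to previous vertex
  v <- coords[c(2:k, 1), , drop = FALSE] - coords       # Edge to next vertex
  cos_ang <- rowSums(u * v) / (sqrt(rowSums(u^2)) * sqrt(rowSums(v^2)))
  ang <- acos(pmin(pmax(cos_ang, -1), 1)) # Clamp against rounding error
  ang / sum(ang)
}

# A 3-4-5 right triangle
tri <- rbind(c(0, 0), c(4, 0), c(0, 3))

# The same triangle rotated by 30 degrees, scaled by 2.5, and shifted
theta <- pi / 6
rot <- rbind(c(cos(theta), -sin(theta)), c(sin(theta), cos(theta)))
shift <- matrix(c(10, -7), nrow = 3, ncol = 2, byrow = TRUE)
tri_moved <- 2.5 * tri %*% t(rot) + shift

# Both comparisons should hold up to floating-point tolerance
all.equal(side_length_props(tri), side_length_props(tri_moved))
all.equal(interior_angle_props(tri), interior_angle_props(tri_moved))
```

The raw coordinates in `tri` and `tri_moved` differ in every entry, yet both feature vectors agree, which is the property the clustering model relies on.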
Finally, we finish the simulated dataset by generating the ground truth containing the cluster labels:
ground_truth <- rep(1:z, each = n)
A template for the above procedure is provided in data_simulation_demo.Rmd.
To use BACON, begin by sourcing the following file into your R environment.
# Source the clustering function into R environment
setwd("~/BACON/code/clustering/") # Adjust to the location of your local copy of the repository
source("bacon.R")
Call the clustering function bacon(), passing it the following arguments:
- side_lengths: n x k matrix containing n samples of k-gon side length proportions (required)
- angles: n x k matrix containing n samples of k-gon angle proportions (required)
- K: Number of clusters a priori (required)
- weight_L: Numerical weight of the contribution of the side length proportions to the mixture model (tuning parameter in [0, 1], differing between datasets; default is 1)
- weight_A: Numerical weight of the contribution of the angle proportions to the mixture model (tuning parameter in [0, 1], differing between datasets; default is 1)
- estimate.s: Whether to estimate the s parameter for shape registration (default is TRUE)
- estimate.r: Whether to estimate the r parameter for shape registration (default is TRUE)
- iter: Number of MCMC iterations (default is 2000)
- burn: Number of MCMC iterations for burn-in (default is iter/2, or 1000)
# Execute the clustering function and save results
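# K = 10 matches the 10 simulated clusters; the remaining values are the documented defaults
res <- bacon(side_lengths, angles, K = 10,
             weight_L = 1, weight_A = 1,
             estimate.s = TRUE, estimate.r = TRUE,
             iter = 2000, burn = 1000)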
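The relationship between iter and burn can be pictured with a toy trace. This is only a sketch of the general burn-in idea: the `trace` matrix below is random filler, not output from bacon(), and how bacon() stores or discards its samples internally is not described here.

```r
set.seed(1)
iter <- 2000
burn <- iter / 2

# Stand-in for an MCMC trace: one row of cluster labels per iteration
# (5 hypothetical shapes, 3 hypothetical clusters, random values)
trace <- matrix(sample(1:3, iter * 5, replace = TRUE), nrow = iter, ncol = 5)

# Discard the first `burn` iterations and keep the rest for inference
kept <- trace[(burn + 1):iter, , drop = FALSE]
dim(kept)
```

With the default settings, half of the iterations are treated as burn-in and the remaining 1000 are retained.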
Finally, evaluate the clustering accuracy by calculating the adjusted Rand index (ARI) between the estimated clusters and the ground truth. We will do this by calling the adjustedRandIndex() function from the mclust package.
# Return the ARI of BACON-derived clusters and the ground truth
mclust::adjustedRandIndex(res$cluster, ground_truth)
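For intuition about what the ARI measures, the following base-R sketch implements the standard pair-counting formula. The function name `ari()` is hypothetical and is shown only for illustration; in practice, mclust::adjustedRandIndex() is the function to call.

```r
# Hypothetical helper: adjusted Rand index from two label vectors
ari <- function(labels_a, labels_b) {
  tab <- table(labels_a, labels_b)                    # Contingency table
  pairs_cells <- sum(choose(tab, 2))                  # Pairs grouped together in both
  pairs_rows <- sum(choose(rowSums(tab), 2))          # Pairs grouped together in a
  pairs_cols <- sum(choose(colSums(tab), 2))          # Pairs grouped together in b
  expected <- pairs_rows * pairs_cols / choose(sum(tab), 2)
  (pairs_cells - expected) / ((pairs_rows + pairs_cols) / 2 - expected)
}

truth <- rep(1:3, each = 4)
relabeled <- c(3, 2, 1)[truth]        # Same partition, different label names
one_moved <- replace(truth, 1, 2)     # One shape assigned to the wrong cluster

ari(truth, relabeled) # Identical partitions score 1 regardless of label names
ari(truth, one_moved) # Any disagreement lowers the score below 1
```

Because the index depends only on which shapes are grouped together, the arbitrary numbering of clusters produced by a sampler does not affect the score.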
The repository is organized as follows:
- code/clustering/ - Code for shape clustering
- code/data_simulation/ - Code for simulating data for model testing
- code/model_testing/ - Code for model testing on data
- data/ - Simulated and real datasets
- figures/ - Figures from model testing
BACON depends on the following R packages:
- Rcpp - C++ implementation of the Markov chain Monte Carlo (MCMC) algorithm
- RcppArmadillo - Fast matrix operations (requires GNU Fortran library)
- RcppEigen - Fast matrix operations
- RcppDist - Call probability distributions from within C++
- MCMCpack - Provides the rdirichlet() function
- mcclust - MCMC clustering sample processing
- sf - Spatial point detection for interior angle calculation
- mclust - Implementation of Gaussian mixture model for comparison of clustering methods
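One way to check which of the packages listed above are already installed is sketched below. It assumes the packages are obtainable from CRAN; the install line is left commented out so that nothing is installed unintentionally.

```r
deps <- c("Rcpp", "RcppArmadillo", "RcppEigen", "RcppDist",
          "MCMCpack", "mcclust", "sf", "mclust")

# requireNamespace() returns FALSE for packages that cannot be loaded
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- deps[!installed]

if (length(missing_deps) > 0) {
  message("Missing packages: ", paste(missing_deps, collapse = ", "))
  # install.packages(missing_deps) # Uncomment to install
}
```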