Unsupervised learning task
Group sets of objects that share similar characteristic
Good clustering:
Observations in one cluster are highly similar
Observations in different clusters are dissimilar
Categorization of clustering methods
Hierarchicals
Agglomerative
Divisive
Non-hierarchical
Eg: \(k\)-means
Distance measures
Distance between two records
Distance between two clusters
Non-negative: \(d_{ij} \geq 0\)
Self-proximity: \(d_{ii} = 0\) (distance to itself is zero)
Symmetry: \(d_{ij} = d_{ji}\)
Triangle inequality: \(d_{ij} \leq d_{ik} + d_{kj}\)
\(d_{ij}\) - distance metric/ dissimilarity measure between \(i\) and \(j\) records
For \(i\) record we have \(p\) measurements \((x_{i1}, x_{i2}, ..., x_{ip})\) and for \(j\) record we have \(p\) measurements \((x_{j1}, x_{j2}, ..., x_{jp})\)
Euclidean Distance
Scale dependent
Changing the units of one variable have a huge influence on the results
\[d_{ij} = \sqrt{ (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + ...+ (x_{ip} - x_{jp})^2}\]
\[r_{ij} = \frac{\sum_{m=1}^p (x_{im} - \bar{x}_i)(x_{jm} - \bar{x}_j)}{\sqrt{\sum_{m=1}^p (x_{im} - \bar{x}_i)^2 \sum_{m=1}^p (x_{jm} - \bar{x}_j)^2}}\]
\[d_{ij} = 1-r^2_{ij}\]
\[d_{ij} = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}|+...+|x_{ip}-x_{jp}|\]
\[d_{ij} = max_{(m=1, 2, ..p)}|x_{im} - x_{jm}|\]
\[d_{ij} = \sqrt{(\mathbf{x_i}-\mathbf{x_j})'S^{-1}(\mathbf{x_i}-\mathbf{x_j})}\]
\(\mathbf{x_i}\) and \(\mathbf{x_j}\) are \(p\)-dimensional vectors of the mesurements values for records \(i\) and \(j\), respectively; \(S\) is the covariance matrix for these vectors.
Example data
#define four vectors
a <- c(12, 14, 4, 6)
b <- c(5, 4, 6, 3)
c <- c(9, 6, 9, 7)
d <- c(10, 12, 3, 13)
mat <- rbind(a, b, c, d)
mat
## [,1] [,2] [,3] [,4]
## a 12 14 4 6
## b 5 4 6 3
## c 9 6 9 7
## d 10 12 3 13
dist(mat, method="euclidean")
## a b c
## b 12.727922
## c 9.949874 6.708204
## d 7.615773 14.071247 10.440307
dist(mat, method="manhattan")
## a b c
## b 22
## c 17 13
## d 12 26 19
dist(mat, method="maximum")
## a b c
## b 10
## c 8 4
## d 7 10 6
0 | 1 | ||
---|---|---|---|
0 | a | b | a+b |
1 | c | d | c+d |
a+c | b+d | n |
\[\frac{a+d}{n}\]
\[\frac{d}{(b+c+d)}\]
Consider cluster \(A\), which includes the \(m\) records \(A_1, A_2,...A_m\) and Cluster B, which includes \(n\) records \(B_1, B_2, ...B_n\).
\[min(distance(A_i, B_j)), \text{ } i= 1, 2, ...m; \text{ } j=1, 2, ...n\]
\[max(distance(A_i, B_j)), \text{ } i= 1, 2, ...m; \text{ } j=1, 2, ...n\]
\[Average(distance(A_i, B_j)), \text{ } i= 1, 2, ...m; \text{ } j=1, 2, ...n\]
\[distance(\bar{x}_A, \bar{x}_B)\]
Start with each cluster comprising exactly one record (number of observations = number of clusters)
At every step, the two clusters with the smallest distance are merged.
Repeatedly links pairs of clusters until every data object is included in the hierarchy ( until there is only one cluster left at the end or a specified termination condition is satisfied).
Single linkage: minimum distance
Complete linkage: maximum distance
Average linkage: average distance (between all pairs of records)
Centroid linkage: centroid distance (clusters are represented by their mean values for each variable, which forms a vector of means)
Ward’s method: Consider “loss of information” that occurs when records are clustered together.
Further reading and source: https://livebook.manning.com/book/machine-learning-for-mortals-mere-and-otherwise/chapter-17/v-7/65
Tree-like diagram that summarizes the process of clustering.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6.9000 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
data('USArrests')
USArrests
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
## Connecticut 3.3 110 77 11.1
## Delaware 5.9 238 72 15.8
## Florida 15.4 335 80 31.9
## Georgia 17.4 211 60 25.8
## Hawaii 5.3 46 83 20.2
## Idaho 2.6 120 54 14.2
## Illinois 10.4 249 83 24.0
## Indiana 7.2 113 65 21.0
## Iowa 2.2 56 57 11.3
## Kansas 6.0 115 66 18.0
## Kentucky 9.7 109 52 16.3
## Louisiana 15.4 249 66 22.2
## Maine 2.1 83 51 7.8
## Maryland 11.3 300 67 27.8
## Massachusetts 4.4 149 85 16.3
## Michigan 12.1 255 74 35.1
## Minnesota 2.7 72 66 14.9
## Mississippi 16.1 259 44 17.1
## Missouri 9.0 178 70 28.2
## Montana 6.0 109 53 16.4
## Nebraska 4.3 102 62 16.5
## Nevada 12.2 252 81 46.0
## New Hampshire 2.1 57 56 9.5
## New Jersey 7.4 159 89 18.8
## New Mexico 11.4 285 70 32.1
## New York 11.1 254 86 26.1
## North Carolina 13.0 337 45 16.1
## North Dakota 0.8 45 44 7.3
## Ohio 7.3 120 75 21.4
## Oklahoma 6.6 151 68 20.0
## Oregon 4.9 159 67 29.3
## Pennsylvania 6.3 106 72 14.9
## Rhode Island 3.4 174 87 8.3
## South Carolina 14.4 279 48 22.5
## South Dakota 3.8 86 45 12.8
## Tennessee 13.2 188 59 26.9
## Texas 12.7 201 80 25.5
## Utah 3.2 120 80 22.9
## Vermont 2.2 48 32 11.2
## Virginia 8.5 156 63 20.7
## Washington 4.0 145 73 26.2
## West Virginia 5.7 81 39 9.3
## Wisconsin 2.6 53 66 10.8
## Wyoming 6.8 161 60 15.6
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
# Normalize 0-1 datasets:
df <- USArrests %>% mutate_all(function(x) {(x - min(x)) / (max(x) - min(x))})
# Set rowname:
row.names(df) <- row.names(USArrests)
# Compute distances:
dd <- dist(df, method = "euclidean")
# Visualize the dissimilarity:
fviz_dist(dd, lab_size = 7)
# Perform hierarchical clustering:
hc1 <- hclust(dd, method = "single")
hc1
##
## Call:
## hclust(d = dd, method = "single")
##
## Cluster method : single
## Distance : euclidean
## Number of objects: 50
plot(hc1, hang=-1, ann=FALSE)
# Create a function of dendrogram:
dend_func <- (function(x) {fviz_dend(x,
k = 4,
cex = 0.5,
rect = TRUE,
rect_fill = TRUE,
horiz = FALSE,
palette = "jco",
rect_border = "jco",
color_labels_by_k = TRUE) })
dend_func(hc1) -> basic_plot
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
basic_plot + theme_gray() +
theme(plot.margin = unit(rep(0.7, 4), "cm")) +
labs(title = "Hierarchical Clustering with Single Linkage Method")
clust_func<-(function(x){fviz_cluster(list(data = df, cluster = paste0("Group", x)),
alpha = 1,
colors = x,
labelsize = 9,
ellipse.type = "norm")})
sgroup<- cutree(hc1, k = 4)
USArrests$sgroup <- factor(sgroup)
head(USArrests)
## Murder Assault UrbanPop Rape sgroup
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 2
## Arizona 8.1 294 80 31.0 1
## Arkansas 8.8 190 50 19.5 1
## California 9.0 276 91 40.6 1
## Colorado 7.9 204 78 38.7 1
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggplot(USArrests, aes(x=Murder, y=Assault, color = sgroup)) + geom_point()
ggplot(USArrests, aes(x=Murder, y=UrbanPop, color = sgroup)) + geom_point()
hc2 <- hclust(dd, method = "average")
plot(hc2, hang=-1, ann=FALSE)
dend_func(hc2) -> basic_plot2
basic_plot2 + theme_gray() +
theme(plot.margin = unit(rep(0.7, 4), "cm")) +
labs(title = "Hierarchical Clustering with Average Linkage Method")
hc3 <- hclust(dd, method = "ward.D2")
plot(hc3, hang=-1, ann=FALSE)
dend_func(hc3) -> basic_plot3
basic_plot3 + theme_gray() +
theme(plot.margin = unit(rep(0.7, 4), "cm")) +
labs(title = "Hierarchical Clustering with Ward Linkage Method")
sgroup3<- cutree(hc3, k = 4)
USArrests$sgroup3 <- factor(sgroup3)
head(USArrests)
## Murder Assault UrbanPop Rape sgroup sgroup3
## Alabama 13.2 236 58 21.2 1 1
## Alaska 10.0 263 48 44.5 2 2
## Arizona 8.1 294 80 31.0 1 2
## Arkansas 8.8 190 50 19.5 1 3
## California 9.0 276 91 40.6 1 2
## Colorado 7.9 204 78 38.7 1 2
library(GGally)
ggpairs(USArrests, aes(col=sgroup3))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Cluster stability
Cluster separation
Number of clusters
Select k clusters arbitrarily.
Initialize cluster centers with those k clusters.
Do loop
Partition by assigning or reassigning all data objects to their closest cluster center.
Compute new cluster centers as mean value of the objects in each cluster.
Until no change in cluster center calculation
data(USArrests)
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(USArrests), method = "number",
type = "lower")
USArrests <- scale(USArrests)
dim(USArrests)
## [1] 50 4
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
dist.eucl <- dist(USArrests, method = "euclidean")
head(dist.eucl)
## [1] 2.703754 2.293520 1.289810 3.263110 2.651067 3.215297
fviz_dist(dist.eucl)
km.res <- kmeans(USArrests, 4, nstart = 20)
km.res
## K-means clustering with 4 clusters of sizes 13, 8, 16, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 0.6950701 1.0394414 0.7226370 1.27693964
## 2 1.4118898 0.8743346 -0.8145211 0.01927104
## 3 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 2 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 3 3 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 1 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 4 2 4 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 1 4 2 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 1 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 2 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 2
## South Dakota Tennessee Texas Utah Vermont
## 4 2 1 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
##
## Within cluster sum of squares by cluster:
## [1] 19.922437 8.316061 16.212213 11.952463
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
df_member <- cbind(USArrests, cluster = km.res$cluster)
head(df_member)
## Murder Assault UrbanPop Rape cluster
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473 2
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941 1
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388 1
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602 2
## California 0.27826823 1.2628144 1.7589234 2.067820292 1
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207 1
fviz_cluster(km.res, data = USArrests,
palette=c("red", "blue", "black", "darkgreen"),
ellipse.type = "euclid",
star.plot = T,
repel = T,
ggtheme = theme())
dist(km.res$centers)
## 1 2 3
## 2 2.117941
## 3 2.411241 2.684572
## 4 3.887882 3.246984 1.874055
Elbow method
Silhouette method
Gap statistic
Exercise:
Cluster districts according to dengue cases
`r library(mozzie)