1 A Simple Example First: T-shirt Sizes

To begin with, note that sizes such as S, M, and L are not given in advance; a company only has some measurements of its customers. Think about how a tailor makes a customer-tailored T-shirt: they may measure your neck width (collar), arm length, chest width, waistline, and so on. Most apparel companies, however, want to offer as few sizes as possible so that they can cover most of their target customers while saving cost. Let’s say they only want to have five sizes. The problem is then how to choose these five sizes so that most customers can buy a comfortable T-shirt and, when they pick the right size, the shirt is neither too large nor too small. In statistics, this problem is equivalent to finding five clusters based on the provided measurements so that the variation within clusters is small while the variation between clusters is large.
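As a toy illustration of this idea (separate from the seeds analysis below), we can simulate made-up body measurements and ask `kmeans` for five clusters; the five cluster centers then play the role of the five T-shirt sizes. The variable names and the simulated distributions here are hypothetical, chosen only for illustration.

```r
# Hypothetical illustration: simulate two body measurements for 300 customers
set.seed(1)
customers <- data.frame(
  chest = rnorm(300, mean = 100, sd = 8),  # chest width in cm (made up)
  arm   = rnorm(300, mean = 60,  sd = 5)   # arm length in cm (made up)
)

# Ask k-means for five "sizes" (clusters)
fit <- kmeans(customers, centers = 5, nstart = 25)

# The five cluster centers are the five representative T-shirt sizes
fit$centers
```

Each customer is assigned (via `fit$cluster`) to the center closest to their own measurements, which is exactly the "buy the size that fits best" rule.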


2 Summary of Seeds data

We use the seeds data set to demonstrate cluster analysis in R. The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa, and Canadian, with 70 elements each. A description of the data set can be viewed at (https://archive.ics.uci.edu/ml/datasets/seeds). Seven geometric measurements of each wheat kernel were recorded. Assume we only have these seven measurements (x), and our task is to cluster or group the 210 seeds (so we remove the variety label in column V8).

seed <- read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt', header=F)
seed <- seed[,1:7]  # keep the seven measurements; drop the variety label V8
colnames(seed) <- c("area", "perimeter", "compactness", "length", "width", "asymmetry", "groovelength")

Scale data to have zero mean and unit variance for each column:

seed <- scale(seed) 
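To see concretely what `scale()` does, here is a small self-contained check on a made-up matrix (the matrix `m` is only an example, not the seeds data): after scaling, every column has mean 0 and standard deviation 1.

```r
# Illustration on a tiny matrix: scale() subtracts each column's mean
# and divides by each column's standard deviation
m  <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
ms <- scale(m)

colMeans(ms)      # each column mean is (numerically) 0
apply(ms, 2, sd)  # each column standard deviation is 1
```

The same check can be run on the scaled `seed` matrix above.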


3 K-means

The basic idea of k-means clustering is to define clusters and then minimize the total intra-cluster variation (known as the total within-cluster variation). The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid: \[W(C_k) = \sum_{x_i \in C_k}(x_i - \mu_k)^2,\] where \(x_i\) is a data point assigned to cluster \(C_k\) and \(\mu_k\) is the mean of the points assigned to \(C_k\).
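The quantity \(W(C_k)\) can be recomputed by hand and compared against what `kmeans` reports. The following self-contained sketch uses simulated data (not the seeds data) and checks that the sum of the per-cluster \(W(C_k)\) values matches `fit$tot.withinss`.

```r
# Verify the within-cluster sum-of-squares formula on simulated data
set.seed(42)
x   <- matrix(rnorm(100 * 2), ncol = 2)
fit <- kmeans(x, centers = 3, nstart = 25)  # default algorithm: Hartigan-Wong

# Recompute W(C_k) for each cluster: sum of squared Euclidean
# distances from each point to its cluster centroid
W <- sapply(1:3, function(k) {
  pts <- x[fit$cluster == k, , drop = FALSE]
  mu  <- fit$centers[k, ]
  sum(sweep(pts, 2, mu)^2)
})

all.equal(sum(W), fit$tot.withinss)  # TRUE
```

This makes explicit that k-means is minimizing exactly the objective written above, summed over all clusters.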

For clustering, one can rely on many kinds of distance measures, and the choice is a critical point. The distance measure determines how the similarity of two elements \((x, z)\) is computed, and it strongly influences the results of the clustering analysis. The classical distance measures are the Euclidean and Manhattan distances, which are defined as follows:

Euclidean distance:

\(d_{euc}(x,z) = \sqrt{\sum^n_{i=1}(x_i - z_i)^2} \tag{1}\)

Manhattan distance:

\(d_{man}(x,z) = \sum^n_{i=1}|(x_i - z_i)| \tag{2}\)

Pearson correlation distance:

\(d_{cor}(x, z) = 1 - \frac{\sum^n_{i=1}(x_i-\bar x)(z_i - \bar z)}{\sqrt{\sum^n_{i=1}(x_i-\bar x)^2\sum^n_{i=1}(z_i - \bar z)^2}} \tag{3}\)
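The three formulas above are short enough to compute by hand for a pair of small vectors; the sketch below does so and cross-checks the first two against R's built-in `dist()`. The vectors `x` and `z` are arbitrary example values.

```r
# Two small example vectors
x <- c(1, 3, 5, 7)
z <- c(2, 2, 6, 4)

d_euc <- sqrt(sum((x - z)^2))  # Euclidean distance, formula (1)
d_man <- sum(abs(x - z))       # Manhattan distance, formula (2): here 6
d_cor <- 1 - cor(x, z)         # Pearson correlation distance, formula (3)

# Cross-check against R's dist() on a two-row matrix
dist(rbind(x, z), method = "euclidean")
dist(rbind(x, z), method = "manhattan")
```

Note that the correlation distance ignores differences in location and scale: it is small whenever the two vectors move together, even if their magnitudes differ greatly.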

Before conducting k-means clustering, we can calculate the pairwise distances between rows (observations) to roughly check whether some observations are close to each other. Specifically, we can use get_dist to calculate the pairwise distances (the default is the Euclidean distance), and then fviz_dist to visualize the distance matrix generated by get_dist.

library(factoextra)  # provides get_dist() and fviz_dist()
distance <- get_dist(seed)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))