In supervised learning, the outcome \(Y\) (also called the dependent variable or response in statistics) is available, together with a set of predictors, also called regressors, covariates, features, or independent variables.
There are two main types of supervised learning problems: regression problems and classification problems.
To demonstrate this simple machine learning algorithm, I use the Iris dataset, a famous dataset featured in almost all machine learning courses, and apply KNN to it to train a classifier for Iris species.
Before building a model, always explore the data first.
Let’s first load the Iris dataset. This is a very famous dataset in almost all data mining and machine learning courses, and it is a built-in R dataset. The dataset consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica, and Iris versicolor). Four features (variables) were measured on each sample: the length and the width of the sepal and petal, in centimeters. The dataset was introduced by Sir Ronald Fisher in 1936.
The iris flower data set is included in R. It is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
First, load the iris data into the current workspace:
data(iris)
iris
What is in the dataset? You can use head() or tail() to print the first or last few rows of the dataset:
head(iris)
Check the dimensionality: the dataset has 150 rows (observations) and 5 columns (variables).
dim(iris)
## [1] 150 5
Variable names or column names
names(iris); # or colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Structure of the data frame; note the difference between num and Factor.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
By default (prior to R 4.0.0), R treats strings as factors (categorical variables). In many situations (for example, building a regression model) this is what you want, because R can automatically create “dummy variables” from the factors. However, when merging data from different sources this can cause errors. In that case you can use the stringsAsFactors = FALSE option in read.table.
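As a sketch (the file name flowers.csv is hypothetical), reading a CSV with stringsAsFactors = FALSE keeps character columns as character instead of converting them to factors:
dat <- read.table("flowers.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)   # hypothetical file; strings stay as character
str(dat)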
class(iris[,1])
## [1] "numeric"
class(iris[,5])
## [1] "factor"
For simple summary statistics, try the summary() function.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
## Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Suppose we use the first 30 observations of each species as the training sample and the rest as the testing sample.
# Subset each species; subsetting already returns a data frame, so no extra rbind() is needed here
setosa     <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
virginica  <- iris[iris$Species == "virginica", ]
ind <- 1:30
# First 30 observations of each species for training, the remaining 20 for testing
iris_train <- rbind(setosa[ind, ], versicolor[ind, ], virginica[ind, ])
iris_test  <- rbind(setosa[-ind, ], versicolor[-ind, ], virginica[-ind, ])
Exercise (HW 1): Randomly sample a training data set that contains 80% of the original data points.
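One possible approach (a minimal sketch, not the only way; the variable names iris_train80 and iris_test20 and the seed value are just illustrative) is to draw 80% of the row indices at random with sample():
set.seed(123)                                   # arbitrary seed, for reproducibility
train_idx    <- sample(nrow(iris), size = 0.8 * nrow(iris))
iris_train80 <- iris[train_idx, ]               # 120 observations (80%)
iris_test20  <- iris[-train_idx, ]              # remaining 30 observations (20%)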
In R, the knn() function performs K-nearest neighbor classification. It is in the package "class".
install.packages("class")
library(class)
knn_iris <- knn(train = iris_train[, -5], test = iris_test[, -5], cl=iris_train[,5], k=5)
Here, the function knn() requires at least 3 inputs (train, test, and cl); the remaining inputs have default values. train is the training dataset without the label (Y), test is the testing sample without the label, and cl specifies the labels of the training dataset. By default \(k=1\), which results in 1-nearest neighbor.
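For comparison, rerunning the call above without specifying k falls back to the default 1-nearest neighbor (an illustrative variation, not part of the original exercise):
knn_iris_k1 <- knn(train = iris_train[, -5], test = iris_test[, -5], cl = iris_train[, 5])  # k = 1 by default
table(iris_test[, 5], knn_iris_k1, dnn = c("True", "Predicted"))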
Here, I use the test set to create a contingency table and show the performance of the classifier.
table(iris_test[,5], knn_iris, dnn = c("True", "Predicted"))
## Predicted
## True setosa versicolor virginica
## setosa 20 0 0
## versicolor 0 19 1
## virginica 0 0 20
sum(iris_test[,5] != knn_iris)
## [1] 1
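You can also summarize the performance as a single misclassification rate; given the table above, 1 error out of 60 test observations gives roughly 0.017:
mean(iris_test[, 5] != knn_iris)   # proportion of misclassified test observations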
If you only have a set of features (predictors) measured on a set of samples (observations), and you do not have an outcome variable, you are dealing with an unsupervised learning problem.
In this case, your objective could be very different: for example, you may want to group observations into clusters based on how similar their features are. Clustering can also be a useful pre-processing step to obtain labels for supervised learning.
K-means clustering with 5 clusters: the 'fpc' package provides the 'plotcluster' function for visualizing the result. You need to run install.packages('fpc') to install it first.
install.packages("fpc")
library(fpc)
fit <- kmeans(iris[,1:4], 5)
plotcluster(iris[,1:4], fit$cluster)
The first argument of the kmeans function is the dataset you wish to cluster, here columns 1 through 4 of the iris dataset; the last column is the true category of each observation, so we do not include it in the analysis. The second argument, 5, indicates that you want a 5-cluster solution. The result of the cluster analysis is then assigned to the variable fit, and the plotcluster function is used to visualize the result.
Do you think it is a good solution? Try it with 3 clusters.
kmeans_result <- kmeans(iris[,1:4], 3)
plotcluster(iris[,1:4], kmeans_result$cluster)
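One informal way to judge the 3-cluster solution is to cross-tabulate the cluster assignments against the true species (possible here only because the labels happen to be known):
table(kmeans_result$cluster, iris$Species)   # rows are clusters, columns are true species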
hc_result <- hclust(dist(iris[,1:4]))
plot(hc_result)
# Cut the dendrogram into 3 clusters
rect.hclust(hc_result, k=3)
Three things happen in the first line. First, dist(iris[, 1:4]) calculates the distance matrix between observations (how similar the observations are to each other, judging from the 4 numerical variables). Then hclust takes the distance matrix as input and produces a hierarchical clustering solution. Finally, the solution is assigned to the variable hc_result. In hierarchical clustering you do not need to specify the number of clusters in advance; it depends on where you cut the dendrogram.
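If you want explicit cluster labels from the dendrogram, a minimal sketch uses cutree() from base R to cut the tree into 3 groups:
hc_clusters <- cutree(hc_result, k = 3)   # cluster assignment for each observation
table(hc_clusters, iris$Species)          # compare with the true species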