1 Supervised Learning: Classification Analysis

The outcome \(Y\) is available (also called dependent variable, response in statistics). Then, a set of predictors, regressors, covariates, features, or independent variables.

There are two main types of problems, regression problem and classification problem.

1.1 K-Nearest Neighbor (KNN)

In order to demonstrate this simple machine learning algorithm, I use Iris dataset, a famous dataset for almost all machine learning courses, and apply KNN onto the dataset to train a classifier for Iris Species.

1.1.1 Load and prepare the Iris dataset

Before start, always do

  • set the working directory!
  • create a new R script (unless you are continuing last project)
  • Save the R script.

Let’s first load the Iris dataset. This is a very famous dataset in almost all data mining, machine learning courses, and it has been an R build-in dataset. The dataset consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginicaand Iris versicolor). Four features(variables) were measured from each sample, they are the length and the width of sepal and petal, in centimeters. It is introduced by Sir Ronald Fisher in 1936.

  • 3 Species