Examples of Data Mining Algorithms and General
Principles
Xiaorui (Jeremy) Zhu
03/09/2026
Supervised Learning
- Labeled data
- The goal is to predict or explain a certain outcome
- Types of Data Mining problems:
- Regression: outcome is continuous
- Classification: outcome is categorical
- Popular ML algorithms:
- Least squares, nearest neighbors, CART, gradient boosting, neural
networks, deep learning
A simple linear regression

The estimated linear regression model is
\[
\textit{Expected Housing Price} = -34.7 + 9.1 \times \textit{Number of
Rooms}
\]
In general
\[
Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon
\]
\[
\hat{Y} = \hat{f}(\mathbf{x}) =
\hat{\boldsymbol{\beta}}^{\top}\mathbf{x}
\]
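As a sketch of how such a model is fit in R (on simulated data, since the housing dataset behind the slide's numbers is not included here, the true coefficients below are simply set to match the slide):

```r
# Simulate housing-style data (hypothetical numbers, not the slide's dataset)
set.seed(1)
rooms <- runif(100, 4, 8)                          # number of rooms
price <- -34.7 + 9.1 * rooms + rnorm(100, sd = 3)  # true line plus noise

fit <- lm(price ~ rooms)  # least-squares fit
coef(fit)                 # estimates should land near -34.7 and 9.1

# Predicted expected price for a 6-room house
predict(fit, newdata = data.frame(rooms = 6))
```

With more noise or fewer observations the estimates drift further from the true coefficients, which previews the variance discussion later in the deck.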
From Continuous to Categorical
Outcome

Classification – an illustration

Source: ISLR, p. 38, Figure
2.13
Classification Methods
- K-Nearest Neighbor
- Logistic regression
- Classification tree
- Random forest
- Boosted tree
- Support vector machine
- Neural networks
- Deep learning
- …
Why Not Linear Regression
- Example: default prediction
- Default (\(Y=1\)) vs. Nondefault (\(Y=0\))
- \(X_1\): credit card balance level, \(X_2\): income level
- Suppose the linear probability model (LPM) is
\[
\hat{p}= -1.5+2X_1-X_2
\]
- What is the predicted value if a person’s balance level is 1 and
income level is 3?
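Working out the slide's question under the stated LPM:

```r
# Linear probability model from the slide: p-hat = -1.5 + 2*X1 - X2
X1 <- 1  # balance level
X2 <- 3  # income level
p_hat <- -1.5 + 2 * X1 - X2
p_hat  # -2.5: an impossible "probability", the core problem with the LPM
```

A fitted value below 0 (or above 1) cannot be a probability, which motivates the logistic model introduced below.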
An Illustration


An Illustration

Classification
- Most DM algorithms produce probabilistic outcomes
- e.g. probability that X belongs to each class
- Classification is based on certain decision rules
Classification: Logistic Regression
- For binary response:
\[
\mathbb{P}(y_i=1|\mathbf{x}_i) = \frac{1}{1+\exp(-\boldsymbol{\beta}^{\top}
\mathbf{x}_i)}
\]
- Sigmoid function: \(s(u)=\frac{1}{1+e^{-u}}\)
- Interpretation: *probability* of event conditional on \(X\)
- More than two classes: *multinomial logistic model*
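A minimal sketch in R: the sigmoid by hand, plus a logistic fit via `glm()` on the built-in `mtcars` data (chosen here purely for illustration):

```r
# Sigmoid maps any real value into (0, 1)
sigmoid <- function(u) 1 / (1 + exp(-u))
sigmoid(0)  # 0.5

# Logistic regression: predict a binary response (manual transmission, am)
# from weight (wt); mtcars ships with base R
fit <- glm(am ~ wt, data = mtcars, family = binomial)
head(predict(fit, type = "response"))  # fitted probabilities, all in (0, 1)
```

Unlike the linear probability model, every fitted value here is a valid probability.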
Prediction — From Probability to
Class
- Direct outcome of model: probability
- Next step: classification
- Need decision rule (cut-off probability – p-cut)
Confusion Matrix
|            | Predicted = 1       | Predicted = 0       |
|------------|---------------------|---------------------|
| True = 1   | True Positive (TP)  | False Negative (FN) |
| True = 0   | False Positive (FP) | True Negative (TN)  |
- Classification table based on a specific cut-off probability
- Used for model assessment
- FP: type I error; FN: type II error
- Different p-cut values result in different confusion matrices
Some Useful Measures
- Misclassification rate (MR) = \(\frac{\text{FP+FN}}{\text{Total}}\)
- True positive rate (TPR) = \(\frac{\text{TP}}{\text{TP+FN}}\):
Sensitivity or Recall
- True negative rate (TNR) = \(\frac{\text{TN}}{\text{FP+TN}}\):
Specificity
- False positive rate (FPR) = \(\frac{\text{FP}}{\text{FP+TN}}\): 1 –
Specificity
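These measures follow directly from the confusion matrix at a given cut-off; a sketch on made-up labels and probabilities:

```r
# Toy labels and predicted probabilities (made up for illustration)
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob  <- c(0.9, 0.7, 0.3, 0.4, 0.2, 0.6, 0.1, 0.8)
pcut  <- 0.5
pred  <- as.integer(prob > pcut)   # decision rule: classify 1 if prob > pcut

TP <- sum(pred == 1 & truth == 1)
FN <- sum(pred == 0 & truth == 1)
FP <- sum(pred == 1 & truth == 0)
TN <- sum(pred == 0 & truth == 0)

MR  <- (FP + FN) / length(truth)   # misclassification rate
TPR <- TP / (TP + FN)              # sensitivity / recall
TNR <- TN / (FP + TN)              # specificity
FPR <- FP / (FP + TN)              # 1 - specificity
c(MR = MR, TPR = TPR, TNR = TNR, FPR = FPR)
```

Changing `pcut` re-shuffles the four cells, which is exactly why the ROC curve below sweeps over all cut-offs instead of fixing one.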
ROC Curve
- Receiver Operating Characteristic
- Plot of FPR (X) against TPR (Y) at various p-cut values
- Overall model assessment (not for a particular decision rule)
- Unique for a given model
ROC Curve

AUC of ROC Curve
The overall performance of the model can also be assessed using the
area under the curve (AUC) measure, which ranges from 0 (worst possible
model) to 1 (perfect model).
- The diagonal line (random guessing) has an AUC value of 0.5.
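A hand-rolled sketch of the ROC/AUC computation on the same made-up data (in practice a package such as pROC would be used):

```r
# Toy labels and predicted probabilities (made up for illustration)
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob  <- c(0.9, 0.7, 0.3, 0.4, 0.2, 0.6, 0.1, 0.8)

# Sweep the cut-off over all observed probabilities (plus the endpoints)
cuts <- sort(unique(c(0, prob, 1)), decreasing = TRUE)
tpr <- sapply(cuts, function(p) mean(prob[truth == 1] >= p))
fpr <- sapply(cuts, function(p) mean(prob[truth == 0] >= p))

# AUC by the trapezoidal rule along the (FPR, TPR) curve
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc  # 0.875 for this toy example
```

Plotting `fpr` against `tpr` gives the ROC curve itself; the single number `auc` summarizes the whole curve, independent of any particular cut-off.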
Classification: K-nearest neighbor
Idea:
- It uses past observations that are most similar to a new
observation to classify or predict the target variable.
- KNN identifies a number of existing observations that are most
similar to the new observation for classification.
- 1st component: a similarity measure (typically Euclidean distance on
standardized variables).
- How to define a neighbor?
- How to find the closest points to \(\mathbf{x}_i\)?
- 2nd component: the number of nearest neighbors to consider, \(k\).
K-nearest neighbor

Distance-based Similarity Measures
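The standard choice is the Euclidean distance between two observations \(\mathbf{x}_i\) and \(\mathbf{x}_j\) with \(p\) variables (typically computed after standardizing each variable so no single variable dominates):

\[
d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{l=1}^{p} (x_{il} - x_{jl})^2}
\]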

What is the impact of \(k\) – an example

Source: ESL, pp. 15–16
K-nearest neighbor algorithm
- Step 1: Selecting the optimal value of K: number of nearest
neighbors
- Step 2: Calculating distance: Euclidean distance is widely
used.
- Step 3: Finding Nearest Neighbors
- The k data points with the smallest distances to the target point
are nearest neighbors.
- Step 4: Voting for Classification or Taking Average for
Regression
For example, to classify an email as spam or not spam, the K-Nearest
Neighbors (KNN) algorithm examines the K most similar emails in the
dataset.
It then checks the categories of these neighboring emails and assigns
the new email to the category that appears most frequently, a process
known as majority voting.
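The four steps can be sketched in a few lines of base R (standardization is omitted for brevity; in practice the `class` package's `knn()` is the usual choice):

```r
# A minimal KNN classifier: Euclidean distance + majority vote (sketch)
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Step 2: Euclidean distance from new_x to every training point
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Steps 3-4: take the k nearest neighbors and vote
  votes <- table(train_y[order(d)[1:k]])
  names(votes)[which.max(votes)]
}

# Example: classify a flower from its four iris measurements
x <- as.matrix(iris[, 1:4])
y <- iris$Species
knn_predict(x, y, new_x = c(5.0, 3.5, 1.4, 0.2), k = 5)  # lands near setosa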
Clustering – an unsupervised learning
method
- Goal: find subgroups of the sample observations
- Not based on any single variable (e.g. gender, race)
- Based on all given variables
- The number of subgroups is subjective
- Approaches:
- K-means clustering
- Hierarchical clustering
- Model-based clustering
Application: Clustering Iris flowers in Iris
dataset
This is a very famous dataset in almost all data mining and machine
learning courses, and it is a built-in R dataset. It was introduced by
Sir Ronald Fisher in 1936. The dataset consists of 50 samples from each
of three species of Iris flowers (Iris setosa, Iris virginica, and Iris
versicolor). Four features (variables) were measured from each sample:
the length and the width of the sepal and petal, in centimeters.
# Load the iris dataset
data("iris")
plot(x = iris$Petal.Length, y = iris$Sepal.Length,
     main = "Scatter Plot of Sepal Length vs Petal Length", # Plot title
     xlab = "Petal Length",  # X-axis label
     ylab = "Sepal Length",  # Y-axis label
     pch = 19                # Use solid circles for points
)

An example – Iris data

An example – Iris data

K-means clustering – step-by-step
Run the following R code, and see what it does.

This is a random grouping (first step)

The next step
Run the following R code, and see what it does.

Data points are regrouped (second step)
How does this happen?

Let’s repeat the second step

Note that the code does not change at all. Why?
Data points are regrouped again

We can keep repeating this step,
until…
Members in each cluster do not change, which means the algorithm has
converged. How can we translate this into some numeric scores?
Statistics behind k-means clustering
The algorithm attempts to:
- Minimize variance within clusters
- Maximize variance between clusters
- Think about total variance too
Instead of computing variance, we compute the sum of squared errors
(SSE).
- For a single variable \(X\), SSE is
defined as
\[
SSE(X)= \sum_{i=1}^{n}(X_i-\bar{X})^2
\]
- For multiple variables \(\mathbf{X}=(X_1,
\ldots, X_p)\), SSE is defined as
\[
SSE(\mathbf{X}) = \sum_{i=1}^{n}\sum_{j=1}^{p}(X_{ij}-\bar{X}_j)^2
\]
Statistics behind k-means clustering
- Computing \(SSE(\mathbf{X})\) for the entire sample gives the
total sum of squares (SST)
- Computing \(SSE(\mathbf{X})\) for each cluster gives the
within-group sum of squares
- The between-group sum of squares is the difference between
the total SS and the sum of the within-group SS
- Exercise: revisit the algorithm we just performed and compute the
above three measures at the end of each step. How are they changing over
iterations?
K-means algorithm
- Randomly find \(k\) data points
(observations) as the initial centers
- For each data point, find the closest center and label it. Now you
have \(k\) clusters
- Re-calculate the centers of current clusters
- Repeat step 2 and step 3 until the centers do not change
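These four steps are what `stats::kmeans()` implements; a quick run on the iris measurements:

```r
# K-means with k = 3 on the four numeric iris variables
set.seed(123)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)  # nstart: random restarts
km$centers          # final cluster centers (step 3, after convergence)
table(km$cluster)   # cluster sizes

# Compare the found clusters with the (unused) species labels
table(km$cluster, iris$Species)
```

The species labels play no role in the fit; the cross-table simply shows how well the unsupervised clusters line up with them.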
Classification problems
- Response variable is qualitative
- Train a classifier \(\hat{C}(x)\)
that can label any new input data \(x\)
- It usually involves a decision rule
- Prediction (testing) error: misclassification rate (MR)
Bias-Variance Tradeoff
- This is a very important tradeoff that governs the choice of
statistical learning methods
- Bias: how far the estimator \(\bar{X}\) is from the true population mean
\(\mu\)
- An unbiased estimator such as the sample mean satisfies \(\mathbb{E}\bar{X}=\mu\)
- Variance: the variation of the sample mean \(\bar{X}\) is \(Var(\bar{X}) = \sigma^2/n\)

Bias-Variance Tradeoff
- Bias: how far the estimated model \(\hat{f}(x)\) is from the true model \(f(x)\)
- Unbiased estimate is defined as \(\mathbb{E}\hat{f}(x)=f(x)\)
- Usually, we calculate the squared bias: \((\mathbb{E}\hat{f}(x)-f(x))^2\)
- Variance: the variation of estimated model \(\hat{f}(x)\) based on different training
sets
Bias-Variance Tradeoff
- Suppose the data arise from a model \(Y=f(x)+\epsilon\), with \(\mathbb{E}(\epsilon)=0\) and \(Var(\epsilon)=\sigma^2\)
- Suppose \(\hat{f}(x)\) is trained
based on some training data, and let \((x_0,
y_0)\) be a test observation from the same population
- The expected prediction error can be decomposed
as:
\[
\mathbb{E}[y_0-\hat{f}(x_0)]^2 = \sigma^2 + Bias^2(\hat{f}(x_0)) +
Var(\hat{f}(x_0))
\]
- Typically, as the flexibility of \(\hat{f}\) increases, its variance increases
and its bias decreases
Model assessment (for supervised
learning)
How do we know if the estimated model \(\hat{f}(x)\) is
useful?
- We never know the true \(f(x)\)
- Split sample into training and testing sets
- Train the model based on training sample by minimizing training
error
- Apply the estimated model on the testing sample to calculate the
prediction (testing) error
- We care more about testing error than training error
Prediction error

Accuracy illustration

K-fold cross-validation
- Instead of doing training versus testing once, we do it \(K\) times (e.g., \(K=5\): split the sample into 5 folds)

- Use 2, 3, 4, 5 as training and 1 as testing
- Use 1, 3, 4, 5 as training and 2 as testing
- Keep doing this loop
- Average the 5 testing errors; that is the CV score
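The loop above can be sketched in R (using the built-in `mtcars` data only as a stand-in):

```r
# 5-fold cross-validation for a linear model, by hand (sketch)
set.seed(42)
folds <- sample(rep(1:5, length.out = nrow(mtcars)))  # random fold labels

mse <- sapply(1:5, function(k) {
  fit <- lm(mpg ~ wt, data = mtcars[folds != k, ])      # train on 4 folds
  test <- mtcars[folds == k, ]                          # hold out fold k
  mean((test$mpg - predict(fit, newdata = test))^2)     # testing error
})
mean(mse)  # the CV score
```

Each observation is used for testing exactly once, so the CV score uses all the data without ever scoring a model on its own training points.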
Leave-one-out cross-validation
- By the name, it requires repeating the training-testing procedure
\(n\) times
- However, for the least-squares linear model, there is a shortcut that
makes LOOCV the same cost as a single model fit
\[
CV_n=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2
\]
where \(h_i\) is the diagonal
element of the hat matrix.
- In general, the estimates from LOOCV are highly correlated, hence
their average can have high variance
- In practice, \(K=5\) or \(10\) is recommended
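The shortcut can be checked numerically against the explicit leave-one-out loop (again with `mtcars` as a stand-in dataset):

```r
# LOOCV shortcut for least squares: residuals scaled by (1 - h_i)
fit <- lm(mpg ~ wt, data = mtcars)
h   <- hatvalues(fit)  # diagonal elements of the hat matrix
cv_shortcut <- mean((residuals(fit) / (1 - h))^2)

# Explicit leave-one-out loop for comparison: n separate refits
n <- nrow(mtcars)
errs <- sapply(1:n, function(i) {
  f <- lm(mpg ~ wt, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(f, newdata = mtcars[i, ])
})
cv_loop <- mean(errs^2)

all.equal(cv_shortcut, cv_loop)  # identical up to rounding
```

One fit plus `hatvalues()` replaces \(n\) refits, which is why LOOCV is essentially free for least squares.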
Summary
- Supervised learning includes regression and classification
- K-nearest neighbors relies on similarity and the choice of \(k\)
- K-means clustering is an iterative unsupervised learning
algorithm
- Model assessment should focus on testing error
- The bias-variance tradeoff helps explain model flexibility
- Cross-validation is a practical tool for model selection and
evaluation