Examples of Data Mining Algorithms and General Principles

Xiaorui (Jeremy) Zhu

03/09/2026

Supervised Learning

A simple linear regression

The estimated linear regression model is

\[ \textit{Expected Housing Price} = -34.7 + 9.1 \times \textit{Number of Rooms} \]

In general

\[ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \]

\[ \hat{Y} = \hat{f}(\mathbf{x}) = \hat{\boldsymbol{\beta}}^{\top}\mathbf{x} \]
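The fitted equation above can be reproduced with `lm()`; a minimal sketch, assuming the price/rooms numbers come from the Boston housing data in the MASS package:

```r
# Fit a simple linear regression of housing price on number of rooms
library(MASS)                       # provides the Boston housing data

fit <- lm(medv ~ rm, data = Boston) # medv: median home value, rm: avg rooms
round(coef(fit), 1)                 # intercept about -34.7, slope about 9.1
```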

From Continuous to Categorical Outcome

Classification – an illustration

Source: ISLR, p. 38, Figure 2.13

Classification Methods

Why Not Linear Regression

\[ \hat{p}= -1.5+2X_1-X_2 \]
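With the coefficients above, a linear model's "probability" is not bounded; a small illustration with made-up predictor values:

```r
# Linear "probability" p-hat = -1.5 + 2*x1 - x2 is not confined to [0, 1]
x1 <- c(0.0, 0.5, 1.5)
x2 <- c(1.0, 0.0, 0.0)
p_hat <- -1.5 + 2 * x1 - x2
p_hat  # -2.5 -0.5  1.5: none of these is a valid probability
```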

An Illustration

An Illustration

Classification

Threshold    < 0.1         > 0.1
Class        Nondefault    Default

Classification: Logistic Regression

\[ \mathbb{P}(y_i=1|\mathbf{x}_i) = \frac{1}{1+\exp(-\boldsymbol{\beta}^{\top} \mathbf{x}_i)} \]
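A minimal sketch of fitting this model with `glm()`, using simulated default data (the variable names and true coefficients here are illustrative, not from the slides):

```r
# Simulate a binary outcome and fit logistic regression with glm()
set.seed(1)
n <- 200
balance <- runif(n, 0, 2)
y <- rbinom(n, 1, plogis(-2 + 2 * balance))  # true model on the logit scale

fit <- glm(y ~ balance, family = binomial)
p_hat   <- predict(fit, type = "response")   # P(y = 1 | x), always in (0, 1)
y_class <- as.integer(p_hat > 0.5)           # threshold to get a class label
range(p_hat)
```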

Prediction — From Probability to Class

Confusion Matrix

           Pred=1                Pred=0
True=1     True Positive (TP)    False Negative (FN)
True=0     False Positive (FP)   True Negative (TN)

Some Useful Measures
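Given true labels and thresholded predictions, the confusion matrix and the usual rates can be computed directly; a sketch with made-up label vectors:

```r
# Build a confusion matrix and compute common performance measures
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)
pred  <- c(1, 0, 1, 0, 1, 0, 0, 1)

cm <- table(True = truth, Pred = pred)
TP <- cm["1", "1"]; FN <- cm["1", "0"]
FP <- cm["0", "1"]; TN <- cm["0", "0"]

accuracy    <- (TP + TN) / sum(cm)  # overall correct rate
sensitivity <- TP / (TP + FN)       # true positive rate (recall)
specificity <- TN / (TN + FP)       # true negative rate
c(accuracy, sensitivity, specificity)  # 0.75 0.75 0.75 for these vectors
```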

ROC Curve

ROC Curve

AUC of ROC Curve

The overall performance of the model can also be assessed using the area under the curve (AUC) measure, which ranges from 0 (worst possible model) to 1 (perfect model); an AUC of 0.5 corresponds to random guessing.
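The AUC can be computed without any package via its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A sketch with made-up scores:

```r
# AUC = P(score of a random positive > score of a random negative),
# counting ties as one half
auc_manual <- function(truth, scores) {
  pos <- scores[truth == 1]; neg <- scores[truth == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

truth  <- c(0, 0, 1, 1, 1, 0, 1, 0)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3)
auc_manual(truth, scores)  # 0.9375 for these vectors
```

The pROC package's `roc()` and `auc()` functions give the same number and also draw the curve.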

Classification: K-nearest neighbor

Idea:

K-nearest neighbor

Distance-based Similarity Measures
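Base R's `dist()` computes the common distance measures between observations; a quick sketch on two points:

```r
# Pairwise distances between two observations under different metrics
x <- rbind(c(0, 0), c(3, 4))
dist(x, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")  # |3| + |4| = 7
```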

What is the impact of \(k\) – an example

Source: ESL, pp. 15–16

K-nearest neighbor algorithm

For example, to classify an email as spam or not spam, the K-Nearest Neighbors (KNN) algorithm examines the K most similar emails in the dataset.

It then checks the categories of these neighboring emails and assigns the new email to the category that appears most frequently, a process known as majority voting.
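This majority-vote rule is implemented in the `class` package's `knn()` function; a sketch on the iris data (the package choice and train/test split are illustrative):

```r
# k-nearest-neighbor classification with majority voting (class package)
library(class)
set.seed(2)
idx   <- sample(nrow(iris), 100)           # random training split
train <- iris[idx, 1:4]; test <- iris[-idx, 1:4]

pred <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])           # test-set accuracy
```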

Clustering – an unsupervised learning method

Application: Clustering Iris flowers in Iris dataset

This is a very famous dataset that appears in almost every data mining and machine learning course, and it ships with R as a built-in dataset. It consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica, and Iris versicolor). Four features (variables) were measured on each sample: the length and the width of the sepal and the petal, in centimeters. The dataset was introduced by Sir Ronald Fisher in 1936.

# Load the iris dataset
data("iris")
plot(x = iris$Petal.Length, y = iris$Sepal.Length,
     main = "Scatter Plot of Sepal Length vs Petal Length", # Plot title
     xlab = "Petal Length",                                # X-axis label
     ylab = "Sepal Length",                                # Y-axis label
     pch = 19                                             # Use solid circles for points
     )

An example – Iris data

An example – Iris data

K-means clustering – step-by-step

Run the following R code, and see what it does.

This is a random grouping (first step)
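The code referenced above is not reproduced in this text; a minimal sketch of what the random first-step grouping could look like, reusing the iris scatter plot from earlier:

```r
# Step 1 of k-means: assign every point to a random cluster and color by label
set.seed(3)
k <- 3
cluster <- sample(1:k, nrow(iris), replace = TRUE)  # random initial labels
plot(iris$Petal.Length, iris$Sepal.Length, col = cluster, pch = 19,
     xlab = "Petal Length", ylab = "Sepal Length",
     main = "Random initial grouping")
```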

The next step

Run the following R code, and see what it does.

Data points are regrouped (second step)

How does this happen?

Let’s repeat the second step

Note that the code does not change at all. Why?

Data points are regrouped again

We can keep repeating this step, until…

Members in each cluster do not change, which means the algorithm converges. How can we translate it into some numeric scores?

Statistics behind k-means clustering

The algorithm attempts to:

Instead of computing variance, we compute sum squared error (SSE).

\[ SSE(X)= \sum_{i=1}^{n}(X_i-\bar{X})^2 \]

\[ SSE(\mathbf{X}) = \sum_{i=1}^{n}\sum_{j=1}^{p}(X_{ij}-\bar{X}_j)^2 \]
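The matrix version of the SSE above can be checked by hand; a sketch on the iris measurements (note that `kmeans()` reports this quantity around the overall mean as `totss`, and its per-cluster analogue summed over clusters as `tot.withinss`):

```r
# Total SSE of a data matrix around its column means
sse <- function(X) sum(sweep(X, 2, colMeans(X))^2)

X <- as.matrix(iris[, 1:4])
sse(X)                                             # sum_i sum_j (X_ij - Xbar_j)^2
all.equal(sse(X), sum(scale(X, scale = FALSE)^2))  # same value via scale()
```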

Statistics behind k-means clustering

Recap

Here is a very good animation to illustrate the k-means clustering algorithm:

Visualizing K-means Clustering

K-means algorithm

  1. Randomly find \(k\) data points (observations) as the initial centers
  2. For each data point, find the closest center and label it. Now you have \(k\) clusters
  3. Re-calculate the centers of current clusters
  4. Repeat step 2 and step 3 until the centers do not change
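The four steps translate almost line for line into R. A bare-bones sketch (not the slides' code, and with no guard against a cluster becoming empty):

```r
# A minimal k-means, following steps 1-4 above
my_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]    # step 1: random centers
  labels <- rep(1L, nrow(X))
  for (iter in 1:max_iter) {
    # step 2: label each point with its closest center
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    labels <- max.col(-d, ties.method = "first")      # argmin distance per row
    # step 3: recompute the center of each current cluster
    new_centers <- apply(X, 2, function(col) tapply(col, labels, mean))
    # step 4: stop once the centers no longer move
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = labels, centers = centers)
}

set.seed(4)
res <- my_kmeans(iris[, 1:4], k = 3)
table(res$cluster, iris$Species)  # compare found clusters with true species
```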

Classification problems

Bias-Variance Tradeoff

Bias-Variance Tradeoff

Bias-Variance Tradeoff

Source: Elite Data Science

Bias-Variance Tradeoff

\[ \mathbb{E}\big[(y_0-\hat{f}(x_0))^2\big] = \sigma^2 + \mathrm{Bias}^2\big(\hat{f}(x_0)\big) + \mathrm{Var}\big(\hat{f}(x_0)\big) \]
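The decomposition can be checked by simulation. A sketch that repeatedly refits a deliberately misspecified linear model to fresh samples from a known quadratic truth (all values here are made up for illustration):

```r
# Estimate bias^2 and variance of a linear fit at one test point x0
set.seed(5)
f  <- function(x) x^2          # true regression function
x0 <- 0.9; sigma <- 0.3
x  <- seq(0, 1, length.out = 50)

preds <- replicate(2000, {
  y   <- f(x) + rnorm(length(x), sd = sigma)   # fresh training sample
  fit <- lm(y ~ x)                             # misspecified linear model
  predict(fit, newdata = data.frame(x = x0))
})

bias2    <- (mean(preds) - f(x0))^2            # squared bias at x0
variance <- var(preds)                         # variance of the fit at x0
c(bias2 = bias2, variance = variance)          # both add to the test error
```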

Model assessment (for supervised learning)

How do we know if the estimated model \(\hat{f}(x)\) is useful?

Prediction error

Accuracy illustration

K-fold cross-validation
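A sketch of k-fold cross-validation written out by hand, for a regression model (the fold count, dataset, and model are illustrative):

```r
# 5-fold cross-validation estimate of prediction error (MSE)
set.seed(6)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels

cv_mse <- sapply(1:k, function(j) {
  fit  <- lm(mpg ~ wt, data = mtcars[folds != j, ])   # train on k-1 folds
  pred <- predict(fit, newdata = mtcars[folds == j, ])
  mean((mtcars$mpg[folds == j] - pred)^2)             # error on held-out fold
})
mean(cv_mse)   # cross-validated MSE, averaged over the k folds
```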

Leave-one-out cross-validation

\[ CV_n=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2 \]

where \(h_i\) is the \(i\)-th diagonal element of the hat matrix.
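For linear models this shortcut avoids refitting \(n\) times; a sketch verifying it against explicit leave-one-out refitting (dataset and model are illustrative):

```r
# LOOCV via the hat-matrix shortcut vs. brute-force refitting
fit <- lm(mpg ~ wt, data = mtcars)
h   <- hatvalues(fit)                               # diagonal of the hat matrix
cv_shortcut <- mean((residuals(fit) / (1 - h))^2)

cv_brute <- mean(sapply(1:nrow(mtcars), function(i) {
  f <- lm(mpg ~ wt, data = mtcars[-i, ])            # refit without point i
  (mtcars$mpg[i] - predict(f, newdata = mtcars[i, ]))^2
}))
all.equal(cv_shortcut, cv_brute)                    # identical for linear models
```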

Summary
