We will use a Credit Card Default Data for this lab and illustration. The details of the data can be found at http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Think about what kind of factors could affect people to fail to pay their credit balance.
We first load the credit scoring data. It is easy to load comma-separated values (CSV).
credit_data <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/credit_default.csv", header=T)
Look at what information do we have.
colnames(credit_data)
## [1] "LIMIT_BAL" "SEX"
## [3] "EDUCATION" "MARRIAGE"
## [5] "AGE" "PAY_0"
## [7] "PAY_2" "PAY_3"
## [9] "PAY_4" "PAY_5"
## [11] "PAY_6" "BILL_AMT1"
## [13] "BILL_AMT2" "BILL_AMT3"
## [15] "BILL_AMT4" "BILL_AMT5"
## [17] "BILL_AMT6" "PAY_AMT1"
## [19] "PAY_AMT2" "PAY_AMT3"
## [21] "PAY_AMT4" "PAY_AMT5"
## [23] "PAY_AMT6" "default.payment.next.month"
Let’s look at how many people were actually default in this sample.
mean(credit_data$default.payment.next.month)
## [1] 0.2193333
The name of response variable is too long! I want to make it shorter by renaming. Recall the rename()
function.
library(dplyr)
credit_data<- rename(credit_data, default=default.payment.next.month)
How about the variable type and summary statistics?
str(credit_data) # structure - see variable type
summary(credit_data) # summary statistics
We see all variables are int, but we know that SEX, EDUCATION, MARRIAGE are categorical, we convert them to factor.
credit_data$SEX<- as.factor(credit_data$SEX)
credit_data$EDUCATION<- as.factor(credit_data$EDUCATION)
credit_data$MARRIAGE<- as.factor(credit_data$MARRIAGE)
We omit other EDA, but you shouldn’t whenever you are doing data analysis.
Randomly split the data to training (80%) and testing (20%) datasets:
index <- sample(nrow(credit_data),nrow(credit_data)*0.80)
credit_train = credit_data[index,]
credit_test = credit_data[-index,]
credit_glm0<- glm(default~., family=binomial, data=credit_train)
summary(credit_glm0)
##
## Call:
## glm(formula = default ~ ., family = binomial, data = credit_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2726 -0.6920 -0.5406 -0.2776 3.0644
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.100e+00 1.536e-01 -7.165 7.77e-13 ***
## LIMIT_BAL -6.301e-07 2.829e-07 -2.227 0.025918 *
## SEX2 -1.542e-01 5.459e-02 -2.825 0.004722 **
## EDUCATION2 -1.613e-01 6.305e-02 -2.559 0.010506 *
## EDUCATION3 -1.927e-01 8.554e-02 -2.253 0.024283 *
## EDUCATION4 -1.325e+00 3.501e-01 -3.785 0.000154 ***
## MARRIAGE2 -1.395e-01 6.169e-02 -2.261 0.023731 *
## MARRIAGE3 6.827e-02 2.375e-01 0.288 0.773709
## AGE 8.258e-03 3.325e-03 2.484 0.013005 *
## PAY_0 6.228e-01 3.150e-02 19.771 < 2e-16 ***
## PAY_2 9.455e-02 3.574e-02 2.646 0.008148 **
## PAY_3 8.688e-02 4.004e-02 2.170 0.030025 *
## PAY_4 6.026e-03 4.519e-02 0.133 0.893921
## PAY_5 5.805e-02 4.799e-02 1.210 0.226401
## PAY_6 -9.798e-03 4.001e-02 -0.245 0.806567
## BILL_AMT1 -5.755e-06 1.967e-06 -2.926 0.003429 **
## BILL_AMT2 4.598e-06 2.466e-06 1.864 0.062258 .
## BILL_AMT3 8.965e-07 2.069e-06 0.433 0.664839
## BILL_AMT4 -3.920e-06 2.445e-06 -1.604 0.108786
## BILL_AMT5 7.988e-07 2.957e-06 0.270 0.787070
## BILL_AMT6 2.231e-06 2.357e-06 0.946 0.343933
## PAY_AMT1 -1.106e-05 3.572e-06 -3.095 0.001965 **
## PAY_AMT2 -4.696e-06 2.998e-06 -1.566 0.117341
## PAY_AMT3 1.031e-06 2.465e-06 0.418 0.675820
## PAY_AMT4 -2.251e-06 2.783e-06 -0.809 0.418683
## PAY_AMT5 -4.360e-06 3.195e-06 -1.365 0.172324
## PAY_AMT6 -5.589e-06 2.546e-06 -2.196 0.028127 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10149 on 9599 degrees of freedom
## Residual deviance: 8809 on 9573 degrees of freedom
## AIC: 8863
##
## Number of Fisher Scoring iterations: 5
You have seen glm()
before. In this lab, this is the main function used to build logistic regression model because it is a member of generalized linear model. In glm()
, the only thing new is family
. It specifies the distribution of your response variable. You may also specify the link function after the name of distribution, for example, family=binomial(logit)
(default link is logit). You can also specify family=binomial(link = "probit")
to run probit regression. You may also use glm()
to build many other generalized linear models.
Similar to linear regression, we use predict()
function for prediction.
To get prediction from a logistic regression model, there are several steps you need to understand. Refer to textbook/slides for detailed math.
1.The fitted model ˆη=b0+b1x1+b2x2+... gives you the estimated value before the inverse of link (logit in case of logistic regression). In logistic regression the ˆη are called log odds ratio, which is log(P(y=1)/(1−P(y=1))). In R you use the predict() function to get a vector of all in-sample ˆη (for each training obs).
hist(predict(credit_glm0))
2.For each ˆη, in order to get the P(y=1), we can apply the inverse of the link function (logit here) to ˆη. The equation is P(y=1)=1/(1+exp(−ˆη)). In R you use the fitted() function or *predict(,type=“response”) to get the predicted probability for each training ob.
pred_resp <- predict(credit_glm0,type="response")
hist(pred_resp)
3.Last but not least, you want a binary classification decision rule. The default rule is if the fitted P(y=1)>0.5 then y=1. The value 0.5 is called cut-off probability. You can choose the cut-off probability based on mis-classification rate, cost function, etc. In this case, the cost function can indicate the trade off between the risk of giving loan to someone who cannot pay (predict 0, truth 1), and risk of rejecting someone who qualifys (predict 1, truth 0).
These tables illustrate the impact of choosing different cut-off probability. Choosing a large cut-off probability will result in few cases being predicted as 1, and chossing a small cut-off probability will result in many cases being predicted as 1.
table(credit_train$default, (pred_resp > 0.5)*1, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 7267 208
## 1 1546 579
table(credit_train$default, (pred_resp > 0.2)*1, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 4563 2912
## 1 625 1500
table(credit_train$default, (pred_resp > 0.0001)*1, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 1 7474
## 1 0 2125
pred_glm0_train<- predict(credit_glm0, type="response")
(IMPORTANT!!!) You have to specify type="response"
in order to get probability outcome, which is what we want. Otherwise, what it produces is the linear predictor term β0+β1X1+β2X2+…. Recall the lecture, how is this linear predictor related to probability?
In order to show give an overall measure of goodness of classification, using the Receiver Operating Characteristic (ROC) curve is one way. Rather than use an overall misclassification rate, it employs two measures – true positive fraction (TPF) and false positive fraction (FPF).
True positive fraction, TPF=TPTP+FN: is the proportion of true positives correctly predicted as positive.
False positive fraction, FPF=FPFP+TN=1−TNFP+TN: is the proportion of true negatives incorrectly predicted as positive.
install.packages('ROCR')
library(ROCR)
pred <- prediction(pred_glm0_train, credit_train$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7337419
Be careful that the function prediction()
is different from predict()
. It is in Package ROCR
, and is particularly used for preparing for ROC curve. Recall out lecture, this function basically calculates many confusion matrices with different cut-off probability. Therefore, it requires two vectors as inputs – predicted probability and observed response (0/1). The next line, performance()
calculates TPR and FPR based all confusion matrices you get from previous step. Then you can simply draw the ROC curve, which is a curve of FPR vs. TPR. The last line is to get AUC (area under the curve). I would recommend you to stick these four lines of code together, and use it to get ROC curve and AUC. If you don’t want to draw the ROC curve (because it takes time), just comment out plot line.
pred_glm0_test<- predict(credit_glm0, newdata = credit_test, type="response")
For out-of-sample prediction, you have to specify newdata="testing sample name"
.
pred <- prediction(pred_glm0_test, credit_test$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7026989
Precision and recall curve and its AUC is more appropriate for imbalanced data. We use package PRROC
to draw the PR curve. It can also draw the ROC curve. More details of the package can be found here.
install.packages("PRROC")
library(PRROC)
score1= pred_glm0_train[credit_train$default==1]
score0= pred_glm0_train[credit_train$default==0]
roc= roc.curve(score1, score0, curve = T)
roc$auc
## [1] 0.7337419
pr= pr.curve(score1, score0, curve = T)
pr
##
## Precision-recall curve
##
## Area under curve (Integral):
## 0.5244722
##
## Area under curve (Davis & Goadrich):
## 0.5244625
##
## Curve for scores from 1.165656e-05 to 0.9961144
## ( can be plotted with plot(x) )
plot(pr, main="In-sample PR curve")
# Out-of-sample prediction:
score1.test= pred_glm0_test[credit_test$default==1]
score0.test= pred_glm0_test[credit_test$default==0]
roc.test= roc.curve(score1.test, score0.test, curve = T)
roc.test$auc
## [1] 0.7026989
pr.test= pr.curve(score1.test, score0.test, curve = T)
pr.test
##
## Precision-recall curve
##
## Area under curve (Integral):
## 0.4475906
##
## Area under curve (Davis & Goadrich):
## 0.4475519
##
## Curve for scores from 0.0007718808 to 0.9952616
## ( can be plotted with plot(x) )
plot(pr.test, main="Out-of-sample PR curve")
Know how to use glm() to build logistic regression;
Know how to get ROC and AUC based on predicted probability;
Know how to get PR curve and AUC based on predicted probability;
EDA
Train logistic model
Prediction (ROC, AUC; PR, AUC)
Model comparison using AUC