1 Objective

The objective of this case is to help you understand logistic regression (binary classification) and some important ideas such as cross-validation, the ROC curve, and the cut-off probability.

2 Credit Card Default Data

We will use the Credit Card Default Data for this lab and illustration. The details of the data can be found at http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Think about what kinds of factors could cause people to fail to pay their credit balance.

We first load the credit scoring data. It is easy to load comma-separated values (CSV).

credit.data <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/credit_default.csv", header=T)

Let's look at what information we have.

colnames(credit.data)
##  [1] "LIMIT_BAL"                  "SEX"                       
##  [3] "EDUCATION"                  "MARRIAGE"                  
##  [5] "AGE"                        "PAY_0"                     
##  [7] "PAY_2"                      "PAY_3"                     
##  [9] "PAY_4"                      "PAY_5"                     
## [11] "PAY_6"                      "BILL_AMT1"                 
## [13] "BILL_AMT2"                  "BILL_AMT3"                 
## [15] "BILL_AMT4"                  "BILL_AMT5"                 
## [17] "BILL_AMT6"                  "PAY_AMT1"                  
## [19] "PAY_AMT2"                   "PAY_AMT3"                  
## [21] "PAY_AMT4"                   "PAY_AMT5"                  
## [23] "PAY_AMT6"                   "default.payment.next.month"

Let’s look at what proportion of people actually defaulted in this sample.

mean(credit.data$default.payment.next.month)
## [1] 0.2193333

The name of the response variable is too long! Let's make it shorter by renaming it. Recall the rename() function.

library(dplyr)
credit.data<- rename(credit.data, default=default.payment.next.month)

How about the variable type and summary statistics?

str(credit.data)    # structure - see variable type
summary(credit.data) # summary statistics

We see that all variables are int, but we know that SEX, EDUCATION, and MARRIAGE are categorical, so we convert them to factors.

credit.data$SEX<- as.factor(credit.data$SEX)
credit.data$EDUCATION<- as.factor(credit.data$EDUCATION)
credit.data$MARRIAGE<- as.factor(credit.data$MARRIAGE)

We omit the rest of the EDA here, but you shouldn't skip it when doing your own data analysis.


3 Logistic Regression

Randomly split the data into training (80%) and testing (20%) samples (call set.seed() first if you want a reproducible split):

index <- sample(nrow(credit.data),nrow(credit.data)*0.80)
credit.train = credit.data[index,]
credit.test = credit.data[-index,]

3.1 Train a logistic regression model with all variables

credit.glm0<- glm(default~., family=binomial, data=credit.train)
summary(credit.glm0)
## 
## Call:
## glm(formula = default ~ ., family = binomial, data = credit.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2153  -0.6919  -0.5357  -0.2877   3.2423  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.102e+00  1.542e-01  -7.147 8.86e-13 ***
## LIMIT_BAL   -5.454e-07  2.807e-07  -1.943 0.051960 .  
## SEX2        -1.389e-01  5.469e-02  -2.539 0.011113 *  
## EDUCATION2  -1.044e-01  6.341e-02  -1.647 0.099652 .  
## EDUCATION3  -1.548e-01  8.537e-02  -1.813 0.069820 .  
## EDUCATION4  -1.033e+00  3.120e-01  -3.311 0.000929 ***
## MARRIAGE2   -1.286e-01  6.199e-02  -2.074 0.038084 *  
## MARRIAGE3   -3.564e-01  2.576e-01  -1.384 0.166465    
## AGE          7.708e-03  3.338e-03   2.310 0.020914 *  
## PAY_0        5.991e-01  3.170e-02  18.899  < 2e-16 ***
## PAY_2        9.769e-02  3.603e-02   2.712 0.006698 ** 
## PAY_3        5.947e-02  4.069e-02   1.462 0.143856    
## PAY_4        4.429e-02  4.503e-02   0.984 0.325348    
## PAY_5        1.970e-02  4.768e-02   0.413 0.679467    
## PAY_6        1.865e-02  4.014e-02   0.465 0.642228    
## BILL_AMT1   -4.487e-06  1.904e-06  -2.357 0.018427 *  
## BILL_AMT2    1.195e-06  2.597e-06   0.460 0.645434    
## BILL_AMT3    3.145e-06  2.340e-06   1.344 0.178834    
## BILL_AMT4   -3.709e-06  2.464e-06  -1.505 0.132281    
## BILL_AMT5    2.964e-06  2.836e-06   1.045 0.295915    
## BILL_AMT6   -6.081e-07  2.245e-06  -0.271 0.786447    
## PAY_AMT1    -9.885e-06  3.588e-06  -2.755 0.005873 ** 
## PAY_AMT2    -1.014e-05  3.570e-06  -2.840 0.004517 ** 
## PAY_AMT3    -3.429e-06  3.360e-06  -1.020 0.307592    
## PAY_AMT4    -4.662e-06  3.100e-06  -1.504 0.132633    
## PAY_AMT5    -1.890e-06  3.164e-06  -0.597 0.550193    
## PAY_AMT6    -2.473e-06  2.122e-06  -1.166 0.243741    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10063.2  on 9599  degrees of freedom
## Residual deviance:  8788.5  on 9573  degrees of freedom
## AIC: 8842.5
## 
## Number of Fisher Scoring iterations: 6

You have seen glm() before. In this lab, it is the main function used to build the logistic regression model, because logistic regression is a member of the generalized linear model family. In glm(), the only new argument is family, which specifies the distribution of your response variable. You may also specify the link function after the name of the distribution, for example, family=binomial(logit) (the default link is logit). You can specify family=binomial(link = "probit") to run probit regression instead. You may also use glm() to build many other generalized linear models.
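For example, a probit fit of the same model would look like this (a sketch; the rest of the lab keeps the default logit link):

credit.glm.probit<- glm(default~., family=binomial(link = "probit"), data=credit.train)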

3.2 Get some criteria of model fitting

You can simply extract some criteria of the model fit, for example, the residual deviance (analogous to SSE in a linear regression model), AIC, and BIC. Unlike linear regression models, there is no \(R^2\) in logistic regression.

credit.glm0$deviance
## [1] 8788.516
AIC(credit.glm0)
## [1] 8842.516
BIC(credit.glm0)
## [1] 9036.093

3.3 Prediction

Similar to linear regression, we use predict() function for prediction.

To get prediction from a logistic regression model, there are several steps you need to understand. Refer to textbook/slides for detailed math.

1. The fitted model \(\hat{\eta} = b_0 +b_1 x_1 + b_2 x_2 + ...\) gives you the estimated value before applying the inverse of the link function (logit in the case of logistic regression). In logistic regression, \(\hat{\eta}\) is called the log odds, which is \(\log(P(y=1)/(1-P(y=1)))\). In R you use the predict() function to get a vector of all in-sample \(\hat{\eta}\) (one for each training observation).

hist(predict(credit.glm0))

2. For each \(\hat{\eta}\), in order to get \(P(y=1)\), we apply the inverse of the link function (the logistic function here) to \(\hat{\eta}\). The equation is \(P(y=1) = 1/(1+\exp(-\hat{\eta}))\). In R you use the fitted() function or predict(, type="response") to get the predicted probability for each training observation.

pred_resp <- predict(credit.glm0,type="response")
hist(pred_resp)
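
To see that these two steps agree, you can apply the inverse-logit formula to the linear predictor by hand (a quick sanity check, not part of the original lab):

eta <- predict(credit.glm0)        # linear predictor (log odds)
prob <- 1/(1 + exp(-eta))          # inverse logit by hand
all.equal(unname(prob), unname(pred_resp))   # should be TRUE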

3. Last but not least, you want a binary classification decision rule. The default rule is: if the fitted \(P(y=1) > 0.5\), then \(y = 1\). The value 0.5 is called the cut-off probability. You can choose the cut-off probability based on the misclassification rate, a cost function, etc. In this case, the cost function can reflect the trade-off between the risk of giving a loan to someone who cannot pay (predict 0, truth 1) and the risk of rejecting someone who qualifies (predict 1, truth 0).

These tables illustrate the impact of choosing different cut-off probabilities. Choosing a large cut-off probability will result in few cases being predicted as 1, and choosing a small cut-off probability will result in many cases being predicted as 1.

table(predict(credit.glm0,type="response") > 0.5)
## 
## FALSE  TRUE 
##  8858   742
table(predict(credit.glm0,type="response") > 0.2)
## 
## FALSE  TRUE 
##  5253  4347
table(predict(credit.glm0,type="response") > 0.0001)
## 
## FALSE  TRUE 
##     2  9598

3.3.1 In-sample prediction (less important)

pred.glm0.train<- predict(credit.glm0, type="response")

(IMPORTANT!!!) You have to specify type="response" in order to get the probability outcome, which is what we want. Otherwise, predict() produces the linear predictor term \(\beta_0+\beta_1X_1+\beta_2X_2+\dotso\). Recall from the lecture: how is this linear predictor related to the probability?

3.3.1.1 ROC Curve

The Receiver Operating Characteristic (ROC) curve is one way to give an overall measure of the goodness of classification. Rather than using a single overall misclassification rate, it employs two measures – the true positive fraction (TPF) and the false positive fraction (FPF).

True positive fraction, \(\text{TPF}=\frac{\text{TP}}{\text{TP+FN}}\): is the proportion of true positives correctly predicted as positive.

False positive fraction, \(\text{FPF}=\frac{\text{FP}}{\text{FP+TN}}=1-\frac{\text{TN}}{\text{FP+TN}}\): is the proportion of true negatives incorrectly predicted as positive.
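
As a concrete illustration (a small check, not in the original lab), you can compute TPF and FPF by hand at a single cut-off, say 0.5. The ROC curve simply traces these two quantities over all possible cut-offs:

pred01 <- (pred.glm0.train > 0.5)*1
TPF <- sum(credit.train$default==1 & pred01==1)/sum(credit.train$default==1)
FPF <- sum(credit.train$default==0 & pred01==1)/sum(credit.train$default==0)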

install.packages('ROCR')
library(ROCR)
pred <- prediction(pred.glm0.train, credit.train$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7267963

Be careful that the function prediction() is different from predict(). It is in the package ROCR, and is used specifically to prepare for the ROC curve. Recall from the lecture: this function essentially calculates many confusion matrices with different cut-off probabilities. Therefore, it requires two vectors as inputs – the predicted probability and the observed response (0/1). The next line, performance(), calculates TPR and FPR based on all the confusion matrices you get from the previous step. Then you can simply draw the ROC curve, which plots TPR against FPR. The last line gets the AUC (area under the curve). I recommend keeping these four lines of code together and reusing them to get the ROC curve and AUC. If you don't want to draw the ROC curve (because it takes time), just comment out the plot line.
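
If you follow that advice, you can bundle the four lines into a small helper function (a sketch; the name get_auc is my own choice, not from the lab):

get_auc <- function(prob, obs, draw = TRUE){
  pred <- ROCR::prediction(prob, obs)      # confusion matrices at all cut-offs
  if (draw) plot(ROCR::performance(pred, "tpr", "fpr"), colorize=TRUE)
  unlist(slot(ROCR::performance(pred, "auc"), "y.values"))   # return the AUC
}
# usage: get_auc(pred.glm0.train, credit.train$default)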

3.3.2 Out-of-sample prediction (more important)

pred.glm0.test<- predict(credit.glm0, newdata = credit.test, type="response")

For out-of-sample prediction, you have to set newdata to the testing sample (here, newdata = credit.test).

3.3.2.1 ROC Curve

pred <- prediction(pred.glm0.test, credit.test$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7307414

3.3.3 (Optional) Precision-Recall Curve

The precision-recall (PR) curve and its AUC are more appropriate for imbalanced data. We use the package PRROC to draw the PR curve; it can also draw the ROC curve. More details can be found in the PRROC package documentation.

install.packages("PRROC")
library(PRROC)
score1= pred.glm0.train[credit.train$default==1]
score0= pred.glm0.train[credit.train$default==0]
roc= roc.curve(score1, score0, curve = T)
roc$auc
## [1] 0.7267963
pr= pr.curve(score1, score0, curve = T)
pr
## 
##   Precision-recall curve
## 
##     Area under curve (Integral):
##      0.5147144 
## 
##     Area under curve (Davis & Goadrich):
##      0.5147039 
## 
##     Curve for scores from  9.83699e-09  to  0.9950685 
##     ( can be plotted with plot(x) )
plot(pr, main="In-sample PR curve")

# Out-of-sample prediction: 
score1.test= pred.glm0.test[credit.test$default==1]
score0.test= pred.glm0.test[credit.test$default==0]
roc.test= roc.curve(score1.test, score0.test, curve = T)
roc.test$auc
## [1] 0.7307414
pr.test= pr.curve(score1.test, score0.test, curve = T)
pr.test
## 
##   Precision-recall curve
## 
##     Area under curve (Integral):
##      0.4938487 
## 
##     Area under curve (Davis & Goadrich):
##      0.4938215 
## 
##     Curve for scores from  6.20749e-05  to  0.9941786 
##     ( can be plotted with plot(x) )
plot(pr.test, main="Out-of-sample PR curve")


3.4 Binary Classification

As we discussed in the lecture, people may be more interested in the classification results. But we have to define a cut-off probability first.

These tables illustrate the impact of choosing different cut-off probabilities. Choosing a large cut-off probability will result in few cases being predicted as 1, and choosing a small cut-off probability will result in many cases being predicted as 1.

table((pred.glm0.train > 0.9)*1)
## 
##    0    1 
## 9580   20
table((pred.glm0.train > 0.5)*1)
## 
##    0    1 
## 8858  742
table((pred.glm0.train > 0.2)*1)
## 
##    0    1 
## 5253 4347
table((pred.glm0.train > 0.0001)*1)
## 
##    0    1 
##    2 9598

Therefore, determining the optimal cut-off probability is crucial. The simplest way to determine the cut-off is to use the proportion of “1” in the original data. We will then introduce a more appropriate way to determine the optimal p-cut.

3.4.1 Naive Choice of Cut-off probability

The simplest way is to choose the event proportion in the training sample. This is roughly reasonable because the sample proportion is an estimate of the mean probability of \(Y=1\).

pcut1<- mean(credit.train$default)

Based on this cut-off probability, we can obtain the binary prediction (predicted classification) and the confusion matrix.

# get binary prediction
class.glm0.train<- (pred.glm0.train>pcut1)*1
# get confusion matrix
table(credit.train$default, class.glm0.train, dnn = c("True", "Predicted"))
##     Predicted
## True    0    1
##    0 5309 2200
##    1  730 1361

In the table() function, both vectors must be binary in order to get a confusion matrix (it is essentially a pivot table or contingency table). The argument dnn specifies the row and column names of this 2×2 table. The first input vector is the observed (true) response, so the first dimension name should be "True" accordingly.

Then it is easy to get different types of classification error rates, i.e., the false positive rate (FPR), the false negative rate (FNR), and the overall misclassification rate (MR). Commonly, you can use the overall MR as the cost (a criterion) to evaluate the model prediction.

# (equal-weighted) misclassification rate
MR<- mean(credit.train$default!=class.glm0.train)
# False positive rate
FPR<- sum(credit.train$default==0 & class.glm0.train==1)/sum(credit.train$default==0)
# False negative rate (exercise: fill in, analogous to FPR above)
# FNR <- 

3.4.2 Determine Optimal cut-off Probability using Grid Search Method

Recall from the lecture that different p-cuts result in different confusion matrices, hence different MRs (or costs). You need to search over all possible p-cuts to find the one that provides the minimum cost. The first step is to define a symmetric/asymmetric cost function (weighted misclassification rate) as a function of the cut-off. What else is needed to calculate the MR? The answer: the observed \(Y\) and the predicted probability.

# define a cost function with "obs" being the observed response,
# "pred.p" the predicted probability, and "pcut" the threshold
costfunc = function(obs, pred.p, pcut){
    weight1 = 5   # define the weight for "true=1 but pred=0" (FN)
    weight0 = 1    # define the weight for "true=0 but pred=1" (FP)
    c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (FN)
    c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (FP)
    cost = mean(weight1*c1 + weight0*c0)  # misclassification with weight
    return(cost) # you have to return a value when you write R functions
} # end of the function
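
For instance, you can evaluate this cost at a single cut-off before running the full grid search (a quick usage example):

costfunc(obs = credit.train$default, pred.p = pred.glm0.train, pcut = 0.5)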

Next, define a sequence of probabilities (you will search for the optimal p-cut over this sequence).

# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01) 

Then, you need to calculate the cost (as you defined before) for each probability in the sequence p.seq.

# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
    cost[i] = costfunc(obs = credit.train$default, pred.p = pred.glm0.train, pcut = p.seq[i])  
} # end of the loop

Last, draw a plot with cost against p.seq, and find the p-cut that gives you the minimum cost.

# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, cost)

# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]

3.4.2.1 Use the optimal cut-off probability

Now let's calculate MR, FPR, FNR, and the cost based on the optimal cut-off.

# step 1. get binary classification
class.glm0.train.opt<- (pred.glm0.train>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(credit.train$default, class.glm0.train.opt, dnn = c("True", "Predicted"))
##     Predicted
## True    0    1
##    0 5393 2116
##    1  740 1351
MR<- mean(credit.train$default!= class.glm0.train.opt)
FPR<- sum(credit.train$default==0 & class.glm0.train.opt==1)/sum(credit.train$default==0)
FNR<- sum(credit.train$default==1 & class.glm0.train.opt==0)/sum(credit.train$default==1)
cost<- costfunc(obs = credit.train$default, pred.p = pred.glm0.train, pcut = optimal.pcut.glm0)  

3.4.2.2 Exercise:

  • Change the weights to different values, and see how your optimal cut-off changes.

  • Obtain the confusion matrix and calculate the (asymmetric) cost based on the optimal cut-off.

  • Find optimal cut-off probability using symmetric cost.

  • Calculate MR and cost, what do you find?

  • Further, rewrite the cost function to make the weights (or the ratio of two weights) as input parameter.

  • Use F-score to determine the optimal cut-off.

3.4.3 Out-of-sample Classification

Everything you have done about classification so far has been for the training sample. Now let's move to the testing sample. Keep in mind the principle: the testing sample is used only for evaluating your model's prediction accuracy! There is NO NEED TO CHOOSE A CUT-OFF PROBABILITY at this stage – you reuse the cut-off obtained from the training sample, as sketched below.
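
A minimal sketch of this step, assuming the optimal cut-off found above from the grid search (the exercise below asks for the remaining measures):

# apply the training-sample optimal cut-off to the testing-sample probabilities
class.glm0.test.opt<- (pred.glm0.test > optimal.pcut.glm0)*1
table(credit.test$default, class.glm0.test.opt, dnn = c("True", "Predicted"))
MR.test<- mean(credit.test$default != class.glm0.test.opt)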

3.4.3.1 Exercise:

  • Calculate MR, FPR, and FNR based on the optimal cut-off you got from the training sample with weights (5:1)

  • Calculate the asymmetric cost based on the optimal cut-off you got from the training sample with weights (5:1)

  • Calculate the above statistics based on the cut-off you got from the training sample with symmetric weights (1:1)


3.5 Variable Selection

3.5.1 Variable Selection with Stepwise Approach

We can use the same variable selection procedures as for linear regression models, i.e., forward, backward, and stepwise. Caution: this will take a long time, since the number of predictor variables is not small and the sample size is large.

credit.glm.back <- step(credit.glm0) # backward selection (if you don't specify anything)
summary(credit.glm.back)
credit.glm.back$deviance
AIC(credit.glm.back)
BIC(credit.glm.back)

You can also try model selection with BIC (it usually results in a simpler model than the AIC criterion):

credit.glm.back.BIC <- step(credit.glm0, k=log(nrow(credit.train))) 
summary(credit.glm.back.BIC)
credit.glm.back.BIC$deviance
AIC(credit.glm.back.BIC)
BIC(credit.glm.back.BIC)

Exercise: Try forward and stepwise selection procedures to see if they deliver the same best model.

3.5.1.1 Exercise:

  • Get ROC curve and AUC for both training and testing sample

  • Using training sample, find the optimal cut-off by grid search method with asymmetric cost (weights ratio = 5:1)

  • Calculate MR, FPR, FNR, and the asymmetric cost for both training and testing samples.


3.5.2 Variable selection with LASSO

Be careful: LASSO requires x to be a numeric matrix. Therefore, we need to manually convert the categorical variables (“SEX”, “EDUCATION”, and “MARRIAGE”) to dummy variables. Only if you have evidence that a categorical variable has a monotonic relationship with the response can you directly convert it to numeric using as.numeric() – for example, if the probability of default increases/decreases as the EDUCATION level goes from 1 to 4. This can be checked from the two-way contingency table by calculating the default proportion at each education level, as sketched below.
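
A quick way to run that check is prop.table() (a small addition to the lab):

# default proportion within each EDUCATION level (rows sum to 1)
prop.table(table(credit.data$EDUCATION, credit.data$default), margin = 1)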

Here I will show how to convert categorical variable to dummy variables.

dummy<- model.matrix(~ ., data = credit.data)
# look at first few rows of data
head(dummy)

The function model.matrix() automatically converts categorical variables to dummy variables. It also creates a column of 1s, which we don't need at this time. That column of 1s is used for estimating the intercept if you write the algorithm yourself, but most available functions automatically create that column during estimation.

credit.data.lasso<- data.frame(dummy[,-1])

Now let’s prepare the data for LASSO.

#index <- sample(nrow(credit.data),nrow(credit.data)*0.80)
credit.train.X = as.matrix(select(credit.data.lasso, -default)[index,])
credit.test.X = as.matrix(select(credit.data.lasso, -default)[-index,])
credit.train.Y = credit.data.lasso[index, "default"]
credit.test.Y = credit.data.lasso[-index, "default"]
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0
credit.lasso<- glmnet(x=credit.train.X, y=credit.train.Y, family = "binomial")

Perform cross-validation to determine the shrinkage parameter.

credit.lasso.cv<- cv.glmnet(x=credit.train.X, y=credit.train.Y, family = "binomial", type.measure = "class")
plot(credit.lasso.cv)

For logistic regression, we can specify type.measure="class" so that the CV error will be the misclassification error.
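
Alternatively, if you care more about ranking performance than classification error, cv.glmnet also accepts type.measure="auc" for binomial models (an optional variant, not used in the rest of the lab):

credit.lasso.cv.auc<- cv.glmnet(x=credit.train.X, y=credit.train.Y, family = "binomial", type.measure = "auc")
plot(credit.lasso.cv.auc)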

Get the coefficients at the optimal \(\lambda\). Here lambda.min is the \(\lambda\) that minimizes the CV error, while lambda.1se is the largest \(\lambda\) whose CV error is within one standard error of the minimum (it gives a sparser model).

coef(credit.lasso, s=credit.lasso.cv$lambda.min)
## 27 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.113737e+00
## LIMIT_BAL   -5.420344e-07
## SEX2        -1.353716e-01
## EDUCATION2  -9.552351e-02
## EDUCATION3  -1.432761e-01
## EDUCATION4  -9.970364e-01
## MARRIAGE2   -1.238182e-01
## MARRIAGE3   -3.366212e-01
## AGE          7.509426e-03
## PAY_0        5.997878e-01
## PAY_2        9.585219e-02
## PAY_3        6.169264e-02
## PAY_4        4.285270e-02
## PAY_5        1.955502e-02
## PAY_6        1.949307e-02
## BILL_AMT1   -3.244624e-06
## BILL_AMT2    9.056760e-10
## BILL_AMT3    2.200039e-06
## BILL_AMT4   -1.691168e-06
## BILL_AMT5    1.159253e-06
## BILL_AMT6    .           
## PAY_AMT1    -8.559325e-06
## PAY_AMT2    -9.489678e-06
## PAY_AMT3    -4.272502e-06
## PAY_AMT4    -3.374921e-06
## PAY_AMT5    -2.345765e-06
## PAY_AMT6    -2.284430e-06
coef(credit.lasso, s=credit.lasso.cv$lambda.1se)
## 27 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.190839e+00
## LIMIT_BAL   -4.690042e-07
## SEX2        -1.088409e-01
## EDUCATION2  -1.729384e-02
## EDUCATION3  -3.274606e-02
## EDUCATION4  -7.405616e-01
## MARRIAGE2   -8.823516e-02
## MARRIAGE3   -1.819042e-01
## AGE          5.764235e-03
## PAY_0        5.971749e-01
## PAY_2        9.141015e-02
## PAY_3        6.276862e-02
## PAY_4        3.911709e-02
## PAY_5        1.803554e-02
## PAY_6        1.706461e-02
## BILL_AMT1   -1.848502e-06
## BILL_AMT2    .           
## BILL_AMT3    .           
## BILL_AMT4    .           
## BILL_AMT5    .           
## BILL_AMT6    .           
## PAY_AMT1    -6.216713e-06
## PAY_AMT2    -5.755792e-06
## PAY_AMT3    -3.937680e-06
## PAY_AMT4    -1.602049e-06
## PAY_AMT5    -1.599517e-06
## PAY_AMT6    -1.489924e-06

3.5.2.1 Prediction

# in-sample prediction
pred.lasso.train<- predict(credit.lasso, newx=credit.train.X, s=credit.lasso.cv$lambda.1se, type = "response")
# out-of-sample prediction
pred.lasso.test<- predict(credit.lasso, newx=credit.test.X, s=credit.lasso.cv$lambda.1se, type = "response")

3.5.2.2 Exercise:

  • Get ROC curve and AUC for both training and testing sample

  • Using training sample, find the optimal cut-off by grid search method with asymmetric cost (weights ratio = 5:1)

  • Calculate MR, FPR, FNR, and the asymmetric cost for both training and testing samples.

3.6 Cross validation

Refer to lecture slides and the Elements of Statistical Learning book (Section 7.10) for more advice on cross-validation. Below are two example cost functions (symmetric and asymmetric) that take the cut-off as an explicit argument:

pcut = 0.5
#Symmetric cost
cost1 <- function(r, pi, pcut){
  mean(((r==0)&(pi>pcut)) | ((r==1)&(pi<pcut)))
}

#Asymmetric cost
cost2 <- function(r, pi, pcut){
  weight1 = 2
  weight0 = 1
  c1 = (r==1)&(pi<pcut) #logical vector - true if actual 1 but predict 0
  c0 = (r==0)&(pi>pcut) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}

You should read the cv.glm help to understand how it works. We can use the same cost function as defined before, but you need to modify it so that there are only two inputs: the observed \(Y\) and the predicted probability.

costfunc = function(obs, pred.p){  # note: pcut is taken from the enclosing environment
    weight1 = 5   # define the weight for "true=1 but pred=0" (FN)
    weight0 = 1   # define the weight for "true=0 but pred=1" (FP)
    c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (FN)
    c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (FP)
    cost = mean(weight1*c1 + weight0*c0)  # misclassification with weight
    return(cost) # you have to return a value when you write R functions
} # end of the function

Then you need to assign a value to pcut.

pcut = optimal.pcut.glm0  

Now run 10-fold cross-validation. Note that you should use the full data for cross-validation. In cv.glm, the default cost function is the average squared error.

library(boot)
credit.glm1<- glm(default~. , family=binomial, data=credit.data);  
cv.result = cv.glm(data=credit.data, glmfit=credit.glm1, cost=costfunc, K=10) 
cv.result$delta[2]
## [1] 0.6184

The first component of delta is the raw cross-validation estimate of prediction error. The second component is the adjusted cross-validation estimate. The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation.
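
You can print both components at once:

cv.result$delta   # c(raw CV estimate, adjusted CV estimate)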

Keep in mind that the CV score is the averaged model error – here, the cost you defined before. You may also use the F-score for cross-validation, but again, you need to define an F-score function. (Exercise!)


4 (Optional) Two-way contingency table and Chi-square test

A two-way contingency table is a very useful tool for exploring the relationship between categorical variables. It is essentially the simplest pivot table (see the example below). Often, after you create a two-way contingency table, a Chi-square test is used to test whether X affects Y. The null hypothesis is that X and Y are independent (e.g., MARRIAGE has nothing to do with the likelihood of default).

The \(\chi^2\) test statistic is defined as \[\chi^2=\sum \frac{(observed-expected)^2}{expected},\] where the expected count is calculated by assuming the row variable has nothing to do with the column variable.

Here is a very good tutorial on the Chi-square test: https://www.youtube.com/watch?v=WXPBoFDqNVk.

table.edu<- table(credit.data$EDUCATION, credit.data$default)
table.edu
##    
##        0    1
##   1 3399  829
##   2 4298 1298
##   3 1501  491
##   4  170   14
chisq.test(table.edu)
## 
##  Pearson's Chi-squared test
## 
## data:  table.edu
## X-squared = 49.19, df = 3, p-value = 1.189e-10
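
To inspect the ingredients of this statistic (a small addition), chisq.test() also stores the expected counts, and the row proportions show the direction of the association:

chisq.test(table.edu)$expected      # expected counts under independence
prop.table(table.edu, margin = 1)   # default proportion at each education level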

What we see from the above test result is that the p-value < 0.05. What is your conclusion?


5 Summary

5.1 Things to remember

  • Know how to use glm() to build logistic regression;

  • Know how to get ROC and AUC based on predicted probability;

  • Know how to get PR curve and AUC based on predicted probability;

  • Know how to find optimal cut-off;

  • Know how to do binary classification, and calculation of MR, FPR, FNR, and cost;

  • Know how to use LASSO for logistic regression

5.2 Guide for Assignment

  • EDA

  • Train logistic model

  • Prediction (ROC, AUC; PR, AUC)

  • Model comparison using AUC

  • Find optimal cut-off based on training sample

  • Classification – Obtain the confusion matrix and calculate the MR, asymmetric cost, FPR, Precision, Recall, and F-score for both training and testing sample based on (1) naive cut-off determined by sample proportion; (2) optimal cut-off determined by asymmetric cost; (3) optimal cut-off determined by F-score.

  • Build new models by Variable Selection

  • Calculate all criteria

  • Comprehensive comparison


6 Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for variable description. Notice that “It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).” Define your cost function accordingly!

german_credit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")

colnames(german_credit)=c("chk_acct","duration","credit_his","purpose","amount","saving_acct","present_emp","installment_rate","sex","other_debtor","present_resid","property","age","other_install","housing","n_credits","job","n_people","telephone","foreign","response")

# original response coding: 1 = good, 2 = bad
# we need: 0 = good, 1 = bad
german_credit$response = german_credit$response - 1
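
A sketch of the corresponding asymmetric cost function, mirroring costfunc from Section 3.4.2 (the weights 5 and 1 come from the quoted statement; the name german.cost is my own):

german.cost = function(obs, pred.p, pcut){
  weight1 = 5   # cost of classing a bad customer (true=1) as good (pred=0)
  weight0 = 1   # cost of classing a good customer (true=0) as bad (pred=1)
  mean(weight1*((obs==1)&(pred.p<pcut)) + weight0*((obs==0)&(pred.p>=pcut)))
}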
