Cross validation is an alternative approach to training/testing split. For k-fold cross validation, the dataset is divided into k parts. Each part serves as the test set in each iteration and the rest serve as training set. The out-of-sample performance measures from the k iterations are averaged. Instead of fitting a model on a pre-specified 90% training sample and evaluate the MSPE on the hold-out 10% testing sample, it is more reliable to use cross-validation for out-of-sample performance evaluation. For k-fold cross-validation, the dataset is divided into k parts (equal sample size). Each part serves as the testing sample in and the rest (k-1 together) serves as training sample. This training/testing procedure is iteratively performed k times. The CV score is usually the average of the metric of out-of-sample performance across k iterations.

Note

We use the

**entire**dataset for cross validationWe need to use glm instead of lm to fit the model (if we want to use cv.glm fucntion in boot package)

The default measure of performance is the Mean Squared Error (MSE). If we want to use another measure we need to define a cost function.

The `cv.glm`

is the cross validation approach in R for glm (more details of arguments are in help doc). By comparing the cross-validation estimate of prediction error, we can tell the full model outperforms the model with only `indus`

and `rm`

in terms of prediction error.

```
library(MASS)
data(Boston); #this data is in MASS package
library(boot)
model_full <- glm(medv~., data = Boston)
cv.glm(data = Boston, glmfit = model_full, K = 10)$delta[2]
```

`## [1] 24.02774`

```
model_2 <- glm(medv~indus + rm, data = Boston)
cv.glm(data = Boston, glmfit = model_2, K = 10)$delta[2]
```

`## [1] 39.70648`

Here we need to define a MAE cost function. The function takes 2 input vectors, pi = predicted values, r = actual values.

```
MAE_cost <- function(pi, r){
return(mean(abs(pi-r)))
}
cv.glm(data = Boston, glmfit = model_full, cost = MAE_cost, K = 10)$delta[2]
```

`## [1] 3.368605`

`cv.glm(data = Boston, glmfit = model_2, cost = MAE_cost, K = 10)$delta[2]`

`## [1] 4.274093`

The same finding is observed by conducting LOOCV.

`cv.glm(data = Boston, glmfit = model_full, K = nrow(Boston))$delta[2]`

`## [1] 23.72388`

`cv.glm(data = Boston, glmfit = model_2, K = nrow(Boston))$delta[2]`

`## [1] 39.94028`

Using 10-fold cross-validation to search optimal lambda:

```
library(glmnet)
lasso_fit <- glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1)
#use 10-fold cross validation to pick lambda
cv_lasso_fit = cv.glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1, nfolds = 10)
plot(cv_lasso_fit)
```

The best \(\lambda\) (or *s*) is given by:

`cv_lasso_fit$lambda.min`

`## [1] 0.02118502`

Given a selected *s* you can use *predict()* this way to get prediction:

`coef(lasso_fit, s = cv_lasso_fit$lambda.min)`

```
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 34.880894908
## crim -0.100714832
## zn 0.042486737
## indus .
## chas 2.693903097
## nox -16.562664327
## rm 3.851646315
## age .
## dis -1.419168850
## rad 0.263725830
## tax -0.010286456
## ptratio -0.933927773
## black 0.009089735
## lstat -0.522521473
```

```
pred_IS <- predict(lasso_fit, as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), s = cv_lasso_fit$lambda.min)
MAE_cost(pred_IS, Boston$medv)
```

`## [1] 3.259372`

Another package DAAG also does cross validation. It prints out the performance in each fold and gives you a plot at the end. But currently I cannot figure out how to get the cross-validation error programmatically.

`install.packages('DAAG')`

`library(DAAG)`

```
model_2 <- lm(medv~indus + rm, data = Boston)
cv3 <- cv.lm(data = Boston, form.lm = model_2, m=3)
```