mcycle
is a data frame giving a series of measurements of head acceleration in a simulated motorcycle accident, used to test crash helmets. This is a univariate case: we are interested in the relationship between the time in milliseconds after impact and the head acceleration.
library(MASS)
data('mcycle')
str(mcycle)
summary(mcycle)
# Rename the variables for ease of usage
Y <- mcycle$accel
X <- mcycle$times
#Scatterplot
plot(Y~X, xlab="time",ylab="Acceleration", main="Scatterplot of Acceleration against Time")
As a first attempt, we can assume a linear relationship and fit a simple linear regression of acceleration on time.
lm_mod <- lm(Y~X, data= mcycle)
summary(lm_mod)
Fitted Regression Line
But once we overlay the fitted line on the scatterplot, it is clear that the linearity assumption is violated.
plot(X, Y, xlab="Times", ylab="Acceleration", main="Simple Linear Regression Line")
abline(lm_mod, col="blue", lwd = 1)
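A residual-versus-fitted plot for this model (a quick sketch using base R's plot method for lm objects) shows the same systematic pattern:
# Residuals vs. fitted values; the pronounced pattern indicates the linear fit is inadequate
plot(lm_mod, which = 1)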
If we assume there could be a polynomial relationship, we can try polynomial regression. The coefficients can be easily estimated using least squares linear regression because this is just a standard linear model with predictors \(x, x^2, x^3, \dots, x^d\).
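Concretely, the degree-\(d\) polynomial regression model is
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i, \]
which is linear in the coefficients \(\beta_0, \dots, \beta_d\) and can therefore be fit with lm().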
We can conduct a quadratic regression by simply adding I(X^2) to the formula:
quad_mod <- lm(Y~X+I(X^2), data=mcycle)
summary(quad_mod)
##
## Call:
## lm(formula = Y ~ X + I(X^2), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.527 -30.817 9.589 29.210 104.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.22700 15.49471 -0.983 0.32757
## X -2.30749 1.20439 -1.916 0.05757 .
## I(X^2) 0.05935 0.02038 2.912 0.00422 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.06 on 130 degrees of freedom
## Multiple R-squared: 0.1437, Adjusted R-squared: 0.1306
## F-statistic: 10.91 on 2 and 130 DF, p-value: 4.167e-05
The fitted quadratic curve captures the curvature somewhat better than the linear regression.
plot(X ,Y ,xlab="Times", main = "Quadratic",ylab="Acceleration",cex=.5)
lines(X,quad_mod$fitted.values, col="blue", lwd = 1)
Is this model superior to the simple linear regression model?
Because the two models are nested, we can answer this question with an F-test, using anova() to compare the fits.
anova(lm_mod,quad_mod)
We can also try a higher-order polynomial regression, for example a fifth-degree polynomial.
poly_mod <- lm(Y~poly(X,5,raw=T),data=mcycle)
summary(poly_mod)
##
## Call:
## lm(formula = Y ~ poly(X, 5, raw = T), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.271 -21.285 0.975 25.386 82.371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.059e+02 3.499e+01 -3.026 0.003 **
## poly(X, 5, raw = T)1 4.978e+01 1.036e+01 4.804 4.31e-06 ***
## poly(X, 5, raw = T)2 -6.359e+00 1.015e+00 -6.266 5.30e-09 ***
## poly(X, 5, raw = T)3 2.969e-01 4.253e-02 6.982 1.45e-10 ***
## poly(X, 5, raw = T)4 -5.742e-03 7.932e-04 -7.239 3.83e-11 ***
## poly(X, 5, raw = T)5 3.919e-05 5.406e-06 7.250 3.60e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.9 on 127 degrees of freedom
## Multiple R-squared: 0.5264, Adjusted R-squared: 0.5078
## F-statistic: 28.24 on 5 and 127 DF, p-value: < 2.2e-16
We can also assess model performance with the residual variance, R-squared, adjusted R-squared, AIC, and BIC:
#poly_mod_summary <- summary(poly_mod)
#(poly_mod_summary$sigma)^2
#poly_mod_summary$r.squared
#poly_mod_summary$adj.r.squared
#AIC(poly_mod)
#BIC(poly_mod)
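As a sketch (not part of the original analysis), we could also compare several polynomial degrees at once by their AIC values:
# AIC for polynomial fits of degree 1 through 6; lower is better
sapply(1:6, function(d) AIC(lm(Y ~ poly(X, d, raw = TRUE), data = mcycle)))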
plot(X ,Y ,xlab="Times", main = "Fifth-degree polynomial",ylab="Acceleration",cex=.5)
lines(X,poly_mod$fitted.values, col="blue", lwd = 1)
This fit captures the curvature on the left-hand side of the scatterplot better than either of the two models above.
To fit regression splines in R, we use the bs()
function from the splines package. By default, cubic splines are produced, that is, a cubic polynomial basis with no interior knots.
library(splines)
reg_sp <- lm(Y~bs(X),data=mcycle)
summary(reg_sp)
##
## Call:
## lm(formula = Y ~ bs(X), data = mcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.083 -28.772 2.935 31.004 94.108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.26 15.42 2.675 0.008433 **
## bs(X)1 -258.14 41.07 -6.285 4.66e-09 ***
## bs(X)2 110.67 28.33 3.906 0.000151 ***
## bs(X)3 -72.91 27.99 -2.605 0.010274 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40 on 129 degrees of freedom
## Multiple R-squared: 0.3303, Adjusted R-squared: 0.3147
## F-statistic: 21.21 on 3 and 129 DF, p-value: 3.134e-11
plot(X ,Y ,xlab="Times", main = "Regression Spline",ylab="Acceleration",cex=.5)
lines(X,reg_sp$fitted.values, col="blue", lwd = 1)
conf_interval <- predict(reg_sp, interval="confidence",
level = 0.95)
lines(X, conf_interval[,2], col="red", lty=2)
lines(X, conf_interval[,3], col="red", lty=2)
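Since bs(X) with no interior knots spans the same space as a global cubic polynomial, this fit should coincide with a degree-3 polynomial regression. A quick sanity check (a sketch, not part of the original analysis):
# Fitted values from the default cubic spline basis and from a cubic polynomial should agree
all.equal(fitted(reg_sp), fitted(lm(Y ~ poly(X, 3), data = mcycle)))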
You can also specify the knot locations.
#fit_sp=lm(Y~bs(X,knots=c(15.6,23.4,34.8)),data=mcycle)
#summary(fit_sp)
#AIC(fit_sp)
You can also specify the degrees of freedom.
reg_sp2 <- lm(Y~bs(X,df=10),data=mcycle)
plot(X ,Y ,xlab="Times", main = "Regression Spline with df=10",ylab="Acceleration",cex=.5)
lines(X,reg_sp2$fitted.values, col="blue", lwd = 1)
conf_interval <- predict(reg_sp2, interval="confidence",
level = 0.95)
lines(X, conf_interval[,2], col="red", lty=2)
lines(X, conf_interval[,3], col="red", lty=2)
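When df is specified, bs() places the interior knots at quantiles of X; with df = 10 and cubic degree this gives 10 - 3 = 7 interior knots. We can inspect them as a quick check (a sketch):
# Interior knot locations chosen by bs() for df = 10
attr(bs(X, df = 10), "knots")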
Next we fit natural cubic splines with ns(), where the degrees of freedom are pre-specified; we try several values to see which curve fits the data best.
# First: natural cubic spline with pre-specified degrees of freedom = 4
fit2 <- lm(Y~ns(X,df=4),data=mcycle)
plot(X ,Y,main= "Natural Cubic Spline with df=4", xlab="Times", ylab="Acceleration")
lines(X, fit2$fitted.values)
conf_interval <- predict(fit2, interval="confidence",
level = 0.95)
lines(X, conf_interval[,2], col="red", lty=2)
lines(X, conf_interval[,3], col="red", lty=2)
fit2c <- lm(Y~ns(X,df=10),data=mcycle)
plot(X ,Y , main= "Natural Cubic Spline with df=10", xlab="Times", ylab="Acceleration")
lines(X, fit2c$fitted.values)
conf_interval <- predict(fit2c, interval="confidence",
level = 0.95)
lines(X, conf_interval[,2], col="red", lty=2)
lines(X, conf_interval[,3], col="red", lty=2)
fit2d <- lm(Y~ns(X,df=20),data=mcycle)
plot(X ,Y, main= "Natural Cubic Spline with df=20", xlab="Times", ylab="Acceleration")
lines(X, fit2d$fitted.values)
conf_interval <- predict(fit2d, interval="confidence",
level = 0.95)
lines(X, conf_interval[,2], col="red", lty=2)
lines(X, conf_interval[,3], col="red", lty=2)
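To compare these natural spline fits with different degrees of freedom, one simple option (a sketch, not part of the original analysis) is to look at their AIC values:
# Compare the three natural cubic spline fits; lower AIC is preferred
AIC(fit2, fit2c, fit2d)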
We now turn to penalized splines, where a penalty/smoothing parameter controls the smoothness of the fit; many knots can be used, and their locations do not need to be chosen carefully. The smooth terms are specified with the s()
function inside a gam() formula from the mgcv package.
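A minimal sketch of a penalized spline fit with mgcv (the smoothing parameter is selected automatically; REML is used here as the selection method):
library(mgcv)
# Penalized regression spline; the smoothing parameter is estimated by REML
gam_mod <- gam(Y ~ s(X), data = mcycle, method = "REML")
summary(gam_mod)
plot(X, Y, xlab = "Times", ylab = "Acceleration", main = "Penalized Spline (mgcv)")
lines(X, gam_mod$fitted.values, col = "blue", lwd = 1)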