# Install necessary packages
list_packages <- c("AER", "dynlm", "tidyverse", "fpp", "fpp2",
"forecast", "readxl", "stargazer", "scales",
"quantmod", "urca", "vars", "tseries", "sarima")
new_packages <- list_packages[!(list_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load necessary packages
lapply(list_packages, require, character.only = TRUE)
So far, we have dealt with univariate time series data; what about multiple variables?
The time series models in the previous chapters allow for the inclusion of information from past observations, but not for the inclusion of other variables.
We considered regression models of the form \[ Z_t=\beta_0 + \beta_1 X_{1,t} + \cdots + \beta_k X_{k,t} + \epsilon_t \]
In other words, \(Z_t\) is a linear combination of \(X_{1,t}\) … \(X_{k,t}\). When \(\epsilon_t\) is assumed to be an uncorrelated error term (i.e., it is white noise), this is a typical multiple linear regression.
We further allow the errors from a regression to contain autocorrelation. To emphasise this change, we replace \(\epsilon_t\) with \(\eta_t\). The error series \(\eta_t\) is assumed to follow an ARIMA model. For example, if \(\eta_t\) follows an ARIMA(1,1,1) model, we can write
\[ Z_t=\beta_0 + \beta_1 X_{1,t} + \cdots + \beta_k X_{k,t} + \eta_t\\ (1-\phi_1 B)(1- B)\eta_t = (1-\theta_1 B) a_t \]
Notice that the model has two error terms here — the error from the regression model, which we denote by \(\eta_t\), and the error from the ARIMA model, which we denote by \(a_t\). Only the ARIMA model errors are assumed to be white noise.
Note that if \(X_{1,t}\),…,\(X_{k,t}\) are excluded from the model, the model becomes an ARIMA model.
We cannot use ordinary least squares to estimate this model because, with autocorrelated errors \(\eta_t\), the usual standard errors, t-tests, and information criteria are no longer valid, and the ARIMA parameters of the error process must be estimated jointly with the regression coefficients.
Instead, we use maximum likelihood estimation.
An important consideration when estimating a regression with ARMA errors is that all of the variables in the model must first be stationary. Thus, we first have to check that \(Z_t\) and all of the predictors appear to be stationary. If we estimate the model when any of these are non-stationary, the estimated coefficients will not be consistent estimates (and therefore may not be meaningful). One exception to this is the case where non-stationary variables are co-integrated. If there exists a linear combination of the non-stationary \(Z_t\) and the predictors that is stationary, then the estimated coefficients will be consistent. We will explain this later.
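For example, before fitting one might check each series with a unit-root test, or ask how many differences it needs. A minimal sketch using the tseries and forecast packages loaded above, on a simulated series (the object y is purely illustrative):
y <- cumsum(rnorm(200))   # a simulated random walk, for illustration only
adf.test(y)               # augmented Dickey-Fuller test; a small p-value suggests stationarity
ndiffs(y)                 # number of differences suggested to make the series stationary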
It is often desirable to maintain the form of the relationship between \(Z_t\) and the predictors, and consequently it is common to difference all of the variables if any of them need differencing. The resulting model is then called a “model in differences”, as distinct from a “model in levels”, which is what is obtained when the original data are used without differencing.
If the above regression model with ARIMA(1,1,1) errors is differenced, we obtain the model
\[ \nabla Z_t=\beta_1 \nabla X_{1,t} + \cdots + \beta_k \nabla X_{k,t} + \nabla \eta_t\\ (1-\phi_1 B) \nabla \eta_t = (1-\theta_1 B) a_t \]
Note that the constant \(\beta_0\) is eliminated by the differencing.
The R function Arima() will fit a regression model with ARIMA errors if the xreg argument is used; the order argument specifies the order of the ARIMA error model. The auto.arima() function will also handle regression terms via the xreg argument: the user must specify the predictor variables to include, but auto.arima() will select the best ARIMA model for the errors.
If differencing is required, then all variables are differenced during the estimation process, although the final model will be expressed in terms of the original variables. To include a constant in the differenced model, specify include.drift=TRUE.
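As a sketch of both interfaces on simulated data (x, y, fit_manual, and fit_auto are illustrative names only, not objects used later in this section):
set.seed(123)
x <- rnorm(100)                                          # illustrative predictor
y <- ts(10 + 2 * x + arima.sim(list(ar = 0.7), 100))     # illustrative response with AR(1) errors
fit_manual <- Arima(y, xreg = x, order = c(1, 1, 1), include.drift = TRUE)  # error order chosen by hand
fit_auto   <- auto.arima(y, xreg = x)                                        # error order selected automatically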
Here is an example.
We analyze the quarterly changes in personal consumption expenditure and personal disposable income from 1970 to 2016 Q3. We would like to forecast changes in expenditure based on changes in income.
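The uschange data from the fpp2 package contain these series; the first few rows shown below can be reproduced with, for example:
head(uschange)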
## Consumption Income Production Savings Unemployment
## 1970 Q1 0.6159862 0.9722610 -2.4527003 4.8103115 0.9
## 1970 Q2 0.4603757 1.1690847 -0.5515251 7.2879923 0.5
## 1970 Q3 0.8767914 1.5532705 -0.3587079 7.2890131 0.5
## 1970 Q4 -0.2742451 -0.2552724 -2.1854549 0.9852296 0.7
## 1971 Q1 1.8973708 1.9871536 1.9097341 3.6577706 -0.1
## 1971 Q2 0.9119929 1.4473342 0.9015358 6.0513418 -0.1
autoplot(uschange[,1:2], facets=TRUE) +
xlab("Year") + ylab("") +
ggtitle("Quarterly changes in US consumption and personal income")
The data are clearly already stationary (they are percentage changes rather than raw levels), so no differencing is required.
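The model output below regresses consumption on income with ARIMA errors; a sketch of how the fit object used in the rest of this example could have been created:
fit <- auto.arima(uschange[, "Consumption"], xreg = uschange[, "Income"])
fit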
## Series: uschange[, "Consumption"]
## Regression with ARIMA(1,0,2) errors
##
## Coefficients:
## ar1 ma1 ma2 intercept xreg
## 0.6922 -0.5758 0.1984 0.5990 0.2028
## s.e. 0.1159 0.1301 0.0756 0.0884 0.0461
##
## sigma^2 estimated as 0.3219: log likelihood=-156.95
## AIC=325.91 AICc=326.37 BIC=345.29
The fitted model is
\[ Z_t = 0.599 + 0.203 X_t + \eta_t,\\ \eta_t = 0.692 \eta_{t-1} + a_t - 0.576 a_{t-1} + 0.198 a_{t-2} \]
where \(a_t\) is white noise.
We can recover estimates of both the \(\eta_t\) and \(a_t\) series using the residuals() function.
cbind("Regression Errors" = residuals(fit, type="regression"),
"ARIMA errors" = residuals(fit, type="innovation")) %>%
autoplot(facets=TRUE)
It is the ARIMA errors that should resemble a white noise series.
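The Ljung-Box output below comes from checking these residuals, for example with:
checkresiduals(fit)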
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(1,0,2) errors
## Q* = 5.8916, df = 3, p-value = 0.117
##
## Model df: 5. Total lags used: 8
uschange %>%
as.data.frame() %>%
ggplot(aes(x=Income, y=Consumption)) +
ylab("Consumption (quarterly % change)") +
xlab("Income (quarterly % change)") +
geom_point() +
geom_smooth(method="lm", se=FALSE)
To forecast using a regression model with ARIMA errors, we need to forecast the regression part of the model and the ARIMA part of the model, and combine the results.
When the predictors are themselves unknown, we must either model them separately, or use assumed future values for each predictor.
We will calculate forecasts for the next 20 quarters (five years), assuming that the future percentage changes in personal disposable income will be equal to the mean percentage change from the last forty years.
fcast <- forecast(fit, xreg=rep(mean(uschange[,2]),20))
autoplot(fcast) + xlab("Year") +
ylab("Percentage change")
The prediction intervals for this model are narrower than those from a pure ARIMA model for \(Z_t\) (fitted below for comparison), because we are now able to explain some of the variation in the data using the income predictor.
fit2 <- Arima(uschange[,"Consumption"], order=c(1,0,2))
fcast2 <- forecast(fit2,h=20)
autoplot(fcast2) + xlab("Year") +
ylab("Percentage change")
It is important to realise that the prediction intervals from regression models (with or without ARIMA errors) do not take into account the uncertainty in the forecasts of the predictors. So they should be interpreted as being conditional on the assumed (or estimated) future values of the predictor variables.
Suppose we are interested in the level of gas prices (the Gas_prices_1.xlsx data used below) as a function of various explanatory variables.
We observe Gas Prices \(Y_t\) over \(n\) time periods, \(t = 1, 2, \dots, n\).
We can forecast \(Y\) purely from time and its own past values.
Alternatively, we can bring in possible explanatory variables, as in the regression with ARIMA errors fitted below: Personal Disposable Income, the Consumer Price Index, Unemployment, Housing Starts, and the price of crude oil.
# Import the data ---------------------------------------------------------
GasPrices<- readxl::read_xlsx(path = "Gas_prices_1.xlsx")
head(GasPrices)
## # A tibble: 6 x 29
## Date Year Month Unleaded Crude_price SP500 PDI CPI RR_sales Unemp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 15-J~ 1996 1 109. 18.9 636. 5660 154. 134926 5.6
## 2 15-F~ 1996 2 109. 19.1 640. 5713. 155. 136781 5.5
## 3 15-M~ 1996 3 114. 21.3 646. 5749. 156. 137527 5.5
## 4 15-A~ 1996 4 123. 23.5 654. 5733. 156. 137504 5.6
## 5 15-M~ 1996 5 128. 21.2 669. 5816 157. 138302 5.6
## 6 15-J~ 1996 6 126. 20.4 671. 5851. 157. 137881 5.3
## # ... with 19 more variables: Housing <dbl>, Demand <dbl>, Production <dbl>,
## # Imports <dbl>, Stocks <dbl>, Time <dbl>, L1_Unleaded <dbl>,
## # L1_Crude_price <dbl>, L1_SP500 <dbl>, L1_PDI <dbl>, L1_CPI <dbl>,
## # L1_RR_sales <dbl>, L1_Unemp <dbl>, L1_Housing <dbl>, L1_Demand <dbl>,
## # L1_Production <dbl>, L1_Imports <dbl>, L1_Stocks <dbl>, Recession <dbl>
if (!require("forecast")){install.packages("forecast")}
Unleaded <- ts(GasPrices$Unleaded, start=1996, frequency = 12, end=c(2015, 12))
# Transformation----------------------------------------------------------
log_Unleaded <- log(Unleaded)
# auto.arima
fit.auto <- auto.arima(log_Unleaded)
print(fit.auto)
## Series: log_Unleaded
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## 0.5588
## s.e. 0.0488
##
## sigma^2 estimated as 0.003014: log likelihood=354.82
## AIC=-705.64 AICc=-705.59 BIC=-698.69
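Checking the residuals of this model produces the Ljung-Box output below, which points to remaining autocorrelation; one way to reproduce it:
checkresiduals(fit.auto)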
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 46.458, df = 23, p-value = 0.002622
##
## Model df: 1. Total lags used: 24
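A plain linear regression of Unleaded on Crude_price, CPI, and Demand ignores any autocorrelation in the errors; the summary below can be reproduced with something like:
summary(lm(Unleaded ~ Crude_price + CPI + Demand, data = GasPrices))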
##
## Call:
## lm(formula = Unleaded ~ Crude_price + CPI + Demand, data = GasPrices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.549 -10.814 -1.954 7.357 65.137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -48.26392 19.52878 -2.471 0.0142 *
## Crude_price 2.30659 0.06870 33.575 < 2e-16 ***
## CPI 0.76253 0.09357 8.149 2.14e-14 ***
## Demand -0.14675 0.14335 -1.024 0.3070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.58 on 236 degrees of freedom
## Multiple R-squared: 0.9681, Adjusted R-squared: 0.9677
## F-statistic: 2386 on 3 and 236 DF, p-value: < 2.2e-16
fit <- auto.arima(GasPrices$Unleaded, xreg=as.matrix(GasPrices[,c("Crude_price", "CPI", "Demand")]))
summary(fit)
## Series: GasPrices$Unleaded
## Regression with ARIMA(1,0,2) errors
##
## Coefficients:
## ar1 ma1 ma2 intercept Crude_price CPI Demand
## 0.7134 0.4971 0.1281 -207.2742 1.6327 1.6563 0.3211
## s.e. 0.0955 0.0997 0.0950 57.5287 0.1951 0.3035 0.2907
##
## sigma^2 estimated as 92.26: log likelihood=-880.77
## AIC=1777.55 AICc=1778.17 BIC=1805.39
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.06211942 9.463917 6.632496 -0.1367517 3.007046 0.6452159
## ACF1
## Training set -0.00322775
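To forecast unleaded prices from this model, future values of Crude_price, CPI, and Demand must be supplied through xreg, just as in the uschange example. A minimal sketch, assuming (purely for illustration) that the regressors stay at their last observed values for the next 12 months:
last_row <- as.matrix(GasPrices[nrow(GasPrices), c("Crude_price", "CPI", "Demand")])
future_x <- last_row[rep(1, 12), ]            # hold the regressors at their last observed values
fcast_gas <- forecast(fit, xreg = future_x)   # forecast horizon taken from nrow(future_x)
autoplot(fcast_gas)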