# Install necessary packages
list_packages <- c("AER", "dynlm", "fpp", "fpp2",
"forecast", "readxl", "stargazer", "scales",
"quantmod", "urca", "vars", "tseries", "sarima")
new_packages <- list_packages[!(list_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load necessary packages
lapply(list_packages, require, character.only = TRUE)
A time series model exhibits periodic behavior with period \(s\), if similarities in the time series data occurs after a \(s\) time interval.
A particular kind of periodic behavior is seasonal behavior in which there are similar “with-in-year” patterns from year to year, e.g., toy sales, airlines usage.
If the basic period is a month, then the period is \(s=12\). If the data is collected by quarters, then the period is s=4, and there are other possibilities where there can be more than one periodicity, e.g., daily observed data may weakly, monthly, quarterly, yearly patterns.
ARIMA models can be employed successfully to fit and forecast seasonal time series data. The techniques require no new development.
One useful representation of seasonal data is as data from a two way table. Here is an example
Year \(\downarrow\) + month \(\to\) | Jan | Feb | Mar | … … … … | Dec |
---|---|---|---|---|---|
\(1\) | \(Z_1\) | \(Z_2\) | \(Z_3\) | … … … … | \(Z_{12}\) |
\(2\) | \(Z_{13}\) | \(Z_{14}\) | \(Z_{15}\) | … … … … | \(Z_{24}\) |
\(3\) | \(Z_{25}\) | \(Z_{26}\) | \(Z_{27}\) | … … … … | \(Z_{36}\) |
… | … | … | … | … … … … | … |
\(r\) | \(Z_{12(r-1)+1}\) | \(Z_{12(r-1)+2}\) | \(Z_{12(r-1)+3}\) | … … … … | \(Z_{12r}\) |
Within each year (row), the data represents a type of month to month pattern.
Within each month (column), the data represents a type of year to year time pattern.
This is the basic for one useful method for modeling seasonal time series data by multiplicative models. This model attempts to model these two types of patterns separately and then combine both features afterwards.
More specifically, we will build two models: one ARIMA for the column and one ARIMA for the row. We will combine these two models to have seasonal ARIMA.
The assumptions and notations
\(s\)-backward operator: \(B^s\)
\(B^s Z_t = Z_{t-s}\)
\(s\)-difference operator: \(\nabla_s = 1-B^s\)
\(\nabla_s Z_t = Z_t - Z_{t-s}\)
\(\nabla_s^D Z_t = (1-B^s)^D Z_t\) is the \(D\)-th difference operator on the between period time series data.
Denote the \(P\) AR parameters of this model as \(\Phi_s\), \(\Phi_{2s}\),…, and \(\Phi_{Ps}\), and the autoregressive polynomial is defined as \[\Phi_P(B^s)=1-\Phi_s B^s - \Phi_{2s} B^{2s} - ... - \Phi_{Ps} B^{Ps}.\]
Denote the \(Q\) MA parameters of this model as \(\Theta_s\), \(\Theta_{2s}\),…, and \(\Theta_{Qs}\), and the moving average polynomial is defined as \[\Theta_Q(B^s)=1-\Theta_s B^s - \Theta_{2s} B^{2s} - ... - \Theta_{Qs} B^{Qs}.\]
Let \(\nabla_s^D = (1-B^s)^D\), then the between period model, i.e., the pure seasonal ARIMA model, can be written as
\[ \Phi_P(B^s) \nabla_s^D \tilde{Z}_t = \Theta_Q(B^s)a_t \]
When \(D=0\), we have
\[ \tilde{Z}_t = \Phi_s \tilde{Z}_{t-s} + \Phi_{2s} \tilde{Z}_{t-2s} ... + \Phi_{Ps} \tilde{Z}_{t-Ps} + a_t - \Theta_s a_{t-s} - \Theta_{2s} a_{t-2s} - ... - \Theta_{2Q} a_{t-2Q} \]
\[ \tilde{Z}_t = \Phi_{12} \tilde{Z}_{t-12} + a_t \]
These pure seasonal models have so far treated the period to period process as an independent process, i.e., no correlation at the lag other than the integer multiple of \(s\), which is a deficiency.
Therefore, to take consideration of all the dynamics (both between period and within period), we further assume the error component \(a_t\) are correlated. In fact, we model these \(a_t\) as another ARIMA(p,d,q), i.e., \(a_t\) satisfies
\[ (1-\phi_1B-...-\phi_p B^p)(1-B)^d a_t = (1-\theta_1 B - ... - \theta_q B^q) \epsilon_t \] where \(\epsilon_t\) is white noise. It can be written as
\[ \phi(B)(1-B)^d a_t = \theta(B) \epsilon_t \]
Or equivalently,
\[ a_t = \phi^{-1}(B)(1-B)^{-d} \theta(B) \epsilon_t \]
Substitute it into the pure seasonal model \(\Phi_P(B^s)\nabla_s^D \tilde{Z}_t = \Theta_Q(B^s)a_t\), we have the seasonal ARIMA model as
\[ \Phi_P(B^s)\nabla_s^D \tilde{Z}_t = \Theta_Q(B^s)\phi^{-1}(B)(1-B)^{-d} \theta(B) \epsilon_t \]
which is equivalent to
\[ \Phi_P(B^s) \phi(B) (1-B^s)^D (1-B)^{d} \tilde{Z}_t = \Theta_Q (B^s) \theta(B) \epsilon_t \]
This is called the Box-Jenkins Seasonal Multiplicative Model of order \((p,d,q)\times (P,D,Q)_s\).
The nonseasonal part (within period) is governed by \((p,d,q)\) with AR in \(p\) and MA in \(q\) and number of differencing in \(d\), \(Z_t-Z_{t-1}\). The seasonal part (between period) is governed by \((P,D,Q)_s\) with seasonal AR in \(P\) and seasonal MA in \(Q\) and the number of seasonal differencing in \(D\), \(Z_t-Z_{t-s}\).
\[ (1-\Phi_{12} B^{12}) \tilde{Z}_t = a_t - \Theta_{12} a_{t-12} \]
\[ (1-\phi_{1} B) (1-\Phi_{12} B^{12})\tilde{Z}_t = a_t \]
n=200000
ts1 <- sim_sarima(n,model=list(ar=0.4, sar=0.8, nseasons=12, sigma2 = 1))
par(mfrow=c(1,3))
acf(ts1,ylim=c(-1,1))
pacf(ts1,ylim=c(-1,1))
plot(ts1[1:100],type="l")
It can be proved that the ACF is symmetric around period intervals, i.e., \(\rho_{s+1}=\rho_{s-1}\).
\[ \tilde{Z}_t = (1-\Theta_{12} B^{12})(1-\theta_{1}B)a_t \]
\[ \tilde{Z}_t = a_t - \Theta_s a_{t-s} \]
\[ (1-\phi_1 B )(1-B^{12})\tilde{Z}_t = a_t \]
\[ (1- B )(1-B^{12})\tilde{Z}_t = a_t \]
\[ (1- \phi_1 B )(1- \Phi_{12} B^{12})\tilde{Z}_t = a_t \]
n=200000
ts1 <- sim_sarima(n,model=list(sar=0.9,ar=0.9, nseasons=12, sigma2 = 1))
par(mfrow=c(1,3))
acf(ts1,ylim=c(-1,1))
pacf(ts1,ylim=c(-1,1))
plot(ts1[1:100],type="l")
These examples give us some intuition of the ACF of seasonal ARIMA.
Seasonal ARIMA presents no new problems in terms of diagnostic checking. We simply check adequacy of the a ARIMA model.
Forecast also presents no new challenges.
The condition of stationarity and invertibility for seasonal ARIMA is a direct extension of regular ARIMA. Namely, the seasonal ARIMA is stationary if the roots of both \(\phi_p(B)=0\) and \(\Phi_P(B^s)=0\) lie outside of unit circle. The seasonal ARIMA is invertible if the roots of both \(\theta_q(B)=0\) and \(\Theta_Q(B^s)=0\) lie outside of unit circle.
Case 9: air-carrier freight
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1969 1299 1148 1345 1363 1374 1533 1592 1687 1384 1388 1295 1489
## 1970 1403 1243 1466 1434 1520 1689 1763 1834 1494 1439 1327 1554
## 1971 1405 1252 1424 1517 1483 1605 1775 1840 1573 1617 1485 1710
## 1972 1563 1439 1669 1651 1654 1847 1931 2034 1705 1725 1687 1842
## 1973 1696 1534 1814 1796 1822 2008 2088 2230 1843 1848 1736 1826
## 1974 1766 1636 1921 1882 1910 2034 2047 2195 1765 1818 1634 1818
## 1975 1698 1520 1820 1689 1775 1968 2110 2241 1803 1899 1762 1901
## 1976 1839 1727 1954 1991 1988 2146 2301 2338 1947 1990 1832 2066
## 1977 1952 1747 2098 2057 2060 2240 2425 2515 2128 2255 2116 2315
## 1978 2143 1948 1460 2344 2363 2630 2811 2972 2515 2536 2414 2545
Visual check finds the variance increases as time goes by, therefore, we use box-cox transformation with lambda=0, i.e., log transformation.
The variance is stablized. Because of the ACF, we consider it to be nonstationary and take a 1st order difference.
The ACF at lags 12, 24, 36 are still decreasing slowly, therefore, we take a seasonal 1st order difference.
## Warning in adf.test(.): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -6.1843, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
The ACF now seems to be stationary. Perform ADF test for stationarity also confirms it.
The spike at lag 1 suggests a MA(1). The spike at lag 12 suggests a seasonal MA(1). Therefore, we fit a ARIMA(0,0,1)(0,0,1)[12] to the both seasonally and nonseasonally differenced series.
(case9 %>% log() %>% diff() %>% diff(lag=12) %>%
Arima(order=c(0,0,1), seasonal=c(0,0,1),include.constant = FALSE))
## Series: .
## ARIMA(0,0,1)(0,0,1)[12] with zero mean
##
## Coefficients:
## ma1 sma1
## -0.6747 -0.5616
## s.e. 0.0803 0.1891
##
## sigma^2 estimated as 0.0035: log likelihood=149.15
## AIC=-292.3 AICc=-292.06 BIC=-284.28
This model is equivalent to fit a ARIMA(0,1,1)(0,1,1)[12] to the original series with a log transformation. We further check its residuals.
## Series: case9
## ARIMA(0,1,1)(0,1,1)[12]
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ma1 sma1
## -0.6747 -0.5615
## s.e. 0.0803 0.1891
##
## sigma^2 estimated as 0.003506: log likelihood=149.15
## AIC=-292.3 AICc=-292.06 BIC=-284.28
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(0,1,1)[12]
## Q* = 6.4077, df = 22, p-value = 0.9995
##
## Model df: 2. Total lags used: 24
The residuals looks all right. Finally, we take our selected model to forecast the next 3 years.
Note that we can compare the model with auto.arima
and the suggested model is slightly different.
## Series: log(case9)
## ARIMA(1,0,1)(1,1,0)[12] with drift
##
## Coefficients:
## ar1 ma1 sar1 drift
## 0.8730 -0.5277 -0.4438 0.0049
## s.e. 0.0792 0.1311 0.1397 0.0012
##
## sigma^2 estimated as 0.003454: log likelihood=153.48
## AIC=-296.95 AICc=-296.36 BIC=-283.54
European quarterly retail trade: We will describe the seasonal ARIMA modelling procedure using quarterly European retail trade data from 1996 to 2011.
## Qtr1 Qtr2 Qtr3 Qtr4
## 1996 89.13 89.52 89.88 90.12
## 1997 89.19 89.78 90.03 90.38
## 1998 90.27 90.77 91.85 92.51
## 1999 92.21 92.52 93.62 94.15
## 2000 94.69 95.34 96.04 96.30
## 2001 94.83 95.14 95.86 95.83
## 2002 95.73 96.36 96.89 97.01
## 2003 96.66 97.76 97.83 97.76
## 2004 98.17 98.55 99.31 99.44
## 2005 99.43 99.84 100.32 100.40
## 2006 99.88 100.19 100.75 101.01
## 2007 100.84 101.34 101.94 102.10
## 2008 101.56 101.48 101.13 100.34
## 2009 98.93 98.31 97.67 97.44
## 2010 96.53 96.56 96.51 96.70
## 2011 95.88 95.84 95.79 95.94
## Warning in adf.test(.): p-value greater than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -0.17892, Lag order = 3, p-value = 0.99
## alternative hypothesis: stationary
## Warning in adf.test(.): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -5.5772, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
The significant spike at lag 1 in the ACF suggests a non-seasonal MA(1) component, and the significant spike at lag 4 in the ACF suggests a seasonal MA(1) component. Consequently, we begin with an ARIMA(0,1,1)(0,1,1)[4] model
Both the ACF and PACF show significant spikes at lag 2, and almost significant spikes at lag 3, indicating that some additional non-seasonal terms need to be included in the model.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,3)(0,1,1)[4]
## Q* = 0.51128, df = 4, p-value = 0.9724
##
## Model df: 4. Total lags used: 8
We tried other models with AR terms as well, but none that gave a smaller AICc value.
we now have a seasonal ARIMA model that passes the required checks and is ready for forecasting.
We could have used auto.arima() to do most of this work for us
## Series: euretail
## ARIMA(0,1,3)(0,1,1)[4]
##
## Coefficients:
## ma1 ma2 ma3 sma1
## 0.2630 0.3694 0.4200 -0.6636
## s.e. 0.1237 0.1255 0.1294 0.1545
##
## sigma^2 estimated as 0.156: log likelihood=-28.63
## AIC=67.26 AICc=68.39 BIC=77.65
The auto.arima()
function uses nsdiffs()
to determine \(D\) (the number of seasonal differences to use), and ndiffs()
to determine \(d\) (the number of ordinary differences to use).
## [1] 1
## [1] 2
Example: Corticosteroid drug sales in Australia
lh02 <- log(h02)
cbind("H02 sales (million scripts)" = h02,
"Log H02 sales"=lh02) %>%
autoplot(facets=TRUE) + xlab("Year") + ylab("")
The data are strongly seasonal and obviously non-stationary, so seasonal differencing will be used. The seasonally differenced data are shown below. It is not clear at this point whether we should do another difference or not. We decide not to, but the choice is not obvious.
In the plots of the seasonally differenced data, there are spikes in the PACF at lags 12 and 24, but nothing at seasonal lags in the ACF. This may be suggestive of a seasonal AR(2) term.
In the non-seasonal lags, there are three significant spikes in the PACF, suggesting a possible AR(3) term. The pattern in the ACF is not indicative of any simple model.
We fit this model, along with some variations on it
## Series: h02
## ARIMA(3,0,0)(2,1,0)[12]
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ar3 sar1 sar2
## 0.0983 0.4060 0.4311 -0.4361 -0.290
## s.e. 0.0695 0.0611 0.0726 0.0795 0.078
##
## sigma^2 estimated as 0.004622: log likelihood=243.79
## AIC=-475.58 AICc=-475.12 BIC=-456.03
## Series: h02
## ARIMA(3,0,1)(0,1,2)[12]
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ar3 ma1 sma1 sma2
## -0.1603 0.5481 0.5678 0.3827 -0.5222 -0.1768
## s.e. 0.1636 0.0878 0.0942 0.1895 0.0861 0.0872
##
## sigma^2 estimated as 0.004278: log likelihood=250.04
## AIC=-486.08 AICc=-485.48 BIC=-463.28
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,1)(0,1,2)[12]
## Q* = 50.712, df = 30, p-value = 0.01045
##
## Model df: 6. Total lags used: 36
The model fails the Ljung-Box test.
Next we will try using the automatic ARIMA algorithm. Running auto.arima()
with all arguments left at their default values
## Series: lh02
## ARIMA(2,1,1)(0,1,2)[12]
##
## Coefficients:
## ar1 ar2 ma1 sma1 sma2
## -1.1358 -0.5753 0.3683 -0.5318 -0.1817
## s.e. 0.1608 0.0965 0.1884 0.0838 0.0881
##
## sigma^2 estimated as 0.004278: log likelihood=248.25
## AIC=-484.51 AICc=-484.05 BIC=-465
However, the model still fails the Ljung-Box test for 36 lags
Sometimes it is just not possible to find a model that passes all of the tests.
h02 %>%
Arima(order=c(3,0,1), seasonal=c(0,1,2), lambda=0) %>%
forecast() %>%
autoplot() +
ylab("H02 sales (million scripts)") + xlab("Year")
Case 10 After tax profits measured in cents per dollar of sales for all US manufacturing companies. Quarterly data from 1953-1972.
case10=ts(scan("case10.txt"),start=c(1953,1), end=c(1972,4),frequency=4)
case10 %>% ggtsdisplay(lag.max=40)
From the ACF and PACF, we don’t see anything similar to nonstationary time series. However, we take 1st order difference anyway.
This is important because taking 1st order difference should be done routinely whenever the data might have seasonal or other periodic variation, even if the nonseasonal difference is not needed. Often the nature of a seasonal pattern emerges more clearly in the acf of the differenced series.
The ACF at lags 4 and 8 suggest a seasonal MA(2). Note that the PACF at lags 4, 8, 12, and 16 are expoentially decaying. Therefore, maybe we should fit an SARIMA(1,0,0)(0,0,2)[4]. Note that we didn’t need to take the first order difference, but took it anyway. That’s why it is AR(1) not \(d=1\).
(case10 %>% Arima(order=c(1,0,0), seasonal=c(0,0,2)))
case10 %>% Arima(order=c(1,0,0), seasonal=c(0,0,2)) %>% checkresiduals()
It seems the SARIMA(1,0,0)(0,0,2)[4] fits the data well.
Even if we miss the 1st order difference at the begining, we can still identify this model. Here is another way. Suppose we start with AR(1) based on the ACF and PACF.
From the residuals ACF and PACF, it suggest a seasonal MA(2). So we add it to the original model.
Finally, we take this model and perform forecasting.
## Series: .
## ARIMA(1,0,0)(0,0,2)[4] with non-zero mean
##
## Coefficients:
## ar1 sma1 sma2 mean
## 0.9489 -0.5920 -0.4079 4.7540
## s.e. 0.0440 0.1542 0.1164 0.0658
##
## sigma^2 estimated as 0.05145: log likelihood=2.15
## AIC=5.7 AICc=6.52 BIC=17.61