This installment of a three-part series on STL decomposition focuses on a sketch of the algorithm. It is not a rigorous treatment, but it is hopefully thorough enough to provide a mathematical understanding of how the various hyperparameters affect the decomposition. This post is a bit heavy on the mathematics; for an introduction to STL, please see Part I.

STL (Seasonal Trend decomposition using Loess) was developed by Cleveland et al. [(Journal of Official Statistics 6, No. 1, pp. 3-33, 1990)](http://www.wessa.net/download/stl.pdf). The core idea is that a time series can be decomposed into three components, seasonal, trend, and remainder: \(Y_\nu = T_\nu + S_\nu + R_\nu\) for \(\nu = 1\) to \(N\) measured data points. The algorithmic details are not commonly discussed in time series texts, in part because, unlike other methods, there is no loss function to be minimized. This is because the notion of seasonal variation is intrinsically ambiguous: whether temporal variation should be considered seasonal, trend, or remainder is, to a degree, a matter of opinion, determined by the choice of model and model parameters. This is true of STL as well as of any other seasonal decomposition approach.

Key to the STL approach is loess (LOcal regrESSion) smoothing. For a set of measurements \(y_i\) at points \(x_i\), loess provides a smooth estimate \(g(x)\) for \(y\) at all values of \(x\), not just the values \(x_i\) at which \(y\) has been measured. To calculate \(g\), a positive integer \(q\) is chosen; a larger \(q\) yields greater smoothing. The \(q\) values of \(x_i\) closest to \(x\) are selected, and each is weighted by its distance from \(x\). The weight is \(v_i = W\left(\frac{|x_i - x|}{\lambda_q(x)}\right)\), where \(W\) is the tricube weight function and \(\lambda_q(x)\) is the distance from \(x\) to the \(q\)th farthest point (for \(q < N\); if \(q \ge N\), an additional scale factor is used). A polynomial of degree \(d\) (typically one or two) is then fit to the selected \((x_i, y_i)\) with weights \(v_i\). In addition, each data point \((x_i, y_i)\) may carry a robustness weight \(\rho_i\), which allows some data points to count more heavily in the regression; if robustness weights exist, the fit uses the weights \(\rho_i v_i\). For STL, the selection of the smoothing parameter \(q\), and to a lesser extent \(d\), is a key model choice.
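To make this concrete, here is a minimal sketch in R of loess evaluated at a single point \(x_0\). It assumes \(q < N\) and no robustness weights, and the name loess_at is purely illustrative (R's built-in loess function does all of this and more):

# Tricube-weighted local regression evaluated at a single point x0.
# Assumes q < length(x); polynomial degree d defaults to 1.
loess_at <- function(x0, x, y, q, d = 1) {
  dist <- abs(x - x0)
  lambda_q <- sort(dist)[q]             # distance to the q-th closest point
  u <- dist / lambda_q
  v <- ifelse(u < 1, (1 - u^3)^3, 0)    # tricube weights; 0 outside the neighborhood
  fit <- lm(y ~ poly(x, d), weights = v)
  predict(fit, newdata = data.frame(x = x0))
}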

Algorithmic sketch

The goal is to separate a time series \(Y_\nu\), for \(\nu = 1\) to \(N\), into trend, seasonal, and remainder components: \(Y_\nu = T_\nu + S_\nu + R_\nu\). This is done through two loops. In the outer loop, robustness weights are assigned to each data point based on the size of its remainder, which reduces or eliminates the effects of outliers. The inner loop iteratively updates the trend and seasonal components. First, the current estimate of the trend is subtracted from the raw series. The detrended series is then partitioned into cycle-subseries (e.g., for monthly data with a yearly season, there will be 12 cycle-subseries: all Januarys form one series, all Februarys a second, and so on). The cycle-subseries are loess smoothed and then passed through a low-pass filter; the seasonal component is the smoothed cycle-subseries minus the result of the low-pass filter. The seasonal component is subtracted from the raw data, and the result is loess smoothed to become the trend. What is left over is the remainder. In the outline below, the notation follows the Cleveland paper; a simplified code sketch follows the outline.

  1. Initialize the trend and remainder as \(T_\nu^{(0)} = 0\) and \(R_\nu^{(0)} = 0\)
  2. Outer loop - Calculate robustness weights from the bisquare function[1]. Run \(n_{(o)}\) times
     1. Inner loop - Iteratively calculate the trend and seasonal terms. Run \(n_{(i)}\) times
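
Below is a highly simplified sketch of these loops in R, intended only to show the structure. The smoothers and low-pass filter are passed in as placeholder functions (in the paper, the low-pass filter is a sequence of moving averages followed by a loess pass), endpoint handling is ignored, and it assumes \(n_{(i)} \ge 1\). None of these names come from the stlplus package.

# Structural sketch of STL, not a faithful implementation.
# smooth_seasonal, smooth_trend, and lowpass stand in for the loess
# smoothers and low-pass filter described in the Cleveland paper.
stl_sketch <- function(y, n_p, n_i, n_o,
                       smooth_seasonal, smooth_trend, lowpass) {
  N <- length(y)
  trend <- rep(0, N)            # step 1: T^(0) = 0
  rho <- rep(1, N)              # robustness weights start at 1
  for (o in seq_len(max(n_o, 1))) {
    for (i in seq_len(n_i)) {
      detrended <- y - trend
      # smooth each cycle-subseries (e.g., all Mondays) separately
      C <- numeric(N)
      for (k in seq_len(n_p)) {
        idx <- seq(k, N, by = n_p)
        C[idx] <- smooth_seasonal(detrended[idx], rho[idx])
      }
      L <- lowpass(C)                  # remove low-frequency power
      seasonal <- C - L
      trend <- smooth_trend(y - seasonal, rho)
    }
    remainder <- y - trend - seasonal
    # bisquare robustness weights (see footnote [1]) downweight outliers
    u <- abs(remainder) / (6 * median(abs(remainder)))
    rho <- ifelse(u < 1, (1 - u^2)^2, 0)
  }
  list(seasonal = seasonal, trend = trend, remainder = remainder)
}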

Model parameters

There are six major parameters in the model. The parameter names used in the stlplus package are in parentheses:

  * \(n_{(p)}\) (n.p) - the number of observations in each cycle of the seasonal component, e.g. 7 for daily data with weekly seasonality
  * \(n_{(i)}\) (inner) - the number of passes through the inner loop
  * \(n_{(o)}\) (outer) - the number of robustness iterations of the outer loop
  * \(n_{(l)}\) (l.window) - the span of the loess window for the low-pass filter
  * \(n_{(t)}\) (t.window) - the span of the loess window for trend smoothing
  * \(n_{(s)}\) (s.window) - the span of the loess window for seasonal smoothing
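
For orientation, a call that sets all six explicitly might look like the following; the series x, the time vector dates, and the particular window values are placeholders:

# all six primary parameters set explicitly (values are illustrative)
fit <- stlplus(x, t = dates, n.p = 7,
               s.window = 25, t.window = 19, l.window = 7,
               inner = 2, outer = 1)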

In addition to these six primary parameters, the degree of the loess smoothing can be changed, though this is rarely needed; the default is \(d = 1\) for all of the smoothing (s.degree, t.degree, and l.degree in stlplus). There are more parameters in the stlplus function related to the minutiae of handling the ends of the time series (such as which cycle-subseries is at the beginning), and some related to computational details that (I believe) should not affect the resulting model.

Of all of these parameters, the most important for the modeler to consider are \(n_{(p)}\) and \(n_{(s)}\). The choice of \(n_{(p)}\) follows from which seasonal behavior you want to capture. As shown in Part I, traffic has weekly seasonality but also monthly variation. When multiple seasonalities exist, it is typically best to handle the highest-frequency variation first, then aggregate and analyze the slower variations: e.g., find the weekly variation first, account for that seasonality, and then aggregate to monthly data to look at monthly variations.
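
A sketch of that workflow, assuming the weekly fit stlDaily constructed below and that the seasonal component is extracted with stlplus's seasonal() accessor:

# remove the weekly seasonal component, then aggregate the
# deseasonalized daily series to monthly totals for a second analysis
deseasoned <- jfkManhattan$Total - seasonal(stlDaily)
monthly <- tapply(deseasoned, format(jfkManhattan$Date, "%Y-%m"), sum)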

\(n_{(s)}\) controls the smoothing of the seasonal data. This plays a huge role in determining what variation is considered ‘seasonal’ versus ‘trend’ or ‘remainder.’ The smaller the value, the less smoothing in each cycle-subseries. This causes more of the variation to be captured in the seasonal component. To see this in action, we’ll look again at the traffic data from Part I.

As a reminder, the procedure for performing the STL analysis is:

library(stlplus)
# labels for the seven cycle-subseries; sub.start = 1 means the
# series begins on the first subseries (a Sunday)
weekDays <- c("Su", "M", "Tu", "W", "Th", "F", "Sa")

stlDaily <- stlplus(jfkManhattan$Total, t = jfkManhattan$Date,
                    n.p = 7, s.window = 25,
                    sub.labels = weekDays, sub.start = 1)

Plots of the cycle-subseries, using the plot_seasonal function, show how the seasonal term is being smoothed. These plots show each cycle-subseries after subtracting off the trend component and the average value so that each cycle-subseries is centered at zero. The circles in each plot are the data and the line is the result of the seasonal smoothing. The plots below show a range of other values for \(n_{(s)}\):
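These plots can be reproduced by calling plot_seasonal on the fit, and the comparison plots by refitting with other s.window values; the values 11 and 51 below are arbitrary stand-ins for whichever range is shown:

plot_seasonal(stlDaily)   # cycle-subseries for the s.window = 25 fit

# refit with other values of s.window to compare the seasonal smoothing
for (sw in c(5, 11, 51)) {
  fit <- stlplus(jfkManhattan$Total, t = jfkManhattan$Date,
                 n.p = 7, s.window = sw,
                 sub.labels = weekDays, sub.start = 1)
  plot_seasonal(fit)
}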

With \(n_{(s)} = 5\), there is not enough smoothing of the seasonal component. The line moves too quickly, chasing changes that should probably be captured by the trend or left in the remainder.

[1] \(B(u) = (1 - u^2)^2\) for \(0 \le u < 1\); \(B(u) = 0\) for all other \(u\)