Everyone knows that traffic is much worse during the week than on Sundays. But how much worse? And how much does traffic change from month to month? Does traffic decrease in the summer because people take vacation, or do those vacationers clog the roadways? Given all of these fluctuations in traffic, how do we quantify long-term trends?

In data science, these questions fall under the heading of seasonality. A wonderful technique for addressing them is Seasonal and Trend decomposition using LOESS (STL). This is the first of a three-part series on STL. In this post, we’ll look at a test case of traffic from the John F. Kennedy Bridge in Manhattan. Part II delves into the weeds of how the algorithm works, and Part III discusses how STL can be used for imputing missing values, a key advantage over other means of decomposition.

The Data

The dataset consists of total daily tolls across the JFK bridge. The data were scraped from the MTA. All of the data aggregation and analysis, including this post in .Rmd form, is available on GitHub. The cleaned data form a daily time series running from March 2012 through September 2016.

head(jfkManhattan)
## Source: local data frame [6 x 2]
## 
##         Date  Total
##       (date)  (dbl)
## 1 2012-03-04 71.236
## 2 2012-03-05 80.855
## 3 2012-03-06 84.055
## 4 2012-03-07 86.691
## 5 2012-03-08 92.716
## 6 2012-03-09 91.656

From the plot at the top of the post, there is a clear yearly pattern. Traffic peaks in the summer and then drops, bottoming out around January or February. On top of this pattern, there is a general increase in traffic over time. Additionally, there are variations that depend on the day of the week. These are a bit harder to see because of the density of the data, but the daily variation is what makes the figure look like a wide band.

Weekly Decomposition

Let’s start by looking at the variation within each week. The decomposition expresses each day's traffic as the sum of three components: a seasonal component, a trend component, and the remainder. For each day \(t\), the traffic can be written as

\[ DailyTraffic_t = Seasonal_t + Trend_t + Remainder_t \]

STL is a particular algorithm for making this separation. The stl function exists in base R, but the stlplus package implements the same algorithm while allowing for missing data, and it has some nicer plotting features. The details of the algorithm are addressed in Part II, but the important parameter is n.p, the number of measurements in a full period of seasonal behavior. Since we are looking for a weekly effect in daily data, n.p should equal 7.
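For readers without stlplus installed, the additive split can be sketched with base R's stl. This is only an illustration on synthetic data, not the bridge data: the weekly pattern, trend, and noise level below are made-up assumptions. In base R the period is taken from the frequency of the ts object rather than from an n.p argument.

```r
# Synthetic daily series: an invented weekly pattern plus a slow upward trend
set.seed(1)
n <- 7 * 52                                   # one year of whole weeks
weekly <- rep(c(-10, 2, 4, 5, 6, 5, -12), length.out = n)
trend  <- seq(80, 90, length.out = n)
y <- ts(weekly + trend + rnorm(n, sd = 1), frequency = 7)  # period of 7 days

# Base R STL; s.window must be odd, 25 mirrors the value used in this post
fit  <- stl(y, s.window = 25)
comp <- fit$time.series                       # columns: seasonal, trend, remainder

# The three components add back up to the original series
recon <- as.numeric(comp[, "seasonal"] + comp[, "trend"] + comp[, "remainder"])
maxError <- max(abs(recon - as.numeric(y)))
print(maxError)
```

The reconstruction error should be at the level of floating point rounding, since the remainder is defined as whatever is left after the seasonal and trend components are subtracted.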

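library(stlplus)

# Labels for the seven positions in the weekly cycle; the series starts on a Sunday
weekDays <- c("Su", "M", "Tu", "W", "Th", "F", "Sa")

# n.p = 7 sets a weekly period; sub.start = 1 says the first observation is "Su"
stlDaily <- stlplus(jfkManhattan$Total, t = jfkManhattan$Date,
                    n.p = 7, s.window = 25,
                    sub.labels = weekDays, sub.start = 1)

plot(stlDaily, xlab = "Date", ylab = "Daily Vehicles (thous.)")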
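
The call above uses sub.start=1, which lines the labels up correctly only if the first observation falls on the first label in weekDays. A quick way to check this, sketched here with the first date shown in the head() output hard-coded rather than read from jfkManhattan:

```r
# First date in the series, copied from the head() output above
firstDate <- as.Date("2012-03-04")
weekDays  <- c("Su", "M", "Tu", "W", "Th", "F", "Sa")

# POSIXlt's wday runs from 0 (Sunday) to 6 (Saturday), independent of locale
wday <- as.POSIXlt(firstDate)$wday
firstLabel <- weekDays[wday + 1]
print(firstLabel)  # "Su"
```

If the series began on a different day, sub.start would need to be set to the position of that day in weekDays.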