# Forecasting Time Series Groups in the tidyverse

#### 2018-03-03

Extending broom to time series forecasting

One of the most powerful benefits of sweep is that it helps forecasting at scale within the “tidyverse”. There are two common situations:

1. Applying a model to groups of time series
2. Applying multiple models to a time series

In this vignette we’ll review how sweep can help the first situation: Applying a model to groups of time series.

# Prerequisites

Before we get started, load the following packages.

library(forecast)
library(tidyquant)
library(timetk)
library(sweep)

# Bike Sales

We’ll use the bike sales data set, bike_sales, provided with the sweep package for this tutorial. The bike_sales data set is a fictional daily order history that spans 2011 through 2015. It simulates a sales database that is typical of a business. The customers are the “bike shops” and the products are the “models”.

bike_sales
## # A tibble: 15,644 x 17
##    order.date order.id order.line quantity price price.ext customer.id
##    <date>        <dbl>      <int>    <dbl> <dbl>     <dbl>       <dbl>
##  1 2011-01-07     1.00          1     1.00  6070      6070        2.00
##  2 2011-01-07     1.00          2     1.00  5970      5970        2.00
##  3 2011-01-10     2.00          1     1.00  2770      2770       10.0
##  4 2011-01-10     2.00          2     1.00  5970      5970       10.0
##  5 2011-01-10     3.00          1     1.00 10660     10660        6.00
##  6 2011-01-10     3.00          2     1.00  3200      3200        6.00
##  7 2011-01-10     3.00          3     1.00 12790     12790        6.00
##  8 2011-01-10     3.00          4     1.00  5330      5330        6.00
##  9 2011-01-10     3.00          5     1.00  1570      1570        6.00
## 10 2011-01-11     4.00          1     1.00  4800      4800       22.0
## # ... with 15,634 more rows, and 10 more variables: bikeshop.name <chr>,
## #   bikeshop.city <chr>, bikeshop.state <chr>, latitude <dbl>,
## #   longitude <dbl>, product.id <dbl>, model <chr>,
## #   category.primary <chr>, category.secondary <chr>, frame <chr>

We’ll analyse the monthly sales trends for the bicycle manufacturer. Let’s transform the data set by aggregating by month.

bike_sales_monthly <- bike_sales %>%
mutate(month = month(order.date, label = TRUE),
year  = year(order.date)) %>%
group_by(year, month) %>%
summarise(total.qty = sum(quantity))
bike_sales_monthly
## # A tibble: 60 x 3
## # Groups:   year [?]
##     year month total.qty
##    <dbl> <ord>     <dbl>
##  1  2011 Jan         440
##  2  2011 Feb        2017
##  3  2011 Mar        1584
##  4  2011 Apr        4478
##  5  2011 May        4112
##  6  2011 Jun        4251
##  7  2011 Jul        1550
##  8  2011 Aug        1470
##  9  2011 Sep         975
## 10  2011 Oct         697
## # ... with 50 more rows

We can visualize package with a month plot using the ggplot2 .

bike_sales_monthly %>%
ggplot(aes(x = month, y = total.qty, group = year)) +
geom_area(aes(fill = year), position = "stack") +
labs(title = "Quantity Sold: Month Plot", x = "", y = "Sales",
subtitle = "March through July tend to be most active") +
scale_y_continuous() +
theme_tq()

Suppose Manufacturing wants a more granular forecast because the bike components are related to the secondary category. In the next section we discuss how sweep can help to perform a forecast on each sub-category.

# Performing Forecasts on Groups

First, we need to get the data organized into groups by month of the year. We’ll create a new “order.month” date using zoo::as.yearmon() that captures the year and month information from the “order.date” and then passing this to lubridate::as_date() to convert to date format.

monthly_qty_by_cat2 <- bike_sales %>%
mutate(order.month = as_date(as.yearmon(order.date))) %>%
group_by(category.secondary, order.month) %>%
summarise(total.qty = sum(quantity))
monthly_qty_by_cat2
## # A tibble: 538 x 3
## # Groups:   category.secondary [?]
##    category.secondary order.month total.qty
##    <chr>              <date>          <dbl>
##  1 Cross Country Race 2011-01-01      122
##  2 Cross Country Race 2011-02-01      489
##  3 Cross Country Race 2011-03-01      505
##  4 Cross Country Race 2011-04-01      343
##  5 Cross Country Race 2011-05-01      263
##  6 Cross Country Race 2011-06-01      735
##  7 Cross Country Race 2011-07-01      183
##  8 Cross Country Race 2011-08-01       66.0
##  9 Cross Country Race 2011-09-01       97.0
## 10 Cross Country Race 2011-10-01      189
## # ... with 528 more rows

Next, we use the nest() function from the tidyr package to consolidate each time series by group. The newly created list-column, “data.tbl”, contains the “order.month” and “total.qty” columns by group from the previous step. The nest() function just bundles the data together which is very useful for iterative functional programming.

monthly_qty_by_cat2_nest <- monthly_qty_by_cat2 %>%
group_by(category.secondary) %>%
nest(.key = "data.tbl")
monthly_qty_by_cat2_nest
## # A tibble: 9 x 2
##   category.secondary data.tbl
##   <chr>              <list>
## 1 Cross Country Race <tibble [60 x 2]>
## 2 Cyclocross         <tibble [60 x 2]>
## 3 Elite Road         <tibble [60 x 2]>
## 4 Endurance Road     <tibble [60 x 2]>
## 5 Fat Bike           <tibble [58 x 2]>
## 6 Over Mountain      <tibble [60 x 2]>
## 7 Sport              <tibble [60 x 2]>
## 8 Trail              <tibble [60 x 2]>
## 9 Triathalon         <tibble [60 x 2]>

## Forecasting Workflow

The forecasting workflow involves a few basic steps:

1. Step 1: Coerce to a ts object class.
2. Step 2: Apply a model (or set of models)
3. Step 3: Forecast the models (similar to predict)
4. Step 4: Tidy the forecast

## Step 1: Coerce to a ts object class

In this step we map the tk_ts() function into a new column “data.ts”. The procedure is performed using the combination of dplyr::mutate() and purrr::map(), which works really well for the data science workflow where analyses are built progressively. As a result, this combination will be used in many of the subsequent steps in this vignette as we build the analysis.

### mutate and map

The mutate() function adds a column, and the map() function maps the contents of a list-column (.x) to a function (.f). In our case, .x = data.tbl and .f = tk_ts. The arguments select = -order.month, start = 2011, and freq = 12 are passed to the ... parameters in map, which are passed through to the function. The select statement is used to drop the “order.month” from the final output so we don’t get a bunch of warning messages. We specify start = 2011 and freq = 12 to return a monthly frequency.

monthly_qty_by_cat2_ts <- monthly_qty_by_cat2_nest %>%
mutate(data.ts = map(.x       = data.tbl,
.f       = tk_ts,
select   = -order.month,
start    = 2011,
freq     = 12))
monthly_qty_by_cat2_ts
## # A tibble: 9 x 3
##   category.secondary data.tbl          data.ts
##   <chr>              <list>            <list>
## 1 Cross Country Race <tibble [60 x 2]> <S3: ts>
## 2 Cyclocross         <tibble [60 x 2]> <S3: ts>
## 3 Elite Road         <tibble [60 x 2]> <S3: ts>
## 4 Endurance Road     <tibble [60 x 2]> <S3: ts>
## 5 Fat Bike           <tibble [58 x 2]> <S3: ts>
## 6 Over Mountain      <tibble [60 x 2]> <S3: ts>
## 7 Sport              <tibble [60 x 2]> <S3: ts>
## 8 Trail              <tibble [60 x 2]> <S3: ts>
## 9 Triathalon         <tibble [60 x 2]> <S3: ts>

## Step 2: Modeling a time series

Next, we map the Exponential Smoothing ETS (Error, Trend, Seasonal) model function, ets, from the forecast package. Use the combination of mutate to add a column and map to interatively apply a function rowwise to a list-column. In this instance, the function to map the ets function and the list-column is “data.ts”. We rename the resultant column “fit.ets” indicating an ETS model was fit to the time series data.

monthly_qty_by_cat2_fit <- monthly_qty_by_cat2_ts %>%
mutate(fit.ets = map(data.ts, ets))
monthly_qty_by_cat2_fit
## # A tibble: 9 x 4
##   category.secondary data.tbl          data.ts  fit.ets
##   <chr>              <list>            <list>   <list>
## 1 Cross Country Race <tibble [60 x 2]> <S3: ts> <S3: ets>
## 2 Cyclocross         <tibble [60 x 2]> <S3: ts> <S3: ets>
## 3 Elite Road         <tibble [60 x 2]> <S3: ts> <S3: ets>
## 4 Endurance Road     <tibble [60 x 2]> <S3: ts> <S3: ets>
## 5 Fat Bike           <tibble [58 x 2]> <S3: ts> <S3: ets>
## 6 Over Mountain      <tibble [60 x 2]> <S3: ts> <S3: ets>
## 7 Sport              <tibble [60 x 2]> <S3: ts> <S3: ets>
## 8 Trail              <tibble [60 x 2]> <S3: ts> <S3: ets>
## 9 Triathalon         <tibble [60 x 2]> <S3: ts> <S3: ets>

At this point, we can do some model inspection with the sweep tidiers.

### sw_tidy

To get the model parameters for each nested list, we can combine sw_tidy within the mutate and map combo. The only real difference is now we unnest the generated column (named “tidy”). Last, because it’s easier to compare the model parameters side by side, we add one additional call to spread() from the tidyr package.

monthly_qty_by_cat2_fit %>%
mutate(tidy = map(fit.ets, sw_tidy)) %>%
unnest(tidy, .drop = TRUE) %>%
spread(key = category.secondary, value = estimate)
## # A tibble: 16 x 10
##    term  Cross Country Race Cyclocross Elite Road Endurance Road
##    <chr>                <dbl>      <dbl>        <dbl>            <dbl>
##  1 alpha             0.0398     0.000110     0.0651           0.107
##  2 b                NA         NA           NA               NA
##  3 beta             NA         NA           NA               NA
##  4 gamma             0.000101   0.00256      0.000100         0.000899
##  5 l               321        210          490              394
##  6 s0                0.503      0.0788       0.871            0.312
##  7 s1                1.10       1.32         0.556            1.47
##  8 s10               0.643      0.212        0.266            0.730
##  9 s2                0.375      0.0969       0.617            1.13
## 10 s3                1.12       0.349        1.52             0.554
## 11 s4                0.630      1.34         0.663            1.06
## 12 s5                2.06       2.06         0.545            1.99
## 13 s6                0.873      2.01         1.69             1.14
## 14 s7                1.64       1.38         1.91             0.836
## 15 s8                0.487      2.29         1.27             0.694
## 16 s9                1.41       0.754        1.86             1.83
## # ... with 5 more variables: Fat Bike <dbl>, Over Mountain <dbl>,
## #   Sport <dbl>, Trail <dbl>, Triathalon <dbl>

### sw_glance

We can view the model accuracies also by mapping sw_glance within the mutate and map combo.

monthly_qty_by_cat2_fit %>%
mutate(glance = map(fit.ets, sw_glance)) %>%
unnest(glance, .drop = TRUE)
## # A tibble: 9 x 13
##   category.secondary model.desc sigma logLik   AIC   BIC    ME  RMSE   MAE
##   <chr>              <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cross Country Race ETS(M,N,M) 1.07    -464   957   989  33.1   379   275
## 2 Cyclocross         ETS(M,N,M) 1.14    -409   848   879 -79.8   271   189
## 3 Elite Road         ETS(M,N,M) 0.905   -471   972  1004  24.6   447   347
## 4 Endurance Road     ETS(M,N,M) 0.768   -439   909   940 -35.9   294   220
## 5 Fat Bike           ETS(M,N,M) 2.76    -343   715   746 -18.4   276   113
## 6 Over Mountain      ETS(M,N,M) 0.920   -423   877   908 -35.5   209   160
## 7 Sport              ETS(M,N,M) 0.882   -427   884   915 -55.1   256   184
## 8 Trail              ETS(M,A,M) 0.750   -411   855   891 -11.4   203   133
## 9 Triathalon         ETS(M,N,M) 1.54    -410   850   881  17.6   257   148
## # ... with 4 more variables: MPE <dbl>, MAPE <dbl>, MASE <dbl>, ACF1 <dbl>

### sw_augment

The augmented fitted and residual values can be achieved in much the same manner. This returns nine groups data. Note that we pass timetk_idx = TRUE to return the date format times as opposed to the regular (yearmon or numeric) time series.

augment_fit_ets <- monthly_qty_by_cat2_fit %>%
mutate(augment = map(fit.ets, sw_augment, timetk_idx = TRUE, rename_index = "date")) %>%
unnest(augment, .drop = TRUE)

augment_fit_ets
## # A tibble: 538 x 5
##    category.secondary date       .actual .fitted  .resid
##    <chr>              <date>       <dbl>   <dbl>   <dbl>
##  1 Cross Country Race 2011-01-01   122       373 -0.673
##  2 Cross Country Race 2011-02-01   489       201  1.43
##  3 Cross Country Race 2011-03-01   505       465  0.0864
##  4 Cross Country Race 2011-04-01   343       161  1.12
##  5 Cross Country Race 2011-05-01   263       567 -0.537
##  6 Cross Country Race 2011-06-01   735       296  1.48
##  7 Cross Country Race 2011-07-01   183       741 -0.753
##  8 Cross Country Race 2011-08-01    66.0     220 -0.700
##  9 Cross Country Race 2011-09-01    97.0     381 -0.745
## 10 Cross Country Race 2011-10-01   189       123  0.534
## # ... with 528 more rows

We can plot the residuals for the nine categories like so. Unfortunately we do see some very high residuals (especially with “Fat Bike”). This is often the case with realworld data.

augment_fit_ets %>%
ggplot(aes(x = date, y = .resid, group = category.secondary)) +
geom_hline(yintercept = 0, color = "grey40") +
geom_line(color = palette_light()[[2]]) +
geom_smooth(method = "loess") +
labs(title = "Bike Quantity Sold By Secondary Category",
subtitle = "ETS Model Residuals", x = "") +
theme_tq() +
facet_wrap(~ category.secondary, scale = "free_y", ncol = 3) +
scale_x_date(date_labels = "%Y")

### sw_tidy_decomp

We can create decompositions using the same procedure with sw_tidy_decomp() and the mutate and map combo.

monthly_qty_by_cat2_fit %>%
mutate(decomp = map(fit.ets, sw_tidy_decomp, timetk_idx = TRUE, rename_index = "date")) %>%
unnest(decomp)
## # A tibble: 538 x 6
##    category.secondary date       observed level season slope
##    <chr>              <date>        <dbl> <dbl>  <dbl> <dbl>
##  1 Cross Country Race 2011-01-01    122     313  1.16     NA
##  2 Cross Country Race 2011-02-01    489     331  0.643    NA
##  3 Cross Country Race 2011-03-01    505     332  1.41     NA
##  4 Cross Country Race 2011-04-01    343     347  0.487    NA
##  5 Cross Country Race 2011-05-01    263     339  1.64     NA
##  6 Cross Country Race 2011-06-01    735     359  0.873    NA
##  7 Cross Country Race 2011-07-01    183     348  2.06     NA
##  8 Cross Country Race 2011-08-01     66.0   339  0.630    NA
##  9 Cross Country Race 2011-09-01     97.0   329  1.12     NA
## 10 Cross Country Race 2011-10-01    189     336  0.375    NA
## # ... with 528 more rows

## Step 3: Forecasting the model

We can also forecast the multiple models again using a very similar approach with the forecast function. We want a 12 month forecast so we add the argument for the h = 12 (refer to ?forecast for all of the parameters you can add, there’s quite a few).

monthly_qty_by_cat2_fcast <- monthly_qty_by_cat2_fit %>%
mutate(fcast.ets = map(fit.ets, forecast, h = 12))
monthly_qty_by_cat2_fcast
## # A tibble: 9 x 5
##   category.secondary data.tbl          data.ts  fit.ets   fcast.ets
##   <chr>              <list>            <list>   <list>    <list>
## 1 Cross Country Race <tibble [60 x 2]> <S3: ts> <S3: ets> <S3: forecast>
## 2 Cyclocross         <tibble [60 x 2]> <S3: ts> <S3: ets> <S3: forecast>
## 3 Elite Road         <tibble [60 x 2]> <S3: ts> <S3: ets> <S3: forecast>
## 4 Endurance Road     <tibble [60 x 2]> <S3: ts> <S3: ets> <S3: forecast>
## 5 Fat Bike           <tibble [58 x 2]> <S3: ts> <S3: ets> <S3: forecast>
## 6 Over Mountain      <tibble [60 x 2]> <S3: ts> <S3: ets> <S3: forecast>
## 7 Sport              <tibble [60 x 2]> <S3: ts> <S3: ets> <S3: forecast>
## 8 Trail              <tibble [60 x 2]> <S3: ts> <S3: ets> <S3: forecast>
## 9 Triathalon         <tibble [60 x 2]> <S3: ts> <S3: ets> <S3: forecast>

## Step 4: Tidy the forecast

Next, we can apply sw_sweep to get the forecast in a nice “tidy” data frame. We use the argument fitted = FALSE to remove the fitted values from the forecast (leave off if fitted values are desired). We set timetk_idx = TRUE to use dates instead of numeric values for the index. We’ll use unnest() to drop the left over list-columns and return an unnested data frame.

monthly_qty_by_cat2_fcast_tidy <- monthly_qty_by_cat2_fcast %>%
mutate(sweep = map(fcast.ets, sw_sweep, fitted = FALSE, timetk_idx = TRUE)) %>%
unnest(sweep)
monthly_qty_by_cat2_fcast_tidy
## # A tibble: 646 x 8
##    category.secondary index      key    total.qty lo.80 lo.95 hi.80 hi.95
##    <chr>              <date>     <chr>      <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Cross Country Race 2011-01-01 actual     122      NA    NA    NA    NA
##  2 Cross Country Race 2011-02-01 actual     489      NA    NA    NA    NA
##  3 Cross Country Race 2011-03-01 actual     505      NA    NA    NA    NA
##  4 Cross Country Race 2011-04-01 actual     343      NA    NA    NA    NA
##  5 Cross Country Race 2011-05-01 actual     263      NA    NA    NA    NA
##  6 Cross Country Race 2011-06-01 actual     735      NA    NA    NA    NA
##  7 Cross Country Race 2011-07-01 actual     183      NA    NA    NA    NA
##  8 Cross Country Race 2011-08-01 actual      66.0    NA    NA    NA    NA
##  9 Cross Country Race 2011-09-01 actual      97.0    NA    NA    NA    NA
## 10 Cross Country Race 2011-10-01 actual     189      NA    NA    NA    NA
## # ... with 636 more rows

Visualization is just one final step.

monthly_qty_by_cat2_fcast_tidy %>%
ggplot(aes(x = index, y = total.qty, color = key, group = category.secondary)) +
geom_ribbon(aes(ymin = lo.95, ymax = hi.95),
fill = "#D5DBFF", color = NA, size = 0) +
geom_ribbon(aes(ymin = lo.80, ymax = hi.80, fill = key),
fill = "#596DD5", color = NA, size = 0, alpha = 0.8) +
geom_line() +
labs(title = "Bike Quantity Sold By Secondary Category",
subtitle = "ETS Model Forecasts",
x = "", y = "Units") +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
scale_color_tq() +
scale_fill_tq() +
facet_wrap(~ category.secondary, scales = "free_y", ncol = 3) +
theme_tq() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Recap

The sweep package has a several tools to analyze grouped time series. In the next vignette we will review how to apply multiple models to a time series.