Scaling and Modeling with tidyquant

Matt Dancho

2017-08-03

Designed for the data science workflow of the tidyverse

Overview

The greatest benefit to tidyquant is the ability to apply the data science workflow to easily model and scale your financial analysis as described in R for Data Science. Scaling is the process of creating an analysis for one asset and then extending it to multiple groups. This idea of scaling is incredibly useful to financial analysts because typically one wants to compare many assets to make informed decisions. Fortunately, the tidyquant package integrates with the tidyverse making scaling super simple!

All tidyquant functions return data in the tibble (tidy data frame) format, which allows for interaction within the tidyverse. This means we can:

We’ll go through some useful techniques for getting and manipulating groups of data.

Prerequisites

Load the tidyquant package to get started.

# Loads tidyquant, tidyverse, lubridate, xts, quantmod, TTR 
library(tidyquant)  

Scaling the Getting of Financial Data

A very basic example is retrieving the stock prices for multiple stocks. There are three primary ways to do this:

Method 1: Map a character vector with multiple stock symbols

c("AAPL", "GOOG", "FB") %>%
    tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01")
## # A tibble: 756 x 8
##    symbol       date    open    high     low  close   volume  adjusted
##     <chr>     <date>   <dbl>   <dbl>   <dbl>  <dbl>    <dbl>     <dbl>
##  1   AAPL 2016-01-04 105.783 108.628 105.154 105.35 67649400 102.19038
##  2   AAPL 2016-01-05 109.020 109.123 105.576 102.71 55791000  99.62955
##  3   AAPL 2016-01-06 103.669 105.535 102.958 100.70 68457400  97.67985
##  4   AAPL 2016-01-07 101.731 103.226  99.412  96.45 81094400  93.55730
##  5   AAPL 2016-01-08 101.597 102.174  99.752  96.96 70798000  94.05202
##  6   AAPL 2016-01-11 102.030 102.123 100.350  98.53 49739400  95.57493
##  7   AAPL 2016-01-12 103.659 103.803 101.896  99.96 49154200  96.96204
##  8   AAPL 2016-01-13 103.422 104.319 100.308  97.39 62439600  94.46912
##  9   AAPL 2016-01-14 100.989 103.587  98.700  99.52 63170100  96.53522
## 10   AAPL 2016-01-15  99.174 100.731  98.308  97.13 79010000  94.21691
## # ... with 746 more rows

The output is a single level tibble with all or the stock prices in one tibble. The auto-generated column name is “symbol”, which can be pre-emptively renamed by giving the vector a name (e.g. stocks <- c("AAPL", "GOOG", "FB")) and then piping to tq_get.

Method 2: Map a tibble with stocks in first column

First, get a stock list in data frame format either by making the tibble or retrieving from tq_index / tq_exchange. The stock symbols must be in the first column.

Method 2A: Make a tibble

stock_list <- tibble(stocks = c("AAPL", "JPM", "CVX"),
                     industry = c("Technology", "Financial", "Energy"))
stock_list
## # A tibble: 3 x 2
##   stocks   industry
##    <chr>      <chr>
## 1   AAPL Technology
## 2    JPM  Financial
## 3    CVX     Energy

Second, send the stock list to tq_get. Notice how the symbol and industry columns are automatically expanded the length of the stock prices.

stock_list %>%
    tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01")
## # A tibble: 756 x 9
##    stocks   industry       date    open    high     low  close   volume
##     <chr>      <chr>     <date>   <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
##  1   AAPL Technology 2016-01-04 105.783 108.628 105.154 105.35 67649400
##  2   AAPL Technology 2016-01-05 109.020 109.123 105.576 102.71 55791000
##  3   AAPL Technology 2016-01-06 103.669 105.535 102.958 100.70 68457400
##  4   AAPL Technology 2016-01-07 101.731 103.226  99.412  96.45 81094400
##  5   AAPL Technology 2016-01-08 101.597 102.174  99.752  96.96 70798000
##  6   AAPL Technology 2016-01-11 102.030 102.123 100.350  98.53 49739400
##  7   AAPL Technology 2016-01-12 103.659 103.803 101.896  99.96 49154200
##  8   AAPL Technology 2016-01-13 103.422 104.319 100.308  97.39 62439600
##  9   AAPL Technology 2016-01-14 100.989 103.587  98.700  99.52 63170100
## 10   AAPL Technology 2016-01-15  99.174 100.731  98.308  97.13 79010000
## # ... with 746 more rows, and 1 more variables: adjusted <dbl>

Method 2B: Use index or exchange

Get an index…

tq_index("DOW")
## # A tibble: 30 x 5
##    symbol                                     company     weight
##     <chr>                                       <chr>      <dbl>
##  1     BA                              Boeing Company 0.07401611
##  2     GS                    Goldman Sachs Group Inc. 0.07038297
##  3    MMM                                  3M Company 0.06389433
##  4    UNH             UnitedHealth Group Incorporated 0.06003411
##  5   AAPL                                  Apple Inc. 0.04887957
##  6    MCD                      McDonald's Corporation 0.04870850
##  7     HD                             Home Depot Inc. 0.04681416
##  8    IBM International Business Machines Corporation 0.04493225
##  9    JNJ                           Johnson & Johnson 0.04110935
## 10    TRV                    Travelers Companies Inc. 0.04004242
## # ... with 20 more rows, and 2 more variables: sector <chr>,
## #   shares_held <dbl>

…or, get an exchange.

tq_exchange("NYSE")
## # A tibble: 3,143 x 7
##    symbol                company last.sale.price market.cap ipo.year
##     <chr>                  <chr>           <dbl>      <chr>    <dbl>
##  1    DDD 3D Systems Corporation           17.01     $1.94B       NA
##  2    MMM             3M Company          205.41   $122.58B       NA
##  3   WBAI        500.com Limited           11.12   $453.34M     2013
##  4   WUBA            58.com Inc.           50.40     $7.32B     2013
##  5    AHC  A.H. Belo Corporation            5.20      $113M       NA
##  6   ATEN     A10 Networks, Inc.            7.06   $489.52M     2014
##  7    AAC     AAC Holdings, Inc.            6.06   $145.88M     2014
##  8    AIR              AAR Corp.           36.85     $1.27B       NA
##  9    AAN     Aaron&#39;s,  Inc.           45.93     $3.33B       NA
## 10    ABB                ABB Ltd           23.61    $50.38B       NA
## # ... with 3,133 more rows, and 2 more variables: sector <chr>,
## #   industry <chr>

Send the index or exchange to tq_get. Important Note: This can take several minutes depending on the size of the index or exchange, which is why only the first three stocks are evaluated in the vignette.

tq_index("DOW") %>%
    slice(1:3) %>%
    tq_get(get = "stock.prices")
## # A tibble: 7,995 x 12
##    symbol        company     weight      sector shares_held       date
##     <chr>          <chr>      <dbl>       <chr>       <dbl>     <date>
##  1     BA Boeing Company 0.07401611 Industrials     5244157 2007-01-03
##  2     BA Boeing Company 0.07401611 Industrials     5244157 2007-01-04
##  3     BA Boeing Company 0.07401611 Industrials     5244157 2007-01-05
##  4     BA Boeing Company 0.07401611 Industrials     5244157 2007-01-08
##  5     BA Boeing Company 0.07401611 Industrials     5244157 2007-01-09
##  6     BA Boeing Company 0.07401611 Industrials     5244157 2007-01-10
##  7     BA Boeing Company 0.07401611 Industrials     5244157 2007-01-11
##  8     BA Boeing Company 0.07401611 Industrials     5244157 2007-01-12
##  9     BA Boeing Company 0.07401611 Industrials     5244157 2007-01-16
## 10     BA Boeing Company 0.07401611 Industrials     5244157 2007-01-17
## # ... with 7,985 more rows, and 6 more variables: open <dbl>, high <dbl>,
## #   low <dbl>, close <dbl>, volume <dbl>, adjusted <dbl>

You can use any applicable “getter” to get data for every stock in an index or an exchange! This includes: “stock.prices”, “key.ratios”, “key.stats”, “financials”, and more.

Method 3: Use purrr to map a function

We can pipe a tibble of stock symbols to a mutation that maps the tq_get(get = "stock.prices") function. The result is all of the stock prices in nested format.

tibble(symbol = c("AAPL", "GOOG", "AMZN", "FB")) %>%
    mutate(stock.prices = map(.x = symbol, ~ tq_get(.x, get = "stock.prices")))
## # A tibble: 4 x 2
##   symbol         stock.prices
##    <chr>               <list>
## 1   AAPL <tibble [2,665 x 7]>
## 2   GOOG <tibble [2,665 x 7]>
## 3   AMZN <tibble [2,665 x 7]>
## 4     FB <tibble [1,310 x 7]>

Method 4: Compound Getters

In financial analysis, it’s very common to need data from various sources to combine in an analysis. For this reason multiple get options (“compound getters”) can be used to return a “compound get”. A quick example:

c("AAPL", "GOOG") %>%
    tq_get(get = c("stock.prices", "financials"))
## # A tibble: 2 x 3
##   symbol         stock.prices       financials
##    <chr>               <list>           <list>
## 1   AAPL <tibble [2,665 x 7]> <tibble [3 x 3]>
## 2   GOOG <tibble [2,665 x 7]> <tibble [3 x 3]>

This returns the stock prices and financials for each stock as one nested data frame! Any of the get options that accept stock symbols can be used in this manner: "stock.prices", "financials", "key.ratios", "key.stats", "dividends", and "splits".

Scaling the Mutation of Financial Data

Once you get the data, you typically want to do something with it. You can easily do this at scale. Let’s get the yearly returns for multiple stocks using tq_transmute. First, get the prices. We’ll use the FANG data set, but you typically will use tq_get to retrieve data in “tibble” format.

data("FANG")
FANG
## # A tibble: 4,032 x 8
##    symbol       date  open  high   low close    volume adjusted
##     <chr>     <date> <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1     FB 2013-01-02 27.44 28.18 27.42 28.00  69846400    28.00
##  2     FB 2013-01-03 27.88 28.47 27.59 27.77  63140600    27.77
##  3     FB 2013-01-04 28.01 28.93 27.83 28.76  72715400    28.76
##  4     FB 2013-01-07 28.69 29.79 28.65 29.42  83781800    29.42
##  5     FB 2013-01-08 29.51 29.60 28.86 29.06  45871300    29.06
##  6     FB 2013-01-09 29.67 30.60 29.49 30.59 104787700    30.59
##  7     FB 2013-01-10 30.60 31.45 30.28 31.30  95316400    31.30
##  8     FB 2013-01-11 31.28 31.96 31.10 31.72  89598000    31.72
##  9     FB 2013-01-14 32.08 32.21 30.62 30.95  98892800    30.95
## 10     FB 2013-01-15 30.64 31.71 29.88 30.10 173242600    30.10
## # ... with 4,022 more rows

Second, use group_by to group by stock symbol. Third, apply the mutation. We can do this in one easy workflow. The periodReturns function is applied to each group of stock prices, and a new data frame was returned with the annual returns in the correct periodicity.

FANG_returns_yearly <- FANG %>%
    group_by(symbol) %>%
    tq_transmute(select     = adjusted, 
                 mutate_fun = periodReturn, 
                 period     = "yearly", 
                 col_rename = "yearly.returns") 

Last, we can visualize the returns.

FANG_returns_yearly %>%
    ggplot(aes(x = year(date), y = yearly.returns, fill = symbol)) +
    geom_bar(position = "dodge", stat = "identity") +
    labs(title = "FANG: Annual Returns", 
         subtitle = "Mutating at scale is quick and easy!",
         y = "Returns", x = "", color = "") +
    scale_y_continuous(labels = scales::percent) +
    theme_tq() +
    scale_fill_tq()

Modeling Financial Data using purrr

Eventually you will want to begin modeling (or more generally applying functions) at scale! One of the best features of the tidyverse is the ability to map functions to nested tibbles using purrr. From the Many Models chapter of “R for Data Science”, we can apply the same modeling workflow to financial analysis. Using a two step workflow:

  1. Model a single stock
  2. Scale to many stocks

Let’s go through an example to illustrate.

Example: Applying a Regression Model to Detect a Positive Trend

In this example, we’ll use a simple linear model to identify the trend in annual returns to determine if the stock returns are decreasing or increasing over time.

Analyze a Single Stock

First, let’s come up with a function to help us collect annual log returns. The function below performs three operations internally. It first gets the stock prices using tq_get(). Then, it mutates the stock prices to period returns using tq_transmute(). We add the type = "log" and period = "monthly" arguments to ensure we retrieve a tibble of monthly log returns. Last, we take the mean of the monthly returns to get MMLR.

get_annual_returns <- function(stock.symbol) {
    stock.symbol %>%
        tq_get(get  = "stock.prices",
               from = "2007-01-01",
               to   = "2016-12-31") %>%
        tq_transmute(select     = adjusted, 
                     mutate_fun = periodReturn, 
                     type       = "log", 
                     period     = "yearly")
}

Let’s test get_annual_returns out. We now have the annual log returns over the past ten years.

AAPL_annual_log_returns <- get_annual_returns("AAPL")
AAPL_annual_log_returns
## # A tibble: 10 x 2
##          date yearly.returns
##        <date>          <dbl>
##  1 2007-12-31     0.86023791
##  2 2008-12-31    -0.84191019
##  3 2009-12-31     0.90381706
##  4 2010-12-31     0.42571145
##  5 2011-12-30     0.22759782
##  6 2012-12-31     0.28191703
##  7 2013-12-31     0.07760429
##  8 2014-12-31     0.34090871
##  9 2015-12-31    -0.03060048
## 10 2016-12-30     0.11760900

Let’s visualize to identify trends. We can see from the linear trend line that AAPL’s stock returns are declining.

AAPL_annual_log_returns %>%
    ggplot(aes(x = year(date), y = yearly.returns)) + 
    geom_hline(yintercept = 0, color = palette_light()[[1]]) +
    geom_point(size = 2, color = palette_light()[[3]]) +
    geom_line(size = 1, color = palette_light()[[3]]) + 
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "AAPL: Visualizing Trends in Annual Returns",
         x = "", y = "Annual Returns", color = "") +
    theme_tq()

Now, we can get the linear model using the lm() function. However, there is one problem: the output is not “tidy”.

mod <- lm(yearly.returns ~ year(date), data = AAPL_annual_log_returns)
mod
## 
## Call:
## lm(formula = yearly.returns ~ year(date), data = AAPL_annual_log_returns)
## 
## Coefficients:
## (Intercept)   year(date)  
##    58.86281     -0.02915

We can utilize the broom package to get “tidy” data from the model. There’s three primary functions:

  1. augment: adds columns to the original data such as predictions, residuals and cluster assignments
  2. glance: provides a one-row summary of model-level statistics
  3. tidy: summarizes a model’s statistical findings such as coefficients of a regression

We’ll use tidy to retrieve the model coefficients.

library(broom)
tidy(mod)
##          term    estimate    std.error  statistic   p.value
## 1 (Intercept) 58.86281346 113.10732941  0.5204156 0.6168657
## 2  year(date) -0.02914567   0.05623028 -0.5183270 0.6182571

Adding to our workflow, we have the following:

get_model <- function(stock.symbol) {
    annual_returns <- get_annual_returns(stock.symbol)
    mod <- lm(yearly.returns ~ year(date), data = annual_returns)
    tidy(mod)
}

Testing it out on a single stock. We can see that the “term” that contains the direction of the trend (the slope) is “year(date)”. The interpetation is that as year increases one unit, the annual returns decrease by 3%.

get_model("AAPL")
##          term    estimate    std.error  statistic   p.value
## 1 (Intercept) 58.86281346 113.10732941  0.5204156 0.6168657
## 2  year(date) -0.02914567   0.05623028 -0.5183270 0.6182571

Now that we have identified the trend direction, it looks like we are ready to scale.

Scale to Many Stocks

Once the analysis for one stock is done scale to many stocks is simple. For brevity, we’ll randomly sample ten stocks from the S&P500 with a call to dplyr::sample_n().

set.seed(10)
stocks <- tq_index("SP500") %>%
    sample_n(5)
stocks
## # A tibble: 5 x 5
##   symbol                           company       weight
##    <chr>                             <chr>        <dbl>
## 1     DG        Dollar General Corporation 0.0008937877
## 2    STZ Constellation Brands Inc. Class A 0.0015736162
## 3    ZBH       Zimmer Biomet Holdings Inc. 0.0011148646
## 4    HLT     Hilton Worldwide Holdings Inc 0.0005953980
## 5   CELG               Celgene Corporation 0.0049637888
## # ... with 2 more variables: sector <chr>, shares_held <dbl>

We can now apply our analysis function to the stocks using dplyr::mutate and purrr::map. The mutate() function adds a column to our tibble, and the map() function maps our custom get_model function to our tibble of stocks using the symbol column. The tidyr::unnest function unrolls the nested data frame so all of the model statistics are accessable in the top data frame level. The filter, arrange and select steps just manipulate the data frame to isolate and arrange the data for our viewing.

stocks_model_stats <- stocks %>%
    mutate(model = map(symbol, get_model)) %>%
    unnest() %>%
    filter(term == "year(date)") %>%
    arrange(desc(estimate)) %>%
    select(-term)
stocks_model_stats
## # A tibble: 5 x 9
##   symbol                           company       weight
##    <chr>                             <chr>        <dbl>
## 1    STZ Constellation Brands Inc. Class A 0.0015736162
## 2    ZBH       Zimmer Biomet Holdings Inc. 0.0011148646
## 3    HLT     Hilton Worldwide Holdings Inc 0.0005953980
## 4   CELG               Celgene Corporation 0.0049637888
## 5     DG        Dollar General Corporation 0.0008937877
## # ... with 6 more variables: sector <chr>, shares_held <dbl>,
## #   estimate <dbl>, std.error <dbl>, statistic <dbl>, p.value <dbl>

We’re done! We now have the coefficient of the linear regression that tracks the direction of the trend line. We can easily extend this type of analysis to larger lists or stock indexes. For example, the entire S&P500 could be analyzed removing the sample_n() following the call to tq_index("SP500").

Error Handling when Scaling

Eventually you will run into a stock index, stock symbol, FRED data code, etc that cannot be retrieved. Possible reasons are:

This becomes painful when scaling if the functions return errors. So, the tq_get() function is designed to handle errors gracefully. What this means is an NA value is returned when an error is generated along with a gentle error warning.

tq_get("XYZ", "stock.prices")
## [1] NA

Pros and Cons to Built-In Error-Handling

There are pros and cons to this approach that you may not agree with, but I believe helps in the long run. Just be aware of what happens:

Bad Apples Fail Gracefully, tq_get

Let’s see an example when using tq_get() to get the stock prices for a long list of stocks with one BAD APPLE. The argument complete_cases comes in handy. The default is TRUE, which removes “bad apples” so future analysis have complete cases to compute on. Note that a gentle warning stating that an error occurred and was dealt with by removing the rows from the results.

c("AAPL", "GOOG", "BAD APPLE") %>%
    tq_get(get = "stock.prices", complete_cases = TRUE)
## Warning: x = 'BAD APPLE', get = 'stock.prices': Error: BAD APPLE download failed after two attempts. Error message:
## HTTP error 400.
##  Removing BAD APPLE.
## # A tibble: 5,330 x 8
##    symbol       date   open   high    low    close    volume adjusted
##     <chr>     <date>  <dbl>  <dbl>  <dbl>    <dbl>     <dbl>    <dbl>
##  1   AAPL 2007-01-03 13.648 13.694 12.954 11.97143 309579900 10.81246
##  2   AAPL 2007-01-04 13.294 13.595 13.258 12.23714 211815100 11.05245
##  3   AAPL 2007-01-05 13.566 13.634 13.350 12.15000 208685400 10.97374
##  4   AAPL 2007-01-08 13.596 13.686 13.489 12.21000 199276700 11.02793
##  5   AAPL 2007-01-09 13.674 14.707 13.468 13.22429 837324600 11.94403
##  6   AAPL 2007-01-10 14.987 15.469 14.781 13.85714 738220000 12.51562
##  7   AAPL 2007-01-11 15.175 15.308 15.042 13.68572 360063200 12.36079
##  8   AAPL 2007-01-12 14.961 15.036 14.746 13.51714 328172600 12.20853
##  9   AAPL 2007-01-16 15.134 15.382 15.097 13.87143 311019100 12.52852
## 10   AAPL 2007-01-17 15.431 15.437 14.998 13.56428 411565000 12.25111
## # ... with 5,320 more rows

Now switching complete_cases = FALSE will retain any errors as NA values in a nested data frame. Notice that the error message and output change. The error message now states that the NA values exist in the output and the return is a “nested” data structure.

c("AAPL", "GOOG", "BAD APPLE") %>%
    tq_get(get = "stock.prices", complete_cases = FALSE)
## Warning: x = 'BAD APPLE', get = 'stock.prices': Error: BAD APPLE download failed after two attempts. Error message:
## HTTP error 400.
## Warning in value[[3L]](cond): Returning as nested data frame.
## # A tibble: 3 x 2
##      symbol         stock.prices
##       <chr>               <list>
## 1      AAPL <tibble [2,665 x 7]>
## 2      GOOG <tibble [2,665 x 7]>
## 3 BAD APPLE            <lgl [1]>

In both cases, the prudent user will review the warnings to determine what happened and whether or not this is acceptable. In the complete_cases = FALSE example, if the user attempts to perform downstream computations at scale, the computations will likely fail grinding the analysis to a hault. But, the advantage is that the user will more easily be able to filter to the problem childs to determine what happened and decide whether this is acceptable or not.