Designed for the data science workflow of the
tidyverse
The greatest benefit to tidyquant
is the ability to apply the data science workflow to easily model and scale your financial analysis as described in R for Data Science. Scaling is the process of creating an analysis for one asset and then extending it to multiple groups. This idea of scaling is incredibly useful to financial analysts because typically one wants to compare many assets to make informed decisions. Fortunately, the tidyquant
package integrates with the tidyverse
making scaling super simple!
All tidyquant
functions return data in the tibble
(tidy data frame) format, which allows for interaction within the tidyverse
. This means we can:
%>%
) for chaining operationsdplyr
and tidyr
: select
, filter
, group_by
, nest
/unnest
, spread
/gather
, etcpurrr
: mapping functions with map
We’ll go through some useful techniques for getting and manipulating groups of data.
Load the tidyquant
package to get started.
# Loads tidyquant, tidyverse, lubridate, xts, quantmod, TTR
library(tidyquant)
A very basic example is retrieving the stock prices for multiple stocks. There are three primary ways to do this:
c("AAPL", "GOOG", "FB") %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01")
## # A tibble: 756 x 8
## symbol date open high low close volume adjusted
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 2016-01-04 103 105 102 105 67649400 101
## 2 AAPL 2016-01-05 106 106 102 103 55791000 98.5
## 3 AAPL 2016-01-06 101 102 99.9 101 68457400 96.6
## 4 AAPL 2016-01-07 98.7 100 96.4 96.4 81094400 92.5
## 5 AAPL 2016-01-08 98.6 99.1 96.8 97.0 70798000 93.0
## 6 AAPL 2016-01-11 99.0 99.1 97.3 98.5 49739400 94.5
## 7 AAPL 2016-01-12 101 101 98.8 100.0 49154200 95.8
## 8 AAPL 2016-01-13 100 101 97.3 97.4 62439600 93.4
## 9 AAPL 2016-01-14 98.0 100 95.7 99.5 63170100 95.4
## 10 AAPL 2016-01-15 96.2 97.7 95.4 97.1 79010000 93.1
## # ... with 746 more rows
The output is a single level tibble with all or the stock prices in one tibble. The auto-generated column name is “symbol”, which can be pre-emptively renamed by giving the vector a name (e.g. stocks <- c("AAPL", "GOOG", "FB")
) and then piping to tq_get
.
First, get a stock list in data frame format either by making the tibble or retrieving from tq_index
/ tq_exchange
. The stock symbols must be in the first column.
stock_list <- tibble(stocks = c("AAPL", "JPM", "CVX"),
industry = c("Technology", "Financial", "Energy"))
stock_list
## # A tibble: 3 x 2
## stocks industry
## <chr> <chr>
## 1 AAPL Technology
## 2 JPM Financial
## 3 CVX Energy
Second, send the stock list to tq_get
. Notice how the symbol and industry columns are automatically expanded the length of the stock prices.
stock_list %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01")
## # A tibble: 756 x 9
## stocks industry date open high low close volume adjusted
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL Technology 2016-01-04 103 105 102 105 67649400 101
## 2 AAPL Technology 2016-01-05 106 106 102 103 55791000 98.5
## 3 AAPL Technology 2016-01-06 101 102 99.9 101 68457400 96.6
## 4 AAPL Technology 2016-01-07 98.7 100 96.4 96.4 81094400 92.5
## 5 AAPL Technology 2016-01-08 98.6 99.1 96.8 97.0 70798000 93.0
## 6 AAPL Technology 2016-01-11 99.0 99.1 97.3 98.5 49739400 94.5
## 7 AAPL Technology 2016-01-12 101 101 98.8 100.0 49154200 95.8
## 8 AAPL Technology 2016-01-13 100 101 97.3 97.4 62439600 93.4
## 9 AAPL Technology 2016-01-14 98.0 100 95.7 99.5 63170100 95.4
## 10 AAPL Technology 2016-01-15 96.2 97.7 95.4 97.1 79010000 93.1
## # ... with 746 more rows
Get an index…
tq_index("DOW", use_fallback = TRUE)
## # A tibble: 30 x 5
## symbol company weight sector shares_held
## <chr> <chr> <dbl> <chr> <dbl>
## 1 GS Goldman Sachs Group Inc. 0.0695 Financials 5288604
## 2 MMM 3M Company 0.0681 Industrials 5288606
## 3 BA Boeing Company 0.0647 Industrials 5288604
## 4 UNH UnitedHealth Group Incorpora… 0.0593 Health Care 5288606
## 5 MCD McDonald's Corporation 0.0495 Consumer Discr… 5288606
## 6 IBM International Business Machi… 0.0493 Information Te… 5288604
## 7 HD Home Depot Inc. 0.0484 Consumer Discr… 5288606
## 8 AAPL Apple Inc. 0.0468 Information Te… 5288604
## 9 JNJ Johnson & Johnson 0.0437 Health Care 5288604
## 10 TRV Travelers Companies Inc. 0.0409 Financials 5288604
## # ... with 20 more rows
…or, get an exchange.
tq_exchange("NYSE")
## # A tibble: 3,144 x 7
## symbol company last.sale.price market.cap ipo.year sector industry
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 DDD 3D Syste… 10.2 $1.16B NA Techn… Computer S…
## 2 MMM 3M Compa… 237 $140.94B NA Healt… Medical/De…
## 3 WBAI 500.com … 15.6 $638.19M 2013 Consu… Services-M…
## 4 WUBA 58.com I… 78.3 $11.47B 2013 Techn… Computer S…
## 5 EGHT 8x8 Inc 18.5 $1.71B NA Publi… Telecommun…
## 6 AHC A.H. Bel… 5.25 $114.2M NA Consu… Newspapers…
## 7 ATEN A10 Netw… 6.35 $449.27M 2014 Techn… Computer C…
## 8 AAC AAC Hold… 8.32 $199.67M 2014 Healt… Medical Sp…
## 9 AIR AAR Corp. 42.5 $1.48B NA Capit… Aerospace
## 10 AAN Aaron… 43.4 $3.15B NA Techn… Diversifie…
## # ... with 3,134 more rows
Send the index or exchange to tq_get
. Important Note: This can take several minutes depending on the size of the index or exchange, which is why only the first three stocks are evaluated in the vignette.
tq_index("DOW", use_fallback = TRUE) %>%
slice(1:3) %>%
tq_get(get = "stock.prices")
## # A tibble: 7,653 x 12
## symbol company weight sector shares_held date open high low
## <chr> <chr> <dbl> <chr> <dbl> <date> <dbl> <dbl> <dbl>
## 1 GS Goldman … 0.0695 Finan… 5288604 2008-01-02 215 215 207
## 2 GS Goldman … 0.0695 Finan… 5288604 2008-01-03 209 209 204
## 3 GS Goldman … 0.0695 Finan… 5288604 2008-01-04 201 204 198
## 4 GS Goldman … 0.0695 Finan… 5288604 2008-01-07 200 200 190
## 5 GS Goldman … 0.0695 Finan… 5288604 2008-01-08 196 200 189
## 6 GS Goldman … 0.0695 Finan… 5288604 2008-01-09 189 192 184
## 7 GS Goldman … 0.0695 Finan… 5288604 2008-01-10 189 199 187
## 8 GS Goldman … 0.0695 Finan… 5288604 2008-01-11 197 201 195
## 9 GS Goldman … 0.0695 Finan… 5288604 2008-01-14 200 202 199
## 10 GS Goldman … 0.0695 Finan… 5288604 2008-01-15 200 200 192
## # ... with 7,643 more rows, and 3 more variables: close <dbl>,
## # volume <dbl>, adjusted <dbl>
You can use any applicable “getter” to get data for every stock in an index or an exchange! This includes: “stock.prices”, “key.ratios”, “key.stats”, “financials”, and more.
We can pipe a tibble of stock symbols to a mutation that maps the tq_get(get = "stock.prices")
function. The result is all of the stock prices in nested format.
tibble(symbol = c("AAPL", "GOOG", "AMZN", "FB")) %>%
mutate(stock.prices = map(.x = symbol, ~ tq_get(.x, get = "stock.prices")))
## # A tibble: 4 x 2
## symbol stock.prices
## <chr> <list>
## 1 AAPL <tibble [2,551 × 7]>
## 2 GOOG <tibble [2,551 × 7]>
## 3 AMZN <tibble [2,551 × 7]>
## 4 FB <tibble [1,447 × 7]>
In financial analysis, it’s very common to need data from various sources to combine in an analysis. For this reason multiple get
options (“compound getters”) can be used to return a “compound get”. A quick example:
c("AAPL", "GOOG") %>%
tq_get(get = c("stock.prices", "financials"))
## # A tibble: 2 x 3
## symbol stock.prices financials
## <chr> <list> <list>
## 1 AAPL <tibble [2,551 × 7]> <tibble [3 × 3]>
## 2 GOOG <tibble [2,551 × 7]> <tibble [3 × 3]>
This returns the stock prices and financials for each stock as one nested data frame! Any of the get
options that accept stock symbols can be used in this manner: "stock.prices"
, "financials"
, "key.ratios"
, "key.stats"
, "dividends"
, and "splits"
.
Once you get the data, you typically want to do something with it. You can easily do this at scale. Let’s get the yearly returns for multiple stocks using tq_transmute
. First, get the prices. We’ll use the FANG
data set, but you typically will use tq_get
to retrieve data in “tibble” format.
data("FANG")
FANG
## # A tibble: 4,032 x 8
## symbol date open high low close volume adjusted
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FB 2013-01-02 27.4 28.2 27.4 28.0 69846400 28.0
## 2 FB 2013-01-03 27.9 28.5 27.6 27.8 63140600 27.8
## 3 FB 2013-01-04 28.0 28.9 27.8 28.8 72715400 28.8
## 4 FB 2013-01-07 28.7 29.8 28.6 29.4 83781800 29.4
## 5 FB 2013-01-08 29.5 29.6 28.9 29.1 45871300 29.1
## 6 FB 2013-01-09 29.7 30.6 29.5 30.6 104787700 30.6
## 7 FB 2013-01-10 30.6 31.5 30.3 31.3 95316400 31.3
## 8 FB 2013-01-11 31.3 32.0 31.1 31.7 89598000 31.7
## 9 FB 2013-01-14 32.1 32.2 30.6 31.0 98892800 31.0
## 10 FB 2013-01-15 30.6 31.7 29.9 30.1 173242600 30.1
## # ... with 4,022 more rows
Second, use group_by
to group by stock symbol. Third, apply the mutation. We can do this in one easy workflow. The periodReturns
function is applied to each group of stock prices, and a new data frame was returned with the annual returns in the correct periodicity.
FANG_returns_yearly <- FANG %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
col_rename = "yearly.returns")
Last, we can visualize the returns.
FANG_returns_yearly %>%
ggplot(aes(x = year(date), y = yearly.returns, fill = symbol)) +
geom_bar(position = "dodge", stat = "identity") +
labs(title = "FANG: Annual Returns",
subtitle = "Mutating at scale is quick and easy!",
y = "Returns", x = "", color = "") +
scale_y_continuous(labels = scales::percent) +
theme_tq() +
scale_fill_tq()
Eventually you will want to begin modeling (or more generally applying functions) at scale! One of the best features of the tidyverse
is the ability to map functions to nested tibbles using purrr
. From the Many Models chapter of “R for Data Science”, we can apply the same modeling workflow to financial analysis. Using a two step workflow:
Let’s go through an example to illustrate.
In this example, we’ll use a simple linear model to identify the trend in annual returns to determine if the stock returns are decreasing or increasing over time.
First, let’s come up with a function to help us collect annual log returns. The function below performs three operations internally. It first gets the stock prices using tq_get()
. Then, it mutates the stock prices to period returns using tq_transmute()
. We add the type = "log"
and period = "monthly"
arguments to ensure we retrieve a tibble of monthly log returns. Last, we take the mean of the monthly returns to get MMLR.
get_annual_returns <- function(stock.symbol) {
stock.symbol %>%
tq_get(get = "stock.prices",
from = "2007-01-01",
to = "2016-12-31") %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
type = "log",
period = "yearly")
}
Let’s test get_annual_returns
out. We now have the annual log returns over the past ten years.
AAPL_annual_log_returns <- get_annual_returns("AAPL")
AAPL_annual_log_returns
## # A tibble: 10 x 2
## date yearly.returns
## <date> <dbl>
## 1 2007-12-31 0.860
## 2 2008-12-31 -0.842
## 3 2009-12-31 0.904
## 4 2010-12-31 0.426
## 5 2011-12-30 0.228
## 6 2012-12-31 0.336
## 7 2013-12-31 0.230
## 8 2014-12-31 0.412
## 9 2015-12-31 -0.0306
## 10 2016-12-30 0.118
Let’s visualize to identify trends. We can see from the linear trend line that AAPL’s stock returns are declining.
AAPL_annual_log_returns %>%
ggplot(aes(x = year(date), y = yearly.returns)) +
geom_hline(yintercept = 0, color = palette_light()[[1]]) +
geom_point(size = 2, color = palette_light()[[3]]) +
geom_line(size = 1, color = palette_light()[[3]]) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "AAPL: Visualizing Trends in Annual Returns",
x = "", y = "Annual Returns", color = "") +
theme_tq()
Now, we can get the linear model using the lm()
function. However, there is one problem: the output is not “tidy”.
mod <- lm(yearly.returns ~ year(date), data = AAPL_annual_log_returns)
mod
##
## Call:
## lm(formula = yearly.returns ~ year(date), data = AAPL_annual_log_returns)
##
## Coefficients:
## (Intercept) year(date)
## 48.36001 -0.02391
We can utilize the broom
package to get “tidy” data from the model. There’s three primary functions:
augment
: adds columns to the original data such as predictions, residuals and cluster assignmentsglance
: provides a one-row summary of model-level statisticstidy
: summarizes a model’s statistical findings such as coefficients of a regressionWe’ll use tidy
to retrieve the model coefficients.
library(broom)
tidy(mod)
## term estimate std.error statistic p.value
## 1 (Intercept) 48.36001405 113.58647880 0.4257550 0.6815091
## 2 year(date) -0.02391054 0.05646849 -0.4234316 0.6831346
Adding to our workflow, we have the following:
get_model <- function(stock.symbol) {
annual_returns <- get_annual_returns(stock.symbol)
mod <- lm(yearly.returns ~ year(date), data = annual_returns)
tidy(mod)
}
Testing it out on a single stock. We can see that the “term” that contains the direction of the trend (the slope) is “year(date)”. The interpetation is that as year increases one unit, the annual returns decrease by 3%.
get_model("AAPL")
## term estimate std.error statistic p.value
## 1 (Intercept) 48.36001405 113.58647880 0.4257550 0.6815091
## 2 year(date) -0.02391054 0.05646849 -0.4234316 0.6831346
Now that we have identified the trend direction, it looks like we are ready to scale.
Once the analysis for one stock is done scale to many stocks is simple. For brevity, we’ll randomly sample ten stocks from the S&P500 with a call to dplyr::sample_n()
.
set.seed(10)
stocks <- tq_index("SP500", use_fallback = TRUE) %>%
sample_n(5)
stocks
## # A tibble: 5 x 5
## symbol company weight sector shares_held
## <chr> <chr> <dbl> <chr> <dbl>
## 1 WLTW Willis Towers Watson Public … 8.95e⁻⁴ Financials 1476521
## 2 EXC Exelon Corporation 1.62e⁻³ Utilities 10625883
## 3 CBS CBS Corporation Class B 1.13e⁻³ Consumer Discr… 4263471
## 4 FTI TechnipFMC Plc 5.97e⁻⁴ Energy 5362002
## 5 HON Honeywell International Inc. 4.89e⁻³ Industrials 8746874
We can now apply our analysis function to the stocks using dplyr::mutate
and purrr::map
. The mutate()
function adds a column to our tibble, and the map()
function maps our custom get_model
function to our tibble of stocks using the symbol
column. The tidyr::unnest
function unrolls the nested data frame so all of the model statistics are accessable in the top data frame level. The filter
, arrange
and select
steps just manipulate the data frame to isolate and arrange the data for our viewing.
stocks_model_stats <- stocks %>%
mutate(model = map(symbol, get_model)) %>%
unnest() %>%
filter(term == "year(date)") %>%
arrange(desc(estimate)) %>%
select(-term)
stocks_model_stats
## # A tibble: 5 x 9
## symbol company weight sector shares_held estimate std.error statistic
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CBS CBS Corp… 1.13e⁻³ Consu… 4263471 0.0468 0.0562 0.833
## 2 WLTW Willis T… 8.95e⁻⁴ Finan… 1476521 0.0189 0.0232 0.815
## 3 HON Honeywel… 4.89e⁻³ Indus… 8746874 0.0162 0.0322 0.504
## 4 EXC Exelon C… 1.62e⁻³ Utili… 10625883 0.0158 0.0296 0.532
## 5 FTI TechnipF… 5.97e⁻⁴ Energy 5362002 -0.0393 0.0574 -0.685
## # ... with 1 more variable: p.value <dbl>
We’re done! We now have the coefficient of the linear regression that tracks the direction of the trend line. We can easily extend this type of analysis to larger lists or stock indexes. For example, the entire S&P500 could be analyzed removing the sample_n()
following the call to tq_index("SP500")
.
Eventually you will run into a stock index, stock symbol, FRED data code, etc that cannot be retrieved. Possible reasons are:
This becomes painful when scaling if the functions return errors. So, the tq_get()
function is designed to handle errors gracefully. What this means is an NA
value is returned when an error is generated along with a gentle error warning.
tq_get("XYZ", "stock.prices")
## [1] NA
There are pros and cons to this approach that you may not agree with, but I believe helps in the long run. Just be aware of what happens:
Pros: Long running scripts are not interrupted because of one error
Cons: Errors can be inadvertently handled or flow downstream if the users does not read the warnings
Let’s see an example when using tq_get()
to get the stock prices for a long list of stocks with one BAD APPLE
. The argument complete_cases
comes in handy. The default is TRUE
, which removes “bad apples” so future analysis have complete cases to compute on. Note that a gentle warning stating that an error occurred and was dealt with by removing the rows from the results.
c("AAPL", "GOOG", "BAD APPLE") %>%
tq_get(get = "stock.prices", complete_cases = TRUE)
## Warning: x = 'BAD APPLE', get = 'stock.prices': Error: BAD APPLE download failed after two attempts. Error message:
## HTTP error 400.
## Removing BAD APPLE.
## # A tibble: 5,102 x 8
## symbol date open high low close volume adjusted
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 2008-01-02 28.5 28.6 27.5 27.8 269794700 18.8
## 2 AAPL 2008-01-03 27.9 28.2 27.5 27.8 210516600 18.9
## 3 AAPL 2008-01-04 27.4 27.6 25.6 25.7 363958000 17.4
## 4 AAPL 2008-01-07 25.9 26.2 24.3 25.4 518048300 17.2
## 5 AAPL 2008-01-08 25.7 26.1 24.4 24.5 380954000 16.6
## 6 AAPL 2008-01-09 24.5 25.6 24.0 25.6 453470500 17.3
## 7 AAPL 2008-01-10 25.4 25.9 25.1 25.4 370743800 17.2
## 8 AAPL 2008-01-11 25.1 25.4 24.3 24.7 308071400 16.7
## 9 AAPL 2008-01-14 25.4 25.6 25.0 25.5 275112600 17.3
## 10 AAPL 2008-01-15 25.4 25.6 23.5 24.1 585819500 16.3
## # ... with 5,092 more rows
Now switching complete_cases = FALSE
will retain any errors as NA
values in a nested data frame. Notice that the error message and output change. The error message now states that the NA
values exist in the output and the return is a “nested” data structure.
c("AAPL", "GOOG", "BAD APPLE") %>%
tq_get(get = "stock.prices", complete_cases = FALSE)
## Warning: x = 'BAD APPLE', get = 'stock.prices': Error: BAD APPLE download failed after two attempts. Error message:
## HTTP error 400.
## Warning in value[[3L]](cond): Returning as nested data frame.
## # A tibble: 3 x 2
## symbol stock.prices
## <chr> <list>
## 1 AAPL <tibble [2,551 × 7]>
## 2 GOOG <tibble [2,551 × 7]>
## 3 BAD APPLE <lgl [1]>
In both cases, the prudent user will review the warnings to determine what happened and whether or not this is acceptable. In the complete_cases = FALSE
example, if the user attempts to perform downstream computations at scale, the computations will likely fail grinding the analysis to a hault. But, the advantage is that the user will more easily be able to filter to the problem childs to determine what happened and decide whether this is acceptable or not.