Using corrr

Simon Jackson

2016-10-05

corrr is a package for exploring correlations in R. It makes it possible to easily perform routine tasks when exploring correlation matrices such as ignoring the diagonal, focusing on the correlations of certain variables against others, or rearranging and visualising the matrix in terms of the strength of the correlations.

Using corrr

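Using corrr starts with correlate(), which acts like the base correlation function cor(). It differs by defaulting to pairwise deletion and by returning a correlation data frame (cor_df): a tbl with an additional cor_df class, an extra rowname column, and the diagonal (each variable’s correlation with itself) set to missing values (NA) so that it can be ignored.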

To explore this further, let’s create a correlation data frame from the mtcars data that comes with R, using correlate():

library(corrr)
d <- correlate(mtcars)
d
#> # A tibble: 11 × 12
#>    rowname        mpg        cyl       disp         hp        drat
#>      <chr>      <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
#> 1      mpg         NA -0.8521620 -0.8475514 -0.7761684  0.68117191
#> 2      cyl -0.8521620         NA  0.9020329  0.8324475 -0.69993811
#> 3     disp -0.8475514  0.9020329         NA  0.7909486 -0.71021393
#> 4       hp -0.7761684  0.8324475  0.7909486         NA -0.44875912
#> 5     drat  0.6811719 -0.6999381 -0.7102139 -0.4487591          NA
#> 6       wt -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065
#> 7     qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476
#> 8       vs  0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846
#> 9       am  0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113
#> 10    gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013
#> 11    carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980
#> # ... with 6 more variables: wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>,
#> #   gear <dbl>, carb <dbl>
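
To make that structure concrete, here is a rough base-R sketch of the same shape: pairwise-complete correlations, a blanked diagonal, and a leading rowname column. It is an illustration only, not how corrr builds its cor_df internally. The name d_base is made up for this example, and the result is a plain data frame without the cor_df class.

```r
# Base-R sketch of the cor_df layout (an assumption-laden illustration,
# not corrr's actual implementation).
m <- cor(mtcars, use = "pairwise.complete.obs")  # pairwise deletion
diag(m) <- NA                                    # blank out the diagonal

# Move the matrix row names into an ordinary column called "rowname"
d_base <- data.frame(rowname = rownames(m), m,
                     row.names = NULL, stringsAsFactors = FALSE)

d_base[1:3, 1:4]
```

Because mtcars has no missing values, pairwise deletion gives the same numbers here as the cor() default; the difference only shows up in data with NAs.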

Why a correlation data frame?

At first, a correlation data frame might seem like an unnecessary complexity compared to the traditional matrix. However, the purpose of corrr is to help us explore these correlations, not to do mathematical or statistical operations. Thus, by having the correlations in a data frame, we can make use of packages that help us work with data frames, like dplyr, tidyr and ggplot2, and focus on using data pipelines. Let’s look at some examples:

library(dplyr)

# Filter rows to those in which cyl has a correlation greater than .7 with
# another variable.
d %>% filter(cyl > .7)
#> # A tibble: 3 × 12
#>   rowname        mpg       cyl      disp        hp       drat        wt
#>     <chr>      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>     <dbl>
#> 1    disp -0.8475514 0.9020329        NA 0.7909486 -0.7102139 0.8879799
#> 2      hp -0.7761684 0.8324475 0.7909486        NA -0.4487591 0.6587479
#> 3      wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.7124406        NA
#> # ... with 5 more variables: qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>

# Select the mpg, cyl and disp columns (and rowname)
d %>% select(rowname, mpg, cyl, disp)
#> # A tibble: 11 × 4
#>    rowname        mpg        cyl       disp
#>      <chr>      <dbl>      <dbl>      <dbl>
#> 1      mpg         NA -0.8521620 -0.8475514
#> 2      cyl -0.8521620         NA  0.9020329
#> 3     disp -0.8475514  0.9020329         NA
#> 4       hp -0.7761684  0.8324475  0.7909486
#> 5     drat  0.6811719 -0.6999381 -0.7102139
#> 6       wt -0.8676594  0.7824958  0.8879799
#> 7     qsec  0.4186840 -0.5912421 -0.4336979
#> 8       vs  0.6640389 -0.8108118 -0.7104159
#> 9       am  0.5998324 -0.5226070 -0.5912270
#> 10    gear  0.4802848 -0.4926866 -0.5555692
#> 11    carb -0.5509251  0.5269883  0.3949769

# Combine above in a single pipeline
d %>%
  filter(cyl > .7) %>% 
  select(rowname, mpg, cyl, disp)
#> # A tibble: 3 × 4
#>   rowname        mpg       cyl      disp
#>     <chr>      <dbl>     <dbl>     <dbl>
#> 1    disp -0.8475514 0.9020329        NA
#> 2      hp -0.7761684 0.8324475 0.7909486
#> 3      wt -0.8676594 0.7824958 0.8879799
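
One thing to watch for: cyl > .7 keeps only strong positive correlations. If strong negative correlations matter too, filtering on the absolute value is one option. A small sketch follows. The name strong is just for illustration, and the first column is read by position because newer corrr releases call it term rather than rowname.

```r
library(corrr)
library(dplyr)

d <- correlate(mtcars)

# Keep rows where cyl correlates strongly in either direction.
# The cyl row itself drops out because abs(NA) > .7 evaluates to NA.
strong <- d %>% filter(abs(cyl) > .7)

# Labels of the variables that passed the filter (first column, by position)
strong[[1]]
```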

Furthermore, by having the diagonal set to missing, we don’t need to put in extra effort to ignore it when summarising the correlations. For example:

# Compute mean of each column
library(purrr)
d %>% select(-rowname) %>% map_dbl(~ mean(., na.rm = TRUE))
#>           mpg           cyl          disp            hp          drat 
#> -0.1050454113 -0.0925483176 -0.0872737071  0.0006800268 -0.0037165212 
#>            wt          qsec            vs            am          gear 
#> -0.0828684293 -0.1752247305 -0.1145625942  0.0053087327  0.0484120552 
#>          carb 
#>  0.0563419513
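
The same property helps with other summaries. As a sketch, the following finds, for each variable, the other variable it correlates with most strongly in absolute terms. which.max() skips the NA on the diagonal, so no special handling is needed. The name strongest is illustrative.

```r
library(corrr)

d <- correlate(mtcars)

# For every correlation column, look up the label (first column of d)
# of the row holding the largest absolute correlation.
strongest <- sapply(d[-1], function(r) d[[1]][which.max(abs(r))])
strongest
```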

API

As the above section suggests, the corrr API is designed with data pipelines in mind (e.g., to use %>% from the magrittr package). After correlate(), the primary corrr functions take a cor_df as their first argument, and return a cor_df or tbl (or output like a plot). These functions serve one of three purposes:

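Internal changes (cor_df out): shave() the upper or lower triangle (set to NA); rearrange() the columns and rows based on correlation strength.

Reshape structure (tbl or cor_df out): focus() on select columns and rows; stretch() into a long format.

Output/visualisations (console/plot out): fashion() the correlations for pretty printing; rplot() the correlations as a plot.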

By combining these functions in data pipelines, it’s possible to easily explore your correlations.

For example, let’s focus on the correlations of mpg and cyl with all the others:

d %>% focus(mpg, cyl)
#> # A tibble: 9 × 3
#>   rowname        mpg        cyl
#>     <chr>      <dbl>      <dbl>
#> 1    disp -0.8475514  0.9020329
#> 2      hp -0.7761684  0.8324475
#> 3    drat  0.6811719 -0.6999381
#> 4      wt -0.8676594  0.7824958
#> 5    qsec  0.4186840 -0.5912421
#> 6      vs  0.6640389 -0.8108118
#> 7      am  0.5998324 -0.5226070
#> 8    gear  0.4802848 -0.4926866
#> 9    carb -0.5509251  0.5269883

Or maybe we want to focus in on a few variables (mirrored in rows too) and print the correlations without an upper triangle and fashioned to look nice:

d %>%
  focus(mpg:drat, mirror = TRUE) %>%  # Focus only on mpg:drat
  shave() %>% # Remove the upper triangle
  fashion()   # Print in nice format 
#>   rowname  mpg  cyl disp   hp drat
#> 1     mpg                         
#> 2     cyl -.85                    
#> 3    disp -.85  .90               
#> 4      hp -.78  .83  .79          
#> 5    drat  .68 -.70 -.71 -.45

Alternatively, we can visualise these correlations (let’s clear the lower triangle for a change):

d %>%
  focus(mpg:drat, mirror = TRUE) %>%
  shave(upper = FALSE) %>%
  rplot()     # Plot

Perhaps we’d like to rearrange the correlations so that the plot becomes easier to interpret. In this case, we can add rearrange() into our pipeline before shaving one of the triangles (we’ll take correlation sign into account with absolute = FALSE).

d %>%
  focus(mpg:drat, mirror = TRUE) %>%
  rearrange(absolute = FALSE) %>% 
  shave() %>%
  rplot()
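
Reshaping is not limited to focus(). As a further sketch, assuming stretch() is available in your corrr version and returns one row per variable pair in columns x, y and r, the strongest relationships can be ranked with ordinary dplyr verbs. Each pair appears twice (once in each direction) because the full matrix is stretched. The name ranked is illustrative.

```r
library(corrr)
library(dplyr)

d <- correlate(mtcars)

ranked <- d %>%
  stretch() %>%              # long format: one row per (x, y) pair with r
  filter(!is.na(r)) %>%      # drop the diagonal entries
  arrange(desc(abs(r)))      # strongest first, ignoring sign

head(ranked)
```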

Other Resources

For more on how to use corrr, you’ll find plenty of posts explaining its functions at blogR. You can also keep up to date with these on Twitter by following @drsimonj.