Using seplyr to Program Over dplyr

John Mount

2018-02-21

seplyr is an R package that makes it easy to program over dplyr 0.7.*.

To illustrate this we will work an example.

Suppose you had worked out a dplyr pipeline that performed an analysis you were interested in. For an example we could take something similar to one of the examples from the dplyr 0.7.0 announcement.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.4'
cat(colnames(starwars), sep = '\n')
## name
## height
## mass
## hair_color
## skin_color
## eye_color
## birth_year
## gender
## homeworld
## species
## films
## vehicles
## starships
starwars %>%
  group_by(homeworld) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            mean_mass = mean(mass, na.rm = TRUE),
            count = n())
## # A tibble: 49 x 4
##    homeworld      mean_height mean_mass count
##    <chr>                <dbl>     <dbl> <int>
##  1 Alderaan             176        64.0     3
##  2 Aleen Minor           79.0      15.0     1
##  3 Bespin               175        79.0     1
##  4 Bestine IV           180       110       1
##  5 Cato Neimoidia       191        90.0     1
##  6 Cerea                198        82.0     1
##  7 Champala             196       NaN       1
##  8 Chandrila            150       NaN       1
##  9 Concord Dawn         183        79.0     1
## 10 Corellia             175        78.5     2
## # ... with 39 more rows

The above is colloquially called “an interactive script.” The name comes from the fact that we use names of variables (such as “homeworld”) that would only be known from looking at the data directly in the analysis code. Only somebody interacting with the data could write such a script (hence the name).

It has long been considered a point of discomfort to convert such an interactive dplyr pipeline into a re-usable script or function, that is, one that specifies column names in some parametric or re-usable fashion. Roughly, this means the names of the data columns are not yet known when we are writing the code (and this is what makes the code re-usable).

This inessential (or conquerable) difficulty is largely due to the preference for non-standard evaluation interfaces (that is, interfaces that capture and inspect un-evaluated expressions from their calling interface) in the design of dplyr.
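To make the distinction concrete, here is a minimal base R sketch (it does not use dplyr or seplyr). The helper names show_expr() and column_mean() and the small data frame are hypothetical, introduced only for illustration.

```r
# Non-standard evaluation: the function captures the expression the caller
# typed, without evaluating it.
show_expr <- function(x) {
  deparse(substitute(x))
}
captured <- show_expr(height + 1)  # `height` does not need to exist
print(captured)

# Standard evaluation: the function only sees the value of its argument,
# so a column name can be supplied through an ordinary variable.
column_mean <- function(data, column_name) {
  mean(data[[column_name]], na.rm = TRUE)
}
d <- data.frame(height = c(150, 170, NA))
chosen <- "height"  # the column name is data, not code
print(column_mean(d, chosen))
```

The second style is the one that is easy to wrap in re-usable functions, because the caller can compute or pass along the column name like any other value.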

seplyr is a dplyr adapter layer that prefers “slightly clunkier” standard interfaces (or referentially transparent interfaces), which are actually very powerful and can be used to some advantage.

The above description and comparisons can come off as needlessly broad and painfully abstract. Things are much clearer if we move away from theory and return to our practical example.

Let’s translate the above example into a re-usable function in small (easy) stages. First translate the interactive script from dplyr notation into seplyr notation. This step is a pure re-factoring: we are changing the code without changing its observable external behavior.

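The translation is mechanical in that it mostly uses the seplyr documentation as a lookup table. As the converted code shows, what you have to do is: swap each dplyr verb for its matching seplyr “_se” adapter, add quote marks around column names and expressions, wrap the sequence of summary expressions in c(), and replace “=” with “:=”.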

Our converted code looks like the following.

library("seplyr")

starwars %>%
  group_by_se("homeworld") %>%
  summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                 "mean_mass" := "mean(mass, na.rm = TRUE)",
                 "count" := "n()"))
## # A tibble: 49 x 4
##    homeworld      mean_height mean_mass count
##    <chr>                <dbl>     <dbl> <int>
##  1 Alderaan             176        64.0     3
##  2 Aleen Minor           79.0      15.0     1
##  3 Bespin               175        79.0     1
##  4 Bestine IV           180       110       1
##  5 Cato Neimoidia       191        90.0     1
##  6 Cerea                198        82.0     1
##  7 Champala             196       NaN       1
##  8 Chandrila            150       NaN       1
##  9 Concord Dawn         183        79.0     1
## 10 Corellia             175        78.5     2
## # ... with 39 more rows

This code works the same as the original dplyr code. Obviously, at this point all we have done is make the code a bit less pleasant looking. We have yet to see any benefit from this conversion (though we can turn this on its head and say that all the original dplyr notation saves us is having to write a few quote marks).
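One way to check that the re-factoring did not change behavior is to compute both results and compare them directly. This is only a sketch, assuming dplyr and seplyr are installed and the starwars data is available; the names res_dplyr and res_seplyr are introduced here for illustration.

```r
suppressPackageStartupMessages(library("dplyr"))
library("seplyr")

# Original interactive dplyr pipeline.
res_dplyr <- starwars %>%
  group_by(homeworld) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            mean_mass = mean(mass, na.rm = TRUE),
            count = n())

# The seplyr translation of the same pipeline.
res_seplyr <- starwars %>%
  group_by_se("homeworld") %>%
  summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                 "mean_mass" := "mean(mass, na.rm = TRUE)",
                 "count" := "n()"))

# Compare as plain data frames so the check does not depend on tibble details.
print(all.equal(as.data.frame(res_dplyr), as.data.frame(res_seplyr)))
```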

The benefit is: this new code can very easily be parameterized and wrapped in a re-usable function. In fact it is now simpler to do than to describe.

For example, suppose (as in the original example) we want to create a function that lets us choose the grouping variable. This is now easy: we copy the code into a function and replace the explicit value "homeworld" with a variable:

starwars_mean <- function(my_var) {
  starwars %>%
    group_by_se(my_var) %>%
    summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                   "mean_mass" := "mean(mass, na.rm = TRUE)",
                   "count" := "n()"))
}

starwars_mean("hair_color")
## # A tibble: 13 x 4
##    hair_color    mean_height mean_mass count
##    <chr>               <dbl>     <dbl> <int>
##  1 auburn                150     NaN       1
##  2 auburn, grey          180     NaN       1
##  3 auburn, white         182      77.0     1
##  4 black                 174      73.1    13
##  5 blond                 177      80.5     3
##  6 blonde                168      55.0     1
##  7 brown                 175      79.3    18
##  8 brown, grey           178     120       1
##  9 grey                  170      75.0     1
## 10 none                  181      78.5    37
## 11 unknown               NaN     NaN       1
## 12 white                 156      59.7     4
## 13 <NA>                  142     314       5
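Because the grouping column is now an ordinary value, it can be chosen at run time. The sketch below repeats the function definition so that it is self-contained, and it assumes that group_by_se() accepts a character vector with more than one column name; the variable names grouping_choice and by_species_gender are hypothetical.

```r
suppressPackageStartupMessages(library("dplyr"))
library("seplyr")

starwars_mean <- function(my_var) {
  starwars %>%
    group_by_se(my_var) %>%
    summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                   "mean_mass" := "mean(mass, na.rm = TRUE)",
                   "count" := "n()"))
}

# The grouping columns are plain strings, so they could just as well come
# from a configuration file or a function argument.
grouping_choice <- c("species", "gender")
by_species_gender <- starwars_mean(grouping_choice)
print(colnames(by_species_gender))
```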

In seplyr, programming is easy (just replace values with variables). For example, we can make a completely generic re-usable “grouped mean” function using R’s paste0() function to build up expressions.

grouped_mean <- function(data, 
                         grouping_variables, 
                         value_variables) {
  result_names <- paste0("mean_", 
                         value_variables)
  expressions <- paste0("mean(", 
                        value_variables, 
                        ", na.rm = TRUE)")
  calculation <- result_names := expressions
  print(as.list(calculation)) # print for demonstration
  data %>%
    group_by_se(grouping_variables) %>%
    summarize_se(c(calculation,
                   "count" := "n()"))
}

starwars %>% 
  grouped_mean(grouping_variables = "eye_color",
               value_variables = c("mass", "birth_year"))
## $mean_mass
## [1] "mean(mass, na.rm = TRUE)"
## 
## $mean_birth_year
## [1] "mean(birth_year, na.rm = TRUE)"
## # A tibble: 15 x 4
##    eye_color     mean_mass mean_birth_year count
##    <chr>             <dbl>           <dbl> <int>
##  1 black              76.3            33.0    10
##  2 blue               86.5            67.1    19
##  3 blue-gray          77.0            57.0     1
##  4 brown              66.1           109      21
##  5 dark              NaN             NaN       1
##  6 gold              NaN             NaN       1
##  7 green, yellow     159             NaN       1
##  8 hazel              66.0            34.5     3
##  9 orange            282             231       8
## 10 pink              NaN             NaN       1
## 11 red                81.4            33.7     5
## 12 red, blue         NaN             NaN       1
## 13 unknown            31.5           NaN       3
## 14 white              48.0           NaN       1
## 15 yellow             81.1            76.4    11

The only part that requires more study and practice is building up the expressions with paste0() (for more details on the string manipulation please try “help(paste)”). Notice also that we used the “:=” operator to bind the list of desired result names to the matching calculations (please see “help(named_map_builder)” for more details).
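To see what the string-building step produces on its own, the following base R sketch rebuilds the same names and expressions outside the function. It uses setNames() as a stand-in for “:=”; that substitution is an assumption based on the printed output above, which shows a set of expressions named by the result names.

```r
value_variables <- c("mass", "birth_year")

# Build the result column names and the matching expressions as strings.
result_names <- paste0("mean_", value_variables)
expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)")

# Attach the names to the expressions (base R stand-in for `:=`).
calculation <- setNames(expressions, result_names)
print(as.list(calculation))
```

Since paste0() is vectorized, adding another entry to value_variables adds another named expression without any other change to the code.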

The seplyr methodology is simple, easy to teach, and powerful. The package contains a number of worked examples both in help() and vignette(package='seplyr') documentation. For more details please also see: help(":=", package = "wrapr") and help("%.>%", package = "wrapr").