Column formats

library(tibble)

Overview

This vignette shows how to decorate columns for custom formatting. We use the formattable package for demonstration because it already contains useful vector classes that apply a custom formatting to numbers.

library(formattable)

tbl <- tibble(x = digits(9:11, 3))
tbl
#> # A tibble: 3 x 1
#>   x         
#>   <formttbl>
#> 1 9.000     
#> 2 10.000    
#> 3 11.000

The x column in the tibble above is a regular number with a formatting method. It always will be shown with three digits after the decimal point. This also applies to columns derived from x.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tbl2 <- 
  tbl %>%
  mutate(
    y = x + 1, 
    z = x * x, 
    v = y + z,
    lag = lag(x, default = x[[1]]),
    sin = sin(x),
    mean = mean(v),
    var = var(x)
  )

tbl2
#> # A tibble: 3 x 8
#>           x         y         z         v       lag       sin      mean   var
#>   <dbl:fmt> <dbl:fmt> <dbl:fmt> <dbl:fmt> <dbl:fmt> <dbl:fmt> <dbl:fmt> <dbl>
#> 1     9.000    10.000    81.000    91.000     9.000     0.412   111.667     1
#> 2    10.000    11.000   100.000   111.000     9.000    -0.544   111.667     1
#> 3    11.000    12.000   121.000   133.000    10.000    -1.000   111.667     1

Summaries also maintain the formatting.

tbl2 %>% 
  group_by(lag) %>% 
  summarize(z = mean(z)) %>% 
  ungroup()
#> `summarise()` has ungrouped output. You can override using the `.groups` argument.
#> # A tibble: 2 x 2
#>         lag         z
#> * <dbl:fmt> <dbl:fmt>
#> 1     9.000    90.500
#> 2    10.000   121.000

Same for pivoting operations.

library(tidyr)

stocks <- 
  expand_grid(id = factor(1:4), year = 2018:2022) %>% 
  mutate(stock = currency(runif(20) * 10000))

stocks %>% 
  pivot_wider(id, names_from = year, values_from = stock)
#> # A tibble: 4 x 6
#>   id       `2018`    `2019`    `2020`    `2021`    `2022`
#>   <fct> <dbl:fmt> <dbl:fmt> <dbl:fmt> <dbl:fmt> <dbl:fmt>
#> 1 1     $6,507.70   $431.66 $5,859.09 $5,571.28 $6,061.56
#> 2 2       $437.98 $1,747.83 $7,210.73 $2,276.47   $376.59
#> 3 3     $7,083.80 $2,949.92 $3,025.57 $9,393.45    $69.33
#> 4 4     $1,954.84 $7,156.63 $9,141.02 $7,885.18 $9,380.82

For ggplot2 we need to do some work to show apply the formatting to the scales.

library(ggplot2)

# Needs https://github.com/tidyverse/ggplot2/pull/4065 or similar
stocks %>% 
  ggplot(aes(x = year, y = stock, color = id)) +
  geom_line()

It pays off to specify formatting very early in the process. The diagram below shows the principal stages of data analysis and exploration from “R for data science”.

The subsequent diagram adds data formats, communication options, and explicit data formatting. The original r4ds transitions are highlighted in bold. There are two principal options where to apply formatting for results: right before communicating them, or right after importing.

Applying formatting early in the process gives the added benefit of showing the data in a useful format during the “Tidy”, “Transform”, and “Visualize” stages. For this to be useful, we need to ensure that the formatting options applied early:

Ensuring stickiness is difficult, and is insufficient for a dbplyr workflow where parts of the “Tidy”, “Transform” or even “Visualize” stages are run on the database. Often it’s possible to derive a rule-based approach for formatting.

tbl3 <- 
  tibble(id = letters[1:3], x = 9:11) %>% 
  mutate(
    y = x + 1, 
    z = x * x, 
    v = y + z,
    lag = lag(x, default = x[[1]]),
    sin = sin(x),
    mean = mean(v),
    var = var(x)
  )

tbl3
#> # A tibble: 3 x 9
#>   id        x     y     z     v   lag    sin  mean   var
#>   <chr> <int> <dbl> <int> <dbl> <int>  <dbl> <dbl> <dbl>
#> 1 a         9    10    81    91     9  0.412  112.     1
#> 2 b        10    11   100   111     9 -0.544  112.     1
#> 3 c        11    12   121   133    10 -1.00   112.     1

tbl3 %>% 
  mutate(
    across(where(is.numeric), digits, 3),
    across(where(~ is.numeric(.x) && mean(.x) > 50), digits, 1)
  )
#> # A tibble: 3 x 9
#>   id            x        y        z        v      lag      sin     mean      var
#>   <chr> <dbl:fmt> <dbl:fm> <dbl:fm> <dbl:fm> <dbl:fm> <dbl:fm> <dbl:fm> <dbl:fm>
#> 1 a         9.000   10.000     81.0     91.0    9.000    0.412    111.7    1.000
#> 2 b        10.000   11.000    100.0    111.0    9.000   -0.544    111.7    1.000
#> 3 c        11.000   12.000    121.0    133.0   10.000   -1.000    111.7    1.000

These rules can be stored in quos():

rules <- quos(
  across(where(is.numeric), digits, 3),
  across(where(~ is.numeric(.x) && mean(.x) > 50), digits, 1)
)

tbl3 %>% 
  mutate(!!!rules)
#> # A tibble: 3 x 9
#>   id            x        y        z        v      lag      sin     mean      var
#>   <chr> <dbl:fmt> <dbl:fm> <dbl:fm> <dbl:fm> <dbl:fm> <dbl:fm> <dbl:fm> <dbl:fm>
#> 1 a         9.000   10.000     81.0     91.0    9.000    0.412    111.7    1.000
#> 2 b        10.000   11.000    100.0    111.0    9.000   -0.544    111.7    1.000
#> 3 c        11.000   12.000    121.0    133.0   10.000   -1.000    111.7    1.000

This poses a few drawbacks:

What would a good API for rule-based formatting look like?