It is common for a manuscript to require a data summary table: simple summary statistics for the whole sample and for subgroups within it. Many tools are available to ease the work of building such tables. In my opinion, most of these tools carry enough implicit biases and nuances imposed by their authors that other users need not only to understand the tool, but also to think like its author. I find myself approaching many problems in ways that few others do. As a result, I needed my own tool for building data summary tables. I hope you like these tools and will be able to use them in your work.
This vignette presents the use of the summary_table, tab_summary, and qable functions for quickly building data summary tables. These functions implicitly use the mean_sd, median_iqr, and n_perc0 functions from qwraps2 as well.
We will use a modified version of the mtcars data set for examples throughout this vignette. The following packages are required to run the code in this vignette and to construct the mtcars2 data.frame.
The mtcars2 data frame will have three versions of the cyl vector: the original numeric values in cyl, a character version, and a factor version.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(qwraps2)
# define the markup language we are working in.
# options(qwraps2_markup = "latex") is also supported.
options(qwraps2_markup = "markdown")
data(mtcars)
mtcars2 <-
  dplyr::mutate(mtcars,
                cyl_factor = factor(cyl,
                                    levels = c(6, 4, 8),
                                    labels = paste(c(6, 4, 8), "cylinders")),
                cyl_character = paste(cyl, "cylinders"))
str(mtcars2)
## 'data.frame': 32 obs. of 13 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4 ...
## $ cyl_factor : Factor w/ 3 levels "6 cylinders",..: 1 1 2 1 3 1 3 2 2 1 ...
## $ cyl_character: chr "6 cylinders" "6 cylinders" "4 cylinders" "6 cylinders" ...
Notice that the cyl_factor and cyl_character vectors were constructed such that coercing cyl_character to a factor will not reproduce the cyl_factor vector: the levels are in a different order.
with(mtcars2, table(cyl_factor, cyl_character))
## cyl_character
## cyl_factor 4 cylinders 6 cylinders 8 cylinders
## 6 cylinders 0 7 0
## 4 cylinders 11 0 0
## 8 cylinders 0 0 14
with(mtcars2, all.equal(factor(cyl_character), cyl_factor))
## [1] "Attributes: < Component \"levels\": 2 string mismatches >"
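The mismatch comes from how factor() chooses the order of its levels. As a minimal base R sketch on a small made-up vector (toy_cyl is an illustration, not part of mtcars2): when no levels are supplied, factor() sorts the unique values, whereas the levels argument fixes the order explicitly.

```r
# Toy vector for illustration only; it is not part of mtcars2.
toy_cyl <- c(6, 4, 8, 6, 4)

# Default behavior: levels are the sorted unique values of the input.
default_levels <- levels(factor(paste(toy_cyl, "cylinders")))

# Explicit behavior: levels follow the order given in `levels`.
explicit_levels <-
  levels(factor(toy_cyl,
                levels = c(6, 4, 8),
                labels = paste(c(6, 4, 8), "cylinders")))

print(default_levels)
print(explicit_levels)
```

The level order matters later because it determines the column order of a grouped summary table.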
Let’s review some of the formatting functions provided by the qwraps2 package.
mean_sd will return the (arithmetic) mean and standard deviation of a numeric vector as a single formatted string; e.g., mean_sd(mtcars2$mpg) returns:
mean_sd(mtcars2$mpg)
## [1] "20.09 ± 6.03"
mean_sd(mtcars2$mpg, denote_sd = "paren")
## [1] "20.09 (6.03)"
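To make the formatting concrete, here is a rough base R sketch of the same idea. The helper mean_sd_sketch and its paren argument are hypothetical names used only for illustration; this is not the qwraps2 implementation, which has more options.

```r
# Hypothetical helper, not the qwraps2 implementation: format the mean and
# standard deviation of a numeric vector with a fixed number of decimals.
mean_sd_sketch <- function(x, digits = 2, paren = FALSE) {
  m <- formatC(mean(x), digits = digits, format = "f")
  s <- formatC(sd(x), digits = digits, format = "f")
  if (paren) {
    paste0(m, " (", s, ")")
  } else {
    paste0(m, " \u00b1 ", s)  # \u00b1 is the plus-minus sign
  }
}

mean_sd_sketch(mtcars$mpg)
mean_sd_sketch(mtcars$mpg, paren = TRUE)
```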
The default setting for mean_sd is to return the mean ± sd. In a table this default is helpful since the default formatting for counts and percentages is n (%).
The mean_sd function, like the other formatting functions, is helpful for in-line text too:
The `r nrow(mtcars2)` vehicles in the `mtcars` data set had an average fuel economy of `r mean_sd(mtcars$mpg)` miles per gallon.
produces
The 32 vehicles in the mtcars data set had an average fuel economy of 20.09 ± 6.03 miles per gallon.
If you need the mean and a confidence interval there is mean_ci. mean_ci returns a qwraps2_mean_ci object, which is a named vector with the mean, the lower confidence limit, and the upper confidence limit. The printing method for qwraps2_mean_ci objects is a call to the frmtci function. You can modify the formatting of the printed result by adjusting the arguments passed to frmtci.
mci <- mean_ci(mtcars2$mpg)
mci
## [1] "20.09 (18.00, 22.18)"
print(mci, show_level = TRUE)
## [1] "20.09 (95% CI: 18.00, 22.18)"
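The interval printed above is consistent with a normal-quantile confidence interval, mean ± z · sd / sqrt(n). The sketch below computes that interval in base R; mean_ci_sketch is a hypothetical helper, and the assumption that a normal quantile is used is inferred from the printed numbers rather than taken from the package source.

```r
# Hypothetical helper, not the qwraps2 implementation. Assumes a
# normal-quantile interval: mean +/- z * sd / sqrt(n).
mean_ci_sketch <- function(x, alpha = 0.05) {
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))
  z  <- qnorm(1 - alpha / 2)
  c(mean = m, lcl = m - z * se, ucl = m + z * se)
}

ci <- mean_ci_sketch(mtcars$mpg)
formatC(ci, digits = 2, format = "f")
```

A t-quantile interval would be slightly wider for a sample of this size, so check which quantile function your analysis calls for.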
Similar to the mean_sd function, median_iqr returns the median and the interquartile range (IQR) of a data vector.
median_iqr(mtcars2$mpg)
## [1] "19.20 (15.43, 22.80)"
The n_perc function is the workhorse, but n_perc0 is also provided for ease of use in the same way that base R has paste and paste0. n_perc returns the n (%) with the percentage sign in the string; n_perc0 omits the percentage sign from the string. The latter is good for tables, the former for in-line text.
n_perc(mtcars2$cyl == 4)
## [1] "11 (34.38%)"
n_perc0(mtcars2$cyl == 4)
## [1] "11 (34)"
n_perc(mtcars2$cyl_factor == 4) # this returns 0 (0.00%)
## [1] "0 (0.00%)"
n_perc(mtcars2$cyl_factor == "4 cylinders")
## [1] "11 (34.38%)"
n_perc(mtcars2$cyl_factor == levels(mtcars2$cyl_factor)[2])
## [1] "11 (34.38%)"
# The count and percentage of 4- or 6-cylinder vehicles in the data set is
n_perc(mtcars2$cyl %in% c(4, 6))
## [1] "18 (56.25%)"
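The calls above all take a logical vector. A minimal base R sketch of the underlying arithmetic follows; n_perc_sketch and its show_symbol argument are hypothetical names for illustration, not the qwraps2 API. It also reproduces the factor comparison pitfall shown above.

```r
# Hypothetical helper, not the qwraps2 implementation: count the TRUE values
# of a logical vector and report them as a percentage of its length.
n_perc_sketch <- function(x, digits = 2, show_symbol = TRUE) {
  n <- sum(x)
  p <- formatC(100 * mean(x), digits = digits, format = "f")
  paste0(n, " (", p, if (show_symbol) "%" else "", ")")
}

n_perc_sketch(mtcars$cyl == 4)
n_perc_sketch(mtcars$cyl == 4, digits = 0, show_symbol = FALSE)

# Comparing a factor with a number compares against the level labels,
# so no element matches and the count is zero.
cyl_factor <- factor(mtcars$cyl, levels = c(6, 4, 8),
                     labels = paste(c(6, 4, 8), "cylinders"))
n_perc_sketch(cyl_factor == 4)
```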
Objective: build a table reporting summary statistics for some of the variables in the mtcars2 data.frame overall, and within subgroups. We’ll start with something very simple and build up to something bigger.
Let’s report the min, max, and mean (sd) for continuous variables and n (%) for categorical variables. We will report mpg, disp, wt, and gear overall and by number of cylinders.
The function summary_table, along with some dplyr functions, will do the work for us. summary_table takes two arguments:
.data: a (grouped_df) data.frame
summaries: a list of summaries. This is a list-of-lists. The outer list defines the row groups and the inner lists define the specific summaries.
args(summary_table)
## function (.data, summaries)
## NULL
Let’s build a list-of-lists to pass to the summaries argument of summary_table. Each inner list holds named formulae, each defining a wanted summary. These formulae are passed through dplyr::summarize_ to generate the table. The names are important, as they are used to label row groups and row names in the table.
our_summary1 <-
  list("Miles Per Gallon" =
         list("min" = ~ min(mpg),
              "max" = ~ max(mpg),
              "mean (sd)" = ~ qwraps2::mean_sd(mpg)),
       "Displacement" =
         list("min" = ~ min(disp),
              "max" = ~ max(disp),
              "mean (sd)" = ~ qwraps2::mean_sd(disp)),
       "Weight (1000 lbs)" =
         list("min" = ~ min(wt),
              "max" = ~ max(wt),
              "mean (sd)" = ~ qwraps2::mean_sd(wt)),
       "Forward Gears" =
         list("Three" = ~ qwraps2::n_perc0(gear == 3),
              "Four" = ~ qwraps2::n_perc0(gear == 4),
              "Five" = ~ qwraps2::n_perc0(gear == 5))
       )
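To see why one-sided formulae are a convenient way to store summaries, the sketch below evaluates a small list-of-lists by hand in base R. The helper eval_rhs and the toy_summary object are hypothetical and only illustrate the idea; summary_table itself does this work through dplyr, as described above.

```r
# A small list-of-lists in the same shape as our_summary1 (toy example).
toy_summary <-
  list("Miles Per Gallon" =
         list("min" = ~ min(mpg),
              "max" = ~ max(mpg)),
       "Weight (1000 lbs)" =
         list("min" = ~ min(wt)))

# Hypothetical helper: evaluate the right-hand side of a one-sided formula
# using the columns of `data`, falling back to the formula's environment.
eval_rhs <- function(f, data) {
  eval(f[[2]], envir = data, enclos = environment(f))
}

# One element per row group, one named value per summary within the group.
toy_values <- lapply(toy_summary, function(group) {
  vapply(group, eval_rhs, numeric(1), data = mtcars)
})

toy_values
```

The outer names become row groups and the inner names become row labels, which is why naming the list elements carefully matters.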
Building the table is done with a call to summary_table:
summary_table(mtcars2, our_summary1)
 | mtcars2 (N = 32) |
---|---|
Miles Per Gallon | |
min | 10.4 |
max | 33.9 |
mean (sd) | 20.09 ± 6.03 |
Displacement | |
min | 71.1 |
max | 472 |
mean (sd) | 230.72 ± 123.94 |
Weight (1000 lbs) | |
min | 1.513 |
max | 5.424 |
mean (sd) | 3.22 ± 0.98 |
Forward Gears | |
Three | 15 (47) |
Four | 12 (38) |
Five | 5 (16) |
summary_table(dplyr::group_by(mtcars2, cyl_factor), our_summary1)
 | cyl_factor: 6 cylinders (N = 7) | cyl_factor: 4 cylinders (N = 11) | cyl_factor: 8 cylinders (N = 14) |
---|---|---|---|
Miles Per Gallon | |||
min | 17.8 | 21.4 | 10.4 |
max | 21.4 | 33.9 | 19.2 |
mean (sd) | 19.74 ± 1.45 | 26.66 ± 4.51 | 15.10 ± 2.56 |
Displacement | |||
min | 145.0 | 71.1 | 275.8 |
max | 258.0 | 146.7 | 472.0 |
mean (sd) | 183.31 ± 41.56 | 105.14 ± 26.87 | 353.10 ± 67.77 |
Weight (1000 lbs) | |||
min | 2.620 | 1.513 | 3.170 |
max | 3.460 | 3.190 | 5.424 |
mean (sd) | 3.12 ± 0.36 | 2.29 ± 0.57 | 4.00 ± 0.76 |
Forward Gears | |||
Three | 2 (29) | 1 (9) | 12 (86) |
Four | 4 (57) | 8 (73) | 0 (0) |
Five | 1 (14) | 2 (18) | 2 (14) |
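Conceptually, the grouped table applies each summary within each group. The following base R sketch shows that idea for a few summaries of mpg, using split() instead of dplyr; it is an illustration only, not how summary_table is implemented.

```r
# Split mpg by number of cylinders, then apply summaries within each group.
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)

group_n   <- vapply(mpg_by_cyl, length, integer(1))
group_min <- vapply(mpg_by_cyl, min, numeric(1))
group_max <- vapply(mpg_by_cyl, max, numeric(1))

# One column per group, one row per summary.
rbind(n = group_n, min = group_min, max = group_max)
```

Note that split() orders the groups by the sorted values of cyl (4, 6, 8), whereas the table above follows the level order of cyl_factor (6, 4, 8).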
If you want to change the column names, do so via the cnames argument, which the print method for qwraps2_summary_table objects passes on to qable.
print(summary_table(dplyr::group_by(mtcars2, cyl_factor), our_summary1),
      cnames = c("Col 1", "Col 2", "Col 3"))
 | Col 1 | Col 2 | Col 3 |
---|---|---|---|
Miles Per Gallon | |||
min | 17.8 | 21.4 | 10.4 |
max | 21.4 | 33.9 | 19.2 |
mean (sd) | 19.74 ± 1.45 | 26.66 ± 4.51 | 15.10 ± 2.56 |
Displacement | |||
min | 145.0 | 71.1 | 275.8 |
max | 258.0 | 146.7 | 472.0 |
mean (sd) | 183.31 ± 41.56 | 105.14 ± 26.87 | 353.10 ± 67.77 |
Weight (1000 lbs) | |||
min | 2.620 | 1.513 | 3.170 |
max | 3.460 | 3.190 | 5.424 |
mean (sd) | 3.12 ± 0.36 | 2.29 ± 0.57 | 4.00 ± 0.76 |
Forward Gears | |||
Three | 2 (29) | 1 (9) | 12 (86) |
Four | 4 (57) | 8 (73) | 0 (0) |
Five | 1 (14) | 2 (18) | 2 (14) |
The task of building the summaries list-of-lists can be tedious. tab_summary is provided to help with that. For a numeric variable, tab_summary will provide the formulae for the min, median (IQR), mean (sd), and max. factor and character vectors will have calls to qwraps2::n_perc0 for all levels provided.
tab_summary(mtcars2$mpg)
## $min
## ~min(mtcars2$mpg)
## <environment: 0x4613ce0>
##
## $`median (IQR)`
## ~qwraps2::median_iqr(mtcars2$mpg)
## <environment: 0x4613ce0>
##
## $`mead (sd)`
## ~qwraps2::mean_sd(mtcars2$mpg)
## <environment: 0x4613ce0>
##
## $max
## ~max(mtcars2$mpg)
## <environment: 0x4613ce0>
tab_summary(mtcars2$gear) # gear is a numeric vector!
## $min
## ~min(mtcars2$gear)
## <environment: 0x34f9ff0>
##
## $`median (IQR)`
## ~qwraps2::median_iqr(mtcars2$gear)
## <environment: 0x34f9ff0>
##
## $`mead (sd)`
## ~qwraps2::mean_sd(mtcars2$gear)
## <environment: 0x34f9ff0>
##
## $max
## ~max(mtcars2$gear)
## <environment: 0x34f9ff0>
tab_summary(factor(mtcars2$gear))
## $`3`
## ~qwraps2::n_perc0(factor(mtcars2$gear) == "3")
## <environment: 0x330c978>
##
## $`4`
## ~qwraps2::n_perc0(factor(mtcars2$gear) == "4")
## <environment: 0x32f5fa8>
##
## $`5`
## ~qwraps2::n_perc0(factor(mtcars2$gear) == "5")
## <environment: 0x32eb5a0>
The our_summary1 object can be recreated as follows. Some additional row groups are provided to show the default behavior of tab_summary.
our_summary2 <-
  with(mtcars2,
       list("Miles Per Gallon" = tab_summary(mpg)[c(1, 4, 3)],
            "Displacement (default summary)" = tab_summary(disp),
            "Displacement" = c(tab_summary(disp)[c(1, 4, 3)],
                               "mean (95% CI)" = ~ frmtci(qwraps2::mean_ci(disp))),
            "Weight (1000 lbs)" = tab_summary(wt)[c(1, 4, 3)],
            "Forward Gears" = tab_summary(as.character(gear))
            ))
whole <- summary_table(mtcars2, our_summary2)
whole
 | mtcars2 (N = 32) |
---|---|
Miles Per Gallon | |
min | 10.4 |
max | 33.9 |
mead (sd) | 20.09 ± 6.03 |
Displacement (default summary) | |
min | 71.1 |
median (IQR) | 196.30 (120.83, 326.00) |
mead (sd) | 230.72 ± 123.94 |
max | 472 |
Displacement | |
min | 71.1 |
max | 472 |
mead (sd) | 230.72 ± 123.94 |
mean (95% CI) | 230.72 (187.78, 273.66) |
Weight (1000 lbs) | |
min | 1.513 |
max | 5.424 |
mead (sd) | 3.22 ± 0.98 |
Forward Gears | |
3 | 15 (47) |
4 | 12 (38) |
5 | 5 (16) |
Group by multiple factors:
grouped <- summary_table(dplyr::group_by(mtcars2, am, vs), our_summary2)
grouped
 | am: 0 vs: 0 (N = 12) | am: 0 vs: 1 (N = 7) | am: 1 vs: 0 (N = 6) | am: 1 vs: 1 (N = 7) |
---|---|---|---|---|
Miles Per Gallon | ||||
min | 10.4 | 17.8 | 15.0 | 21.4 |
max | 19.2 | 24.4 | 26.0 | 33.9 |
mead (sd) | 15.05 ± 2.77 | 20.74 ± 2.47 | 19.75 ± 4.01 | 28.37 ± 4.76 |
Displacement (default summary) | ||||
min | 275.8 | 120.1 | 120.3 | 71.1 |
median (IQR) | 355.00 (296.95, 410.00) | 167.60 (143.75, 196.30) | 160.00 (148.75, 265.75) | 79.00 (77.20, 101.55) |
mead (sd) | 357.62 ± 71.82 | 175.11 ± 49.13 | 206.22 ± 95.23 | 89.80 ± 18.80 |
max | 472 | 258 | 351 | 121 |
Displacement | ||||
min | 275.8 | 120.1 | 120.3 | 71.1 |
max | 472 | 258 | 351 | 121 |
mead (sd) | 357.62 ± 71.82 | 175.11 ± 49.13 | 206.22 ± 95.23 | 89.80 ± 18.80 |
mean (95% CI) | 357.62 (316.98, 398.25) | 175.11 (138.72, 211.51) | 206.22 (130.02, 282.42) | 89.80 (75.87, 103.73) |
Weight (1000 lbs) | ||||
min | 3.435 | 2.465 | 2.140 | 1.513 |
max | 5.424 | 3.460 | 3.570 | 2.780 |
mead (sd) | 4.10 ± 0.77 | 3.19 ± 0.35 | 2.86 ± 0.49 | 2.03 ± 0.44 |
Forward Gears | ||||
3 | 12 (100) | 3 (43) | 0 (0) | 0 (0) |
4 | 0 (0) | 4 (57) | 2 (33) | 6 (86) |
5 | 0 (0) | 0 (0) | 4 (67) | 1 (14) |
As one table:
both <- cbind(whole, grouped)
both
 | mtcars2 (N = 32) | am: 0 vs: 0 (N = 12) | am: 0 vs: 1 (N = 7) | am: 1 vs: 0 (N = 6) | am: 1 vs: 1 (N = 7) |
---|---|---|---|---|---|
Miles Per Gallon | |||||
min | 10.4 | 10.4 | 17.8 | 15.0 | 21.4 |
max | 33.9 | 19.2 | 24.4 | 26.0 | 33.9 |
mead (sd) | 20.09 ± 6.03 | 15.05 ± 2.77 | 20.74 ± 2.47 | 19.75 ± 4.01 | 28.37 ± 4.76 |
Displacement (default summary) | |||||
min | 71.1 | 275.8 | 120.1 | 120.3 | 71.1 |
median (IQR) | 196.30 (120.83, 326.00) | 355.00 (296.95, 410.00) | 167.60 (143.75, 196.30) | 160.00 (148.75, 265.75) | 79.00 (77.20, 101.55) |
mead (sd) | 230.72 ± 123.94 | 357.62 ± 71.82 | 175.11 ± 49.13 | 206.22 ± 95.23 | 89.80 ± 18.80 |
max | 472 | 472 | 258 | 351 | 121 |
Displacement | |||||
min | 71.1 | 275.8 | 120.1 | 120.3 | 71.1 |
max | 472 | 472 | 258 | 351 | 121 |
mead (sd) | 230.72 ± 123.94 | 357.62 ± 71.82 | 175.11 ± 49.13 | 206.22 ± 95.23 | 89.80 ± 18.80 |
mean (95% CI) | 230.72 (187.78, 273.66) | 357.62 (316.98, 398.25) | 175.11 (138.72, 211.51) | 206.22 (130.02, 282.42) | 89.80 (75.87, 103.73) |
Weight (1000 lbs) | |||||
min | 1.513 | 3.435 | 2.465 | 2.140 | 1.513 |
max | 5.424 | 5.424 | 3.460 | 3.570 | 2.780 |
mead (sd) | 3.22 ± 0.98 | 4.10 ± 0.77 | 3.19 ± 0.35 | 2.86 ± 0.49 | 2.03 ± 0.44 |
Forward Gears | |||||
3 | 15 (47) | 12 (100) | 3 (43) | 0 (0) | 0 (0) |
4 | 12 (38) | 0 (0) | 4 (57) | 2 (33) | 6 (86) |
5 | 5 (16) | 0 (0) | 0 (0) | 4 (67) | 1 (14) |