It is common for a manuscript to require a data summary table. The table might include simple summary statistics for the whole sample and for subgroups. Several tools are available for building such tables. In my opinion, though, most of them carry conventions imposed by their authors, so that other users must not only understand the tool but also think like its authors. I wrote this package to be as flexible and general as possible. I hope you like these tools and will be able to use them in your work.
This vignette presents the use of the `summary_table`, `tab_summary`, and `qable` functions for quickly building data summary tables. These functions implicitly use the `mean_sd`, `median_iqr`, and `n_perc0` functions from `qwraps2` as well.
We will use a modified version of the `mtcars` data set for examples throughout this vignette. The following packages are required to run the code in this vignette and to construct the `mtcars2` `data.frame`.
The `mtcars2` data frame will have three versions of the `cyl` vector: the original numeric values in `cyl`, a `character` version, and a `factor` version.
set.seed(42)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(qwraps2)
# define the markup language we are working in.
# options(qwraps2_markup = "latex") is also supported.
options(qwraps2_markup = "markdown")
data(mtcars)
mtcars2 <-
dplyr::mutate(mtcars,
cyl_factor = factor(cyl,
levels = c(6, 4, 8),
labels = paste(c(6, 4, 8), "cylinders")),
cyl_character = paste(cyl, "cylinders"))
str(mtcars2)
## 'data.frame': 32 obs. of 13 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4 ...
## $ cyl_factor : Factor w/ 3 levels "6 cylinders",..: 1 1 2 1 3 1 3 2 2 1 ...
## $ cyl_character: chr "6 cylinders" "6 cylinders" "4 cylinders" "6 cylinders" ...
Notice that `cyl_factor` and `cyl_character` were constructed so that coercing `cyl_character` to a `factor` does not reproduce `cyl_factor`: the levels are in a different order.
with(mtcars2, table(cyl_factor, cyl_character))
## cyl_character
## cyl_factor 4 cylinders 6 cylinders 8 cylinders
## 6 cylinders 0 7 0
## 4 cylinders 11 0 0
## 8 cylinders 0 0 14
with(mtcars2, all.equal(factor(cyl_character), cyl_factor))
## [1] "Attributes: < Component \"levels\": 2 string mismatches >"
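The mismatch comes from how levels are ordered. A small base R sketch (the vector `cyl_chr` is a made-up stand-in, not part of `mtcars2`) illustrates that `factor()` sorts the unique values when no `levels` are given, while explicit `levels` are kept in the order supplied:

```r
# Toy stand-in for a character column such as cyl_character.
cyl_chr <- c("6 cylinders", "4 cylinders", "8 cylinders", "6 cylinders")

# Without `levels`, factor() sorts the unique values.
default_levels <- levels(factor(cyl_chr))

# With `levels`, the supplied order is preserved.
explicit_levels <- levels(factor(cyl_chr,
                                 levels = c("6 cylinders", "4 cylinders", "8 cylinders")))

default_levels
explicit_levels
```

This is why the order of rows or columns in a summary table can change depending on whether a grouping variable is stored as a character vector or as a factor.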
`mean_sd` returns the (arithmetic) mean and standard deviation of a numeric vector. For example, `mean_sd(mtcars2$mpg)` returns the formatted string.
mean_sd(mtcars2$mpg)
## [1] "20.09 ± 6.03"
mean_sd(mtcars2$mpg, denote_sd = "paren")
## [1] "20.09 (6.03)"
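To make the two output styles concrete, here is a minimal base R sketch of how such strings could be assembled. The helper `fmt_mean_sd` is hypothetical and written only for illustration; it is not the implementation used by `qwraps2`.

```r
# Hypothetical helper, for illustration only; not qwraps2's implementation.
fmt_mean_sd <- function(x, digits = 2, denote_sd = c("pm", "paren")) {
  denote_sd <- match.arg(denote_sd)
  m <- formatC(mean(x), digits = digits, format = "f")
  s <- formatC(sd(x), digits = digits, format = "f")
  # "\u00b1" is the plus-minus sign
  if (denote_sd == "pm") paste(m, "\u00b1", s) else paste0(m, " (", s, ")")
}

fmt_mean_sd(mtcars$mpg)
fmt_mean_sd(mtcars$mpg, denote_sd = "paren")
```

The sketch only formats the two numbers; the package function additionally honors the markup option set earlier in this vignette.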
The default setting for `mean_sd` is to return the mean ± sd. In a table this default is helpful because the default table formatting for counts and percentages is n (%).
`mean_sd` and other functions are helpful for in-line text too:
The 32 vehicles in the `mtcars` data set had an average fuel
economy of 20.09 ± 6.03 miles per gallon.
produces
The 32 vehicles in the mtcars data set had an average fuel economy of 20.09 ± 6.03 miles per gallon.
If you need the mean and a confidence interval, there is `mean_ci`. `mean_ci` returns a `qwraps2_mean_ci` object, which is a named vector with the mean, the lower confidence limit, and the upper confidence limit. The printing method for `qwraps2_mean_ci` objects is a call to the `frmtci` function. You can modify the formatting of the printed result by adjusting the arguments passed to `frmtci`.
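A rough base R sketch of what such a named vector might hold follows. It assumes a normal-quantile interval, which is consistent with the 95% interval for `disp` reported in the tables later in this vignette; the helper name `mean_ci_sketch` and the element names `lcl` and `ucl` are made up here and are not taken from the package.

```r
# Hypothetical helper: mean with a normal-quantile confidence interval.
mean_ci_sketch <- function(x, alpha = 0.05) {
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))   # standard error of the mean
  z  <- qnorm(1 - alpha / 2)      # normal quantile (an assumption)
  c(mean = m, lcl = m - z * se, ucl = m + z * se)
}

ci <- mean_ci_sketch(mtcars$disp)
round(ci, 2)
```

For small samples a t-quantile (`qt`) may be preferable to `qnorm`; which one is appropriate is left to the analyst.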
Similar to the `mean_sd` function, `median_iqr` returns the median and the interquartile range (IQR) of a data vector.
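As an illustration of the quantities involved, the following base R sketch computes the median together with the first and third quartiles. The helper `median_iqr_sketch` is hypothetical, and it relies on the default quantile algorithm of `stats::quantile`, which may or may not match what the package uses.

```r
# Hypothetical helper: median with the first and third quartiles.
median_iqr_sketch <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
  c(median = q[2], q1 = q[1], q3 = q[3])
}

mpg_summary <- median_iqr_sketch(mtcars$mpg)
mpg_summary
```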
The `n_perc` function is the workhorse, but `n_perc0` is also provided for ease of use, in the same way that base R has `paste` and `paste0`. `n_perc` returns the n (%) with the percentage sign in the string; `n_perc0` omits the percentage sign from the string. The latter is good for tables, the former for in-line text.
n_perc(mtcars2$cyl == 4)
## [1] "11 (34.38%)"
n_perc0(mtcars2$cyl == 4)
## [1] "11 (34)"
n_perc(mtcars2$cyl_factor == 4) # this returns 0 (0.00%)
## [1] "0 (0.00%)"
n_perc(mtcars2$cyl_factor == "4 cylinders")
## [1] "11 (34.38%)"
n_perc(mtcars2$cyl_factor == levels(mtcars2$cyl_factor)[2])
## [1] "11 (34.38%)"
# The count and percentage of 4 or 6 cylinders vehicles in the data set is
n_perc(mtcars2$cyl %in% c(4, 6))
## [1] "18 (56.25%)"
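The zero count for `cyl_factor == 4` above is worth a closer look: comparing a factor with `4` compares its labels with the string "4", and no label matches. A minimal base R sketch, using a hypothetical helper `fmt_n_perc` that is not the package's implementation, reproduces the counting logic:

```r
# Hypothetical helper, for illustration only.
fmt_n_perc <- function(x, digits = 2, show_symbol = TRUE) {
  n <- sum(x)                                           # number of TRUE values
  p <- formatC(100 * mean(x), digits = digits, format = "f")
  paste0(n, " (", p, if (show_symbol) "%" else "", ")")
}

# Same construction as cyl_factor in mtcars2.
cyl_factor <- factor(mtcars$cyl, levels = c(6, 4, 8),
                     labels = paste(c(6, 4, 8), "cylinders"))

fmt_n_perc(mtcars$cyl %in% c(4, 6))
fmt_n_perc(cyl_factor == 4)               # labels are compared with "4": no matches
fmt_n_perc(cyl_factor == "4 cylinders")   # compare against the label instead
```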
Let \(\left\{x_1, x_2, x_3, \ldots, x_n \right\}\) be a sample of size \(n\) with \(x_i > 0\) for all \(i.\) Then the geometric mean, \(\mu_g,\) and the geometric standard deviation, \(\sigma_g,\) are given by Equations (1) and (2), respectively, for any logarithm base \(b > 0,\) \(b \neq 1.\)
\[ \mu_g = \left( \prod_{i = 1}^{n} x_i \right)^{\frac{1}{n}} = b^{ \frac{1}{n} \sum_{i = 1}^{n} \log_{b} x_i } \tag{1} \]
\[ \sigma_g = b ^ { \sqrt{ \frac{\sum_{i = 1}^{n} \left( \log_{b} \frac{x_i}{\mu_g} \right)^2}{n}}} \tag{2} \]
When looking for the geometric standard deviation in R, the simple `exp(sd(log(x)))` is not exactly correct. Note that in Equation (2) the denominator is \(n,\) the full sample size, whereas the `sd` and `var` functions in R use the denominator \(n - 1.\) To get the geometric standard deviation one should adjust the result by multiplying the variance of the logged values by \((n - 1) / n\) or their standard deviation by \(\sqrt{(n - 1) / n}.\) See the example below.
x <- runif(6, min = 4, max = 70)
# geometric mean
mu_g <- prod(x) ** (1 / length(x))
mu_g
## [1] 46.50714
exp(mean(log(x)))
## [1] 46.50714
1.2 ** mean(log(x, base = 1.2))
## [1] 46.50714
# geometric standard deviation
exp(sd(log(x))) ## This is wrong
## [1] 1.500247
# these equations are correct
sigma_g <- exp(sqrt(sum(log(x / mu_g) ** 2) / length(x)))
sigma_g
## [1] 1.448151
exp(sqrt((length(x) - 1) / length(x)) * sd(log(x)))
## [1] 1.448151
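The agreement between the last two results can be checked by hand. A short derivation, taking \(b = e\) and writing \(s\) for the value returned by `sd(log(x))` (the symbol \(s\) is introduced here only for illustration):

```latex
\begin{align*}
\log \frac{x_i}{\mu_g}
  &= \log x_i - \frac{1}{n} \sum_{j = 1}^{n} \log x_j, \\
s^2
  &= \frac{1}{n - 1} \sum_{i = 1}^{n}
     \left( \log x_i - \frac{1}{n} \sum_{j = 1}^{n} \log x_j \right)^2
   = \frac{1}{n - 1} \sum_{i = 1}^{n} \left( \log \frac{x_i}{\mu_g} \right)^2, \\
\sigma_g
  &= \exp\left( \sqrt{ \frac{1}{n} \sum_{i = 1}^{n}
     \left( \log \frac{x_i}{\mu_g} \right)^2 } \right)
   = \exp\left( \sqrt{\frac{n - 1}{n}} \, s \right).
\end{align*}
```

The first line holds because \(\log \mu_g\) is the arithmetic mean of the logged values, so the sum of squares in the geometric standard deviation is the same sum of squares that `sd` divides by \(n - 1.\)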
The functions `gmean`, `gvar`, and `gsd` in the package provide the geometric mean, variance, and standard deviation for a sample.
gmean(x)
## [1] 46.50714
all.equal(gmean(x), mu_g)
## [1] TRUE
gvar(x)
## [1] 1.146958
all.equal(gvar(x), sigma_g^2) # This is supposed to be FALSE
## [1] "Mean relative difference: 0.8284385"
all.equal(gvar(x), exp(log(sigma_g)^2))
## [1] TRUE
gsd(x)
## [1] 1.448151
all.equal(gsd(x), sigma_g)
## [1] TRUE
`gmean_sd` provides a quick way to report the geometric mean and geometric standard deviation, in the same way that `mean_sd` does for the arithmetic mean and arithmetic standard deviation.
Objective: build a table reporting summary statistics for some of the variables in the `mtcars2` `data.frame` overall and within subgroups. We’ll start with something very simple and build up to something bigger.
Let’s report the min, max, and mean (sd) for continuous variables and n (%) for categorical variables. We will report `mpg`, `disp`, `wt`, and `gear` overall and by number of cylinders.
The function `summary_table`, along with some `dplyr` functions, will do the work for us. `summary_table` takes two arguments:
- `.data`: a (`grouped_df`) data.frame
- `summaries`: a list of summaries. This is a list-of-lists. The outer list defines the row groups and the inner lists define the specific summaries.
Let’s build a list-of-lists to pass to the `summaries` argument of `summary_table`. The inner lists are named `formula`e defining the wanted summary. These `formula`e are passed through `dplyr::summarize_` to generate the table. The names are important, as they are used to label row groups and row names in the table.
our_summary1 <-
list("Miles Per Gallon" =
list("min" = ~ min(mpg),
"max" = ~ max(mpg),
"mean (sd)" = ~ qwraps2::mean_sd(mpg)),
"Displacement" =
list("min" = ~ min(disp),
"max" = ~ max(disp),
"mean (sd)" = ~ qwraps2::mean_sd(disp)),
"Weight (1000 lbs)" =
list("min" = ~ min(wt),
"max" = ~ max(wt),
"mean (sd)" = ~ qwraps2::mean_sd(wt)),
"Forward Gears" =
list("Three" = ~ qwraps2::n_perc0(gear == 3),
"Four" = ~ qwraps2::n_perc0(gear == 4),
"Five" = ~ qwraps2::n_perc0(gear == 5))
)
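To see why one-sided formulae are a convenient way to describe summaries, consider the following base R sketch. It evaluates the right-hand side of each formula with the columns of a data frame in scope and keeps the list names as labels. The helper `eval_summaries` is hypothetical; the package itself passes the formulae through `dplyr`, so this is only an illustration of the idea, not of the actual mechanism.

```r
# Hypothetical helper: evaluate each formula's right-hand side against `data`.
eval_summaries <- function(data, summaries) {
  lapply(summaries, function(group) {
    vapply(group,
           # f[[2]] is the expression to the right of `~` in a one-sided formula
           function(f) as.character(eval(f[[2]], data, environment(f))),
           character(1))
  })
}

toy_summary <- list("Miles Per Gallon" = list("min" = ~ min(mpg),
                                              "max" = ~ max(mpg)))

toy_result <- eval_summaries(mtcars, toy_summary)
toy_result
```

The outer name ("Miles Per Gallon") plays the role of a row group label and the inner names ("min", "max") play the role of row names.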
Building the table is done with a call to `summary_table`:
 | mtcars2 (N = 32) |
---|---|
Miles Per Gallon | |
min | 10.4 |
max | 33.9 |
mean (sd) | 20.09 ± 6.03 |
Displacement | |
min | 71.1 |
max | 472 |
mean (sd) | 230.72 ± 123.94 |
Weight (1000 lbs) | |
min | 1.513 |
max | 5.424 |
mean (sd) | 3.22 ± 0.98 |
Forward Gears | |
Three | 15 (47) |
Four | 12 (38) |
Five | 5 (16) |
 | cyl_factor: 6 cylinders (N = 7) | cyl_factor: 4 cylinders (N = 11) | cyl_factor: 8 cylinders (N = 14) |
---|---|---|---|
Miles Per Gallon | |||
min | 17.8 | 21.4 | 10.4 |
max | 21.4 | 33.9 | 19.2 |
mean (sd) | 19.74 ± 1.45 | 26.66 ± 4.51 | 15.10 ± 2.56 |
Displacement | |||
min | 145.0 | 71.1 | 275.8 |
max | 258.0 | 146.7 | 472.0 |
mean (sd) | 183.31 ± 41.56 | 105.14 ± 26.87 | 353.10 ± 67.77 |
Weight (1000 lbs) | |||
min | 2.620 | 1.513 | 3.170 |
max | 3.460 | 3.190 | 5.424 |
mean (sd) | 3.12 ± 0.36 | 2.29 ± 0.57 | 4.00 ± 0.76 |
Forward Gears | |||
Three | 2 (29) | 1 (9) | 12 (86) |
Four | 4 (57) | 8 (73) | 0 (0) |
Five | 1 (14) | 2 (18) | 2 (14) |
If you want to change the column names, do so via the `cnames` argument to `qable`, which is reached through the print method for `qwraps2_summary_table` objects. Any argument that you want to send to `qable` can be sent there when explicitly using the `print` method for `qwraps2_summary_table` objects.
print(summary_table(dplyr::group_by(mtcars2, cyl_factor), our_summary1),
rtitle = "Summary Statistics",
cnames = c("Col 1", "Col 2", "Col 3"))
Summary Statistics | Col 1 | Col 2 | Col 3 |
---|---|---|---|
Miles Per Gallon | |||
min | 17.8 | 21.4 | 10.4 |
max | 21.4 | 33.9 | 19.2 |
mean (sd) | 19.74 ± 1.45 | 26.66 ± 4.51 | 15.10 ± 2.56 |
Displacement | |||
min | 145.0 | 71.1 | 275.8 |
max | 258.0 | 146.7 | 472.0 |
mean (sd) | 183.31 ± 41.56 | 105.14 ± 26.87 | 353.10 ± 67.77 |
Weight (1000 lbs) | |||
min | 2.620 | 1.513 | 3.170 |
max | 3.460 | 3.190 | 5.424 |
mean (sd) | 3.12 ± 0.36 | 2.29 ± 0.57 | 4.00 ± 0.76 |
Forward Gears | |||
Three | 2 (29) | 1 (9) | 12 (86) |
Four | 4 (57) | 8 (73) | 0 (0) |
Five | 1 (14) | 2 (18) | 2 (14) |
The task of building the `summaries` list-of-lists can be tedious. `tab_summary` is designed to make it easier. For `numeric` variables, `tab_summary` will provide the `formula`e for the min, median (IQR), mean (sd), and max. `factor` and `character` vectors will have calls to `qwraps2::n_perc` for all levels provided.
For version 0.2.3.9000 or beyond, arguments have been added to `tab_summary` to help control some of the formatting of counts and percentages. The original behavior of `tab_summary` used `n_perc0` to format the summary of categorical variables. Now, `n_perc` is called and the end user can specify formatting options via a `list` passed to the `n_perc_args` argument. The default settings for `tab_summary` are shown below.
args(tab_summary)
## function (x, n_perc_args = list(digits = 0, show_symbol = FALSE),
## envir = parent.frame())
## NULL
These options will make the output look as if `n_perc0` had been called instead of `n_perc`. More importantly, these defaults will not honor the `options()$qwraps2_frmt_digits` setting.
Examples for `tab_summary` follow:
tab_summary(mtcars2$mpg)
## $min
## ~min(mtcars2$mpg)
##
## $`median (IQR)`
## ~qwraps2::median_iqr(mtcars2$mpg)
##
## $`mean (sd)`
## ~qwraps2::mean_sd(mtcars2$mpg)
##
## $max
## ~max(mtcars2$mpg)
tab_summary(mtcars2$gear) # gear is a numeric vector!
## $min
## ~min(mtcars2$gear)
##
## $`median (IQR)`
## ~qwraps2::median_iqr(mtcars2$gear)
##
## $`mean (sd)`
## ~qwraps2::mean_sd(mtcars2$gear)
##
## $max
## ~max(mtcars2$gear)
tab_summary(factor(mtcars2$gear))
## $`3`
## ~qwraps2::n_perc(factor(mtcars2$gear) == "3", digits = 0, show_symbol = FALSE)
##
## $`4`
## ~qwraps2::n_perc(factor(mtcars2$gear) == "4", digits = 0, show_symbol = FALSE)
##
## $`5`
## ~qwraps2::n_perc(factor(mtcars2$gear) == "5", digits = 0, show_symbol = FALSE)
The `our_summary1` object can be recreated as follows. Some additional row groups are provided to show the default behavior of `tab_summary`. Important: note that the `tab_summary` calls are made inside `with`. Further explanation for this follows.
our_summary2 <-
with(mtcars2,
list("Miles Per Gallon" = tab_summary(mpg)[c(1, 4, 3)],
"Displacement (default summary)" = tab_summary(disp),
"Displacement" = c(tab_summary(disp)[c(1, 4, 3)],
"mean (95% CI)" = ~ frmtci(qwraps2::mean_ci(disp))),
"Weight (1000 lbs)" = tab_summary(wt)[c(1, 4, 3)],
"Forward Gears" = tab_summary(as.character(gear))
))
 | mtcars2 (N = 32) |
---|---|
Miles Per Gallon | |
min | 10.4 |
max | 33.9 |
mean (sd) | 20.09 ± 6.03 |
Displacement (default summary) | |
min | 71.1 |
median (IQR) | 196.30 (120.83, 326.00) |
mean (sd) | 230.72 ± 123.94 |
max | 472 |
Displacement | |
min | 71.1 |
max | 472 |
mean (sd) | 230.72 ± 123.94 |
mean (95% CI) | 230.72 (187.78, 273.66) |
Weight (1000 lbs) | |
min | 1.513 |
max | 5.424 |
mean (sd) | 3.22 ± 0.98 |
Forward Gears | |
3 | 15 (47) |
4 | 12 (38) |
5 | 5 (16) |
Group by multiple factors:
 | am: 0 vs: 0 (N = 12) | am: 0 vs: 1 (N = 7) | am: 1 vs: 0 (N = 6) | am: 1 vs: 1 (N = 7) |
---|---|---|---|---|
Miles Per Gallon | ||||
min | 10.4 | 17.8 | 15.0 | 21.4 |
max | 19.2 | 24.4 | 26.0 | 33.9 |
mean (sd) | 15.05 ± 2.77 | 20.74 ± 2.47 | 19.75 ± 4.01 | 28.37 ± 4.76 |
Displacement (default summary) | ||||
min | 275.8 | 120.1 | 120.3 | 71.1 |
median (IQR) | 355.00 (296.95, 410.00) | 167.60 (143.75, 196.30) | 160.00 (148.75, 265.75) | 79.00 (77.20, 101.55) |
mean (sd) | 357.62 ± 71.82 | 175.11 ± 49.13 | 206.22 ± 95.23 | 89.80 ± 18.80 |
max | 472 | 258 | 351 | 121 |
Displacement | ||||
min | 275.8 | 120.1 | 120.3 | 71.1 |
max | 472 | 258 | 351 | 121 |
mean (sd) | 357.62 ± 71.82 | 175.11 ± 49.13 | 206.22 ± 95.23 | 89.80 ± 18.80 |
mean (95% CI) | 357.62 (316.98, 398.25) | 175.11 (138.72, 211.51) | 206.22 (130.02, 282.42) | 89.80 (75.87, 103.73) |
Weight (1000 lbs) | ||||
min | 3.435 | 2.465 | 2.140 | 1.513 |
max | 5.424 | 3.460 | 3.570 | 2.780 |
mean (sd) | 4.10 ± 0.77 | 3.19 ± 0.35 | 2.86 ± 0.49 | 2.03 ± 0.44 |
Forward Gears | ||||
3 | 12 (100) | 3 (43) | 0 (0) | 0 (0) |
4 | 0 (0) | 4 (57) | 2 (33) | 6 (86) |
5 | 0 (0) | 0 (0) | 4 (67) | 1 (14) |
As one table:
 | mtcars2 (N = 32) | am: 0 vs: 0 (N = 12) | am: 0 vs: 1 (N = 7) | am: 1 vs: 0 (N = 6) | am: 1 vs: 1 (N = 7) |
---|---|---|---|---|---|
Miles Per Gallon | |||||
min | 10.4 | 10.4 | 17.8 | 15.0 | 21.4 |
max | 33.9 | 19.2 | 24.4 | 26.0 | 33.9 |
mean (sd) | 20.09 ± 6.03 | 15.05 ± 2.77 | 20.74 ± 2.47 | 19.75 ± 4.01 | 28.37 ± 4.76 |
Displacement (default summary) | |||||
min | 71.1 | 275.8 | 120.1 | 120.3 | 71.1 |
median (IQR) | 196.30 (120.83, 326.00) | 355.00 (296.95, 410.00) | 167.60 (143.75, 196.30) | 160.00 (148.75, 265.75) | 79.00 (77.20, 101.55) |
mean (sd) | 230.72 ± 123.94 | 357.62 ± 71.82 | 175.11 ± 49.13 | 206.22 ± 95.23 | 89.80 ± 18.80 |
max | 472 | 472 | 258 | 351 | 121 |
Displacement | |||||
min | 71.1 | 275.8 | 120.1 | 120.3 | 71.1 |
max | 472 | 472 | 258 | 351 | 121 |
mean (sd) | 230.72 ± 123.94 | 357.62 ± 71.82 | 175.11 ± 49.13 | 206.22 ± 95.23 | 89.80 ± 18.80 |
mean (95% CI) | 230.72 (187.78, 273.66) | 357.62 (316.98, 398.25) | 175.11 (138.72, 211.51) | 206.22 (130.02, 282.42) | 89.80 (75.87, 103.73) |
Weight (1000 lbs) | |||||
min | 1.513 | 3.435 | 2.465 | 2.140 | 1.513 |
max | 5.424 | 5.424 | 3.460 | 3.570 | 2.780 |
mean (sd) | 3.22 ± 0.98 | 4.10 ± 0.77 | 3.19 ± 0.35 | 2.86 ± 0.49 | 2.03 ± 0.44 |
Forward Gears | |||||
3 | 15 (47) | 12 (100) | 3 (43) | 0 (0) | 0 (0) |
4 | 12 (38) | 0 (0) | 4 (57) | 2 (33) | 6 (86) |
5 | 5 (16) | 0 (0) | 0 (0) | 4 (67) | 1 (14) |
There are many, many different ways to format data summary tables. Adding p-values to a table is just one thing that can be done in more than one way. For example, if a row group reports the counts and percentages for each level of a categorical variable across multiple (column) groups, then I would argue that the p-value resulting from a chi-square test or a Fisher exact test would be best placed on the line of the table labeling the row group. However, say we reported the minimum, median, mean, and maximum within a row group for one variable. I would suggest that the p-value from a t-test, or other meaningful test, for the difference in means be reported on the line of the summary table for the mean, not on the line for the row group itself.
With so many possibilities I have reserved construction of a p-value column to be ad hoc. Perhaps an additional column wouldn’t be used and the p-values are edited into row group labels, for example.
If you want to add a p-value column to a `qwraps2_summary_table` object you can with some degree of ease. Note that `qwraps2_summary_table` objects are just character matrices.
both %>% str
## 'qwraps2_summary_table' chr [1:17, 1:5] "10.4" "33.9" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:17] "min" "max" "mean (sd)" "min" ...
## ..$ : chr [1:5] "mtcars2 (N = 32)" "am: 0 vs: 0 (N = 12)" "am: 0 vs: 1 (N = 7)" "am: 1 vs: 0 (N = 6)" ...
## - attr(*, "rgroups")= Named int [1:5] 3 4 4 3 3
## ..- attr(*, "names")= chr [1:5] "Miles Per Gallon" "Displacement (default summary)" "Displacement" "Weight (1000 lbs)" ...
# another good way to view the character matrix
# print.default(both)
Let’s add p-values for testing the difference in the mean between the four groups defined by `am:vs`.
pvals <-
list(lm(mpg ~ am:vs, data = mtcars2),
lm(disp ~ am:vs, data = mtcars2),
lm(disp ~ am:vs, data = mtcars2), # yeah, silly example this is needed twice
lm(wt ~ am:vs, data = mtcars2)) %>%
lapply(aov) %>%
lapply(summary) %>%
lapply(function(x) x[[1]][["Pr(>F)"]][1]) %>%
lapply(frmtp) %>%
do.call(c, .)
pvals
## [1] "*P* < 0.0001" "*P* = 0.0002" "*P* = 0.0002" "*P* < 0.0001"
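One caveat, offered as an aside rather than as a change to the table: `am` and `vs` are stored as numeric 0/1 columns, so R expands `am:vs` in a model formula to a single product column with one degree of freedom, not to indicators for four groups. A minimal sketch of an alternative, assuming a one-way ANOVA across the four `am`/`vs` combinations is what is wanted; it uses only base R and the built-in `mtcars` data, and the object names are illustrative:

```r
# Treat each am/vs combination as one level of a four-level factor.
cars <- transform(mtcars, am_vs = interaction(am, vs))

# One-way ANOVA of mpg across the four groups.
fit <- aov(mpg ~ am_vs, data = cars)
anova_row <- summary(fit)[[1]]

anova_row[["Df"]][1]       # degrees of freedom for the grouping term
anova_row[["Pr(>F)"]][1]   # p-value for a difference in means
```

The resulting p-values would differ from those shown above, so they should not be mixed with the values already in the table.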
Adding the p-value column is done as follows:
both <- cbind(both, "P-value" = "")
both[grepl("mean \\(sd\\)", rownames(both)), "P-value"] <- pvals
and the resulting table is:
 | mtcars2 (N = 32) | am: 0 vs: 0 (N = 12) | am: 0 vs: 1 (N = 7) | am: 1 vs: 0 (N = 6) | am: 1 vs: 1 (N = 7) | P-value |
---|---|---|---|---|---|---|
Miles Per Gallon | ||||||
min | 10.4 | 10.4 | 17.8 | 15.0 | 21.4 | |
max | 33.9 | 19.2 | 24.4 | 26.0 | 33.9 | |
mean (sd) | 20.09 ± 6.03 | 15.05 ± 2.77 | 20.74 ± 2.47 | 19.75 ± 4.01 | 28.37 ± 4.76 | P < 0.0001 |
Displacement (default summary) | ||||||
min | 71.1 | 275.8 | 120.1 | 120.3 | 71.1 | |
median (IQR) | 196.30 (120.83, 326.00) | 355.00 (296.95, 410.00) | 167.60 (143.75, 196.30) | 160.00 (148.75, 265.75) | 79.00 (77.20, 101.55) | |
mean (sd) | 230.72 ± 123.94 | 357.62 ± 71.82 | 175.11 ± 49.13 | 206.22 ± 95.23 | 89.80 ± 18.80 | P = 0.0002 |
max | 472 | 472 | 258 | 351 | 121 | |
Displacement | ||||||
min | 71.1 | 275.8 | 120.1 | 120.3 | 71.1 | |
max | 472 | 472 | 258 | 351 | 121 | |
mean (sd) | 230.72 ± 123.94 | 357.62 ± 71.82 | 175.11 ± 49.13 | 206.22 ± 95.23 | 89.80 ± 18.80 | P = 0.0002 |
mean (95% CI) | 230.72 (187.78, 273.66) | 357.62 (316.98, 398.25) | 175.11 (138.72, 211.51) | 206.22 (130.02, 282.42) | 89.80 (75.87, 103.73) | |
Weight (1000 lbs) | ||||||
min | 1.513 | 3.435 | 2.465 | 2.140 | 1.513 | |
max | 5.424 | 5.424 | 3.460 | 3.570 | 2.780 | |
mean (sd) | 3.22 ± 0.98 | 4.10 ± 0.77 | 3.19 ± 0.35 | 2.86 ± 0.49 | 2.03 ± 0.44 | P < 0.0001 |
Forward Gears | ||||||
3 | 15 (47) | 12 (100) | 3 (43) | 0 (0) | 0 (0) | |
4 | 12 (38) | 0 (0) | 4 (57) | 2 (33) | 6 (86) | |
5 | 5 (16) | 0 (0) | 0 (0) | 4 (67) | 1 (14) |
`with` with `tab_summary`?
`tab_summary` was written to help construct `formula`e to save the end user keystrokes. There are plenty of reasons for `summary_table` to be used without `tab_summary`. However, when it is helpful to use `tab_summary`, make sure you understand the results.
For example, let’s look at a simple summary of the miles per gallon.
# tab_summary(mpg) ## this errors
tab_summary(mtcars$mpg)
## $min
## ~min(mtcars$mpg)
##
## $`median (IQR)`
## ~qwraps2::median_iqr(mtcars$mpg)
##
## $`mean (sd)`
## ~qwraps2::mean_sd(mtcars$mpg)
##
## $max
## ~max(mtcars$mpg)
with(mtcars, tab_summary(mpg))
## $min
## ~min(mpg)
## <environment: 0x7fb242f02608>
##
## $`median (IQR)`
## ~qwraps2::median_iqr(mpg)
## <environment: 0x7fb242f02608>
##
## $`mean (sd)`
## ~qwraps2::mean_sd(mpg)
## <environment: 0x7fb242f02608>
##
## $max
## ~max(mpg)
## <environment: 0x7fb242f02608>
The first call errors because `mpg` is not in the global environment. The difference between the second and third calls is subtle. The second call generates a `formula` with `mtcars$mpg` as an argument, whereas the third call generates a `formula` with only `mpg` as the argument. The difference will be seen in the summary tables if the `.data` is subsetted.
# The same tables:
summary_table(mtcars, list("MPG 1" = with(mtcars, tab_summary(mpg))))
##
##
## | |mtcars (N = 32) |
## |:-------------------------|:--------------------|
## |**MPG 1** | |
## | min |10.4 |
## | median (IQR) |19.20 (15.43, 22.80) |
## | mean (sd) |20.09 ± 6.03 |
## | max |33.9 |
summary_table(mtcars, list("MPG 2" = tab_summary(mtcars$mpg)))
##
##
## | |mtcars (N = 32) |
## |:-------------------------|:--------------------|
## |**MPG 2** | |
## | min |10.4 |
## | median (IQR) |19.20 (15.43, 22.80) |
## | mean (sd) |20.09 ± 6.03 |
## | max |33.9 |
These two calls generate the same table because the `.data` and the implied data within the second call are the same.
# Different tables
summary_table(dplyr::filter(mtcars, am == 0), list("MPG 3" = with(mtcars, tab_summary(mpg))))
 | dplyr::filter(mtcars, am == 0) (N = 19) |
---|---|
MPG 3 | |
min | 10.4 |
median (IQR) | 17.30 (14.95, 19.20) |
mean (sd) | 17.15 ± 3.83 |
max | 24.4 |
summary_table(dplyr::filter(mtcars, am == 0), list("MPG 4" = tab_summary(mtcars$mpg)))
 | dplyr::filter(mtcars, am == 0) (N = 19) |
---|---|
MPG 4 | |
min | 10.4 |
median (IQR) | 19.20 (15.43, 22.80) |
mean (sd) | 20.09 ± 6.03 |
max | 33.9 |
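The second table above (`MPG 4`) is not correct: it is identical to the `MPG 1` and `MPG 2` tables, which summarize all 32 vehicles, even though the `.data` here holds only the 19 vehicles with `am == 0`. This is because `mtcars$` is part of the `formula` and the `.data` is ignored. The correct result is in the table with `MPG 3`.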
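The same effect can be reproduced without the package. The base R sketch below evaluates two one-sided formulae against a subset of `mtcars`; the object names are illustrative, and direct use of `eval` is an assumption made for demonstration, not a description of how `summary_table` works internally.

```r
# Rows with am == 0 only, as in the filtered example.
am0_only <- mtcars[mtcars$am == 0, ]

f_bare <- ~ mean(mpg)          # column is looked up in whatever data is supplied
f_full <- ~ mean(mtcars$mpg)   # always refers to the full mtcars data frame

# Evaluate the right-hand side of each formula with am0_only's columns in scope.
bare_result <- eval(f_bare[[2]], am0_only, environment(f_bare))
full_result <- eval(f_full[[2]], am0_only, environment(f_full))

round(c(bare = bare_result, full = full_result), 2)
```

Only the bare column name responds to the subset; the `mtcars$mpg` version returns the mean over all 32 vehicles regardless of the data supplied.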
I encourage you, the end user, to use `summary_table` primarily, and use `tab_summary` as a quick tool for generating a script. It might be best if you use `tab_summary` to generate a template of the `formula`e you will want, copy the template into your script, and edit accordingly.
print(sessionInfo(), local = FALSE)
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-apple-darwin17.3.0 (64-bit)
## Running under: macOS High Sierra 10.13.4
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 dplyr_0.7.4 qwraps2_0.3.0 knitr_1.20
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 highr_0.6 plyr_1.8.4
## [4] compiler_3.4.4 pillar_1.2.1 RColorBrewer_1.1-2
## [7] influenceR_0.1.0 bindr_0.1.1 viridis_0.5.1
## [10] tools_3.4.4 digest_0.6.15 jsonlite_1.5
## [13] viridisLite_0.3.0 gtable_0.2.0 evaluate_0.10.1
## [16] memoise_1.1.0 tibble_1.4.2 rgexf_0.15.3
## [19] pkgconfig_2.0.1 rlang_0.2.0.9001 igraph_1.2.1
## [22] rstudioapi_0.7 yaml_2.1.18 gridExtra_2.3
## [25] downloader_0.4 withr_2.1.2 DiagrammeR_1.0.0
## [28] stringr_1.3.0 htmlwidgets_1.0 devtools_1.13.5
## [31] hms_0.4.2 grid_3.4.4 rprojroot_1.3-2
## [34] data.tree_0.7.5 glue_1.2.0 R6_2.2.2
## [37] Rook_1.1-1 XML_3.98-1.10 rmarkdown_1.9
## [40] ggplot2_2.2.1.9000 tidyr_0.8.0 purrr_0.2.4
## [43] readr_1.1.1 magrittr_1.5 backports_1.1.2
## [46] scales_0.5.0.9000 htmltools_0.3.6 assertthat_0.2.0
## [49] colorspace_1.3-2 brew_1.0-6 stringi_1.1.7
## [52] visNetwork_2.0.3 lazyeval_0.2.1 munsell_0.4.3