# 1. Overview

summarytools provides a coherent set of functions centered on data exploration and simple reporting. At its core reside the following four functions:

| Function | Description |
|---|---|
| freq() | Frequency Tables featuring counts, proportions, as well as missing data information |
| ctable() | Cross-Tabulations (joint frequencies) between pairs of discrete/categorical variables, featuring marginal sums as well as row, column or total proportions |
| descr() | Descriptive (Univariate) Statistics for numerical data, featuring common measures of central tendency and dispersion |
| dfSummary() | Extensive Data Frame Summaries featuring type-specific information for all variables in a data frame: univariate statistics and/or frequency distributions, bar charts or histograms, as well as missing data counts and proportions. Very useful to quickly detect anomalies and identify trends at a glance |
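As a quick orientation, the four core functions can each be called with a single argument. The following is a minimal sketch, assuming the package is installed and using the built-in iris data plus the tobacco data frame that ships with summarytools; the object names are chosen here for illustration only.

```r
library(summarytools)

# Frequency table for one categorical variable
species_freq <- freq(iris$Species)

# Cross-tabulation of two categorical variables
smoker_by_disease <- ctable(x = tobacco$smoker, y = tobacco$diseased)

# Univariate statistics for the numeric columns of a data frame
iris_descr <- descr(iris)

# Type-specific summary of every column in a data frame
iris_summary <- dfSummary(iris)

# Each object keeps its formatting attributes and can be printed later
print(species_freq)
```

Storing the results first, rather than printing them directly, makes it possible to reuse the same object with different output methods.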

## 1.1 Motivation

The package was developed with the following objectives in mind:

• Provide a coherent set of easy-to-use descriptive functions that are akin to those included in commercial statistical packages
• Offer flexibility in terms of output formats and contents
• Integrate well with software and tools commonly used for reporting (the RStudio IDE, Rmarkdown, and knitr) while also allowing for stand-alone, simple report generation

## 1.2 Redirecting Outputs

Results can be

• Displayed in the R console as plain text
• Rendered as html and shown in RStudio’s Viewer or in a Web Browser
• Written to / appended to plain text, markdown, or html files
• Used in Rmarkdown reports
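The destinations listed above map onto the method and file arguments of print() and view(). Here is a minimal sketch under the assumption that default options are in effect; the output file name is hypothetical, and the interactive calls are guarded so the example also runs in a batch session.

```r
library(summarytools)

ft <- freq(iris$Species)

# 1. Plain text in the R console (the default method, "pander")
print(ft)

# 2. html shown in RStudio's Viewer or in a web browser
if (interactive()) {
  view(ft)                       # Viewer pane
  print(ft, method = "browser")  # system web browser
}

# 3. Written to a file; the extension determines the type of content
out_file <- file.path(tempdir(), "species_freq.md")  # hypothetical file name
print(ft, file = out_file)

# 4. html rendering inside an Rmarkdown chunk (results = 'asis'):
# print(ft, method = "render")
```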

## 1.3 Other Characteristics

• Pipe-Friendly:
  • The %>% and %$% operators from the magrittr package are supported
  • The %>>% operator from the pipeR package is also supported
• Multilingual:
  • Built-in translations exist for French, Portuguese, Spanish, Russian and Turkish
  • Users can easily add custom translations or modify existing sets of translations as needed
• Weights-ready: except for dfSummary(), all core functions support sampling weights
• Flexible:
  • Default values for most function arguments can be modified using st_options(); this simplifies coding and minimizes redundancy
  • Pander options can be used for text / markdown tables
  • Bootstrap and user-defined CSS classes can be used for html tables

# 2. The Four Core Functions

## 2.1 Frequency Tables With freq()

The freq() function generates frequency tables with counts, proportions, as well as missing data information.

freq(iris$Species, plain.ascii = FALSE, style = "rmarkdown")

iris$Species
Type: Factor

|  | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
| versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
| virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
| <NA> | 0 |  |  | 0.00 | 100.00 |
| Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |

In this first example, the plain.ascii and style arguments were specified. However, since we have defined them globally with st_options() in the setup chunk, they are redundant and will be omitted from here on. See section 13 for more details on this vignette’s setup.

### 2.1.1 Ignoring Missing Data

The report.nas argument can be set to FALSE in order to ignore missing values (NA’s). Doing so has the following effects on the resulting table:

1. The <NA> row is omitted
2. The % Total and % Total Cum. columns are also omitted
3. The % Valid column simply becomes %
4. The % Valid Cum. column simply becomes % Cum.

freq(iris$Species, report.nas = FALSE, headings = FALSE)

|  | Freq | % | % Cum. |
|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 |
| versicolor | 50 | 33.33 | 66.67 |
| virginica | 50 | 33.33 | 100.00 |
| Total | 150 | 100.00 | 100.00 |

Note that the headings = FALSE parameter suppresses the heading section.

### 2.1.2 Minimal Frequency Tables

By “switching off” all optional elements, a much simpler table will be produced:
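A minimal sketch of such a call, assuming that the optional elements in question are the missing-data columns, the totals row, the cumulative proportions and the heading section (each controlled by an existing freq() argument):

```r
library(summarytools)

# Turn off NA reporting, the Total row, cumulative columns and the headings
minimal_freq <- freq(iris$Species,
                     report.nas = FALSE,
                     totals     = FALSE,
                     cumul      = FALSE,
                     headings   = FALSE)

print(minimal_freq)
```

What remains is one row per category with its count and percentage.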

tobacco$disease
Type: Character

|  | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| Hypertension | 36 | 16.22 | 16.22 | 3.60 | 3.60 |
| Cancer | 34 | 15.32 | 31.53 | 3.40 | 7.00 |
| Cholesterol | 21 | 9.46 | 40.99 | 2.10 | 9.10 |
| Heart | 20 | 9.01 | 50.00 | 2.00 | 11.10 |
| Pulmonary | 20 | 9.01 | 59.01 | 2.00 | 13.10 |
| (Other) | 91 | 40.99 | 100.00 | 9.10 | 22.20 |
| <NA> | 778 |  |  | 77.80 | 100.00 |
| Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 |

Instead of "freq", we can use "-freq" to reverse the ordering and get results ranked from lowest to highest in frequency. To account for the frequencies of unshown values, the “(Other)” row is automatically added.

### 2.1.5 Collapsible Sections

When generating html results, use the collapse = TRUE argument with print() or view() to get collapsible sections; clicking on the variable name in the heading section will collapse / reveal the frequency table (results not shown).

view(freq(tobacco), collapse = TRUE)

## 2.2 Cross-Tabulations with ctable()

ctable() generates cross-tabulations (joint frequencies) for pairs of categorical variables.

Since markdown does not support multiline table headings (but does accept html code), we’ll use the html rendering feature for this section.

Using the tobacco data frame, we’ll cross-tabulate the two categorical variables smoker and diseased.

print(ctable(x = tobacco$smoker, y = tobacco$diseased, prop = "r"), method = "render")

### Cross-Tabulation, Row Proportions
smoker * diseased
Data Frame: tobacco

| smoker / diseased | Yes | No | Total |
|---|---|---|---|
| Yes | 125 (41.9%) | 173 (58.1%) | 298 (100.0%) |
| No | 99 (14.1%) | 603 (85.9%) | 702 (100.0%) |
| Total | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%) |

### 2.2.1 Row, Column or Total Proportions

Row proportions are shown by default. To display column or total proportions, use prop = "c" or prop = "t", respectively. To omit proportions altogether, use prop = "n".

### 2.2.2 Minimal Cross-Tabulations

By “switching off” all optional features, we get a simple “2 x 2” table:

with(tobacco,
     print(ctable(x = smoker, y = diseased, prop = 'n',
                  totals = FALSE, headings = FALSE),
           method = "render"))

| smoker / diseased | Yes | No |
|---|---|---|
| Yes | 125 | 173 |
| No | 99 | 603 |

### 2.2.3 Chi-Square (𝛘²), Odds Ratio and Risk Ratio

To display the chi-square statistic, set chisq = TRUE. For 2 x 2 tables, use OR and RR to show odds ratio and risk ratio (also called relative risk), respectively. Those can be set to TRUE, in which case 95% confidence intervals will be shown; to use alternate confidence levels, use for example OR = .90.

To show how pipes can be used with summarytools, we’ll use magrittr’s %$% and %>% operators:

library(magrittr)
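A minimal sketch of such a pipeline, assuming the tobacco data frame that ships with the package: %$% exposes the columns of tobacco to ctable(), and %>% passes the resulting object on to print(). The intermediate object name is chosen here for illustration.

```r
library(summarytools)
library(magrittr)

# %$% makes smoker and diseased visible by name inside ctable()
ct <- tobacco %$%
  ctable(x = smoker, y = diseased,
         chisq = TRUE,   # chi-square statistic
         OR    = TRUE,   # odds ratio with 95% confidence interval
         RR    = TRUE,   # risk ratio with 95% confidence interval
         headings = FALSE)

# %>% hands the table to print(); "render" produces html for Rmarkdown
ct %>% print(method = "render")
```

Replacing OR = TRUE with, say, OR = .90 would request a 90% confidence interval instead.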

# 3. Grouped Statistics Using stby()

To produce optimal results, summarytools has its own version of the base by() function. It’s called stby(), and we use it exactly as we would by():

(iris_stats_by_species <- stby(data = iris,
                               INDICES = iris$Species,
                               FUN = descr,
                               stats = "common",
                               transpose = TRUE))

Non-numerical variable(s) ignored: Species

### Descriptive Statistics
iris
Group: Species = setosa
N: 50

|  | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
|---|---|---|---|---|---|---|---|
| Petal.Length | 1.46 | 0.17 | 1.00 | 1.50 | 1.90 | 50.00 | 100.00 |
| Petal.Width | 0.25 | 0.11 | 0.10 | 0.20 | 0.60 | 50.00 | 100.00 |
| Sepal.Length | 5.01 | 0.35 | 4.30 | 5.00 | 5.80 | 50.00 | 100.00 |
| Sepal.Width | 3.43 | 0.38 | 2.30 | 3.40 | 4.40 | 50.00 | 100.00 |

Group: Species = versicolor
N: 50

|  | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
|---|---|---|---|---|---|---|---|
| Petal.Length | 4.26 | 0.47 | 3.00 | 4.35 | 5.10 | 50.00 | 100.00 |
| Petal.Width | 1.33 | 0.20 | 1.00 | 1.30 | 1.80 | 50.00 | 100.00 |
| Sepal.Length | 5.94 | 0.52 | 4.90 | 5.90 | 7.00 | 50.00 | 100.00 |
| Sepal.Width | 2.77 | 0.31 | 2.00 | 2.80 | 3.40 | 50.00 | 100.00 |

Group: Species = virginica
N: 50

|  | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
|---|---|---|---|---|---|---|---|
| Petal.Length | 5.55 | 0.55 | 4.50 | 5.55 | 6.90 | 50.00 | 100.00 |
| Petal.Width | 2.03 | 0.27 | 1.40 | 2.00 | 2.50 | 50.00 | 100.00 |
| Sepal.Length | 6.59 | 0.64 | 4.90 | 6.50 | 7.90 | 50.00 | 100.00 |
| Sepal.Width | 2.97 | 0.32 | 2.20 | 3.00 | 3.80 | 50.00 | 100.00 |

## 3.1 Special Case of descr() with stby()

When used to produce split-group statistics for a single variable, stby() assembles everything into a single table instead of displaying a series of one-column tables.

with(tobacco,
     stby(data = BMI, INDICES = age.gr, FUN = descr,
          stats = c("mean", "sd", "min", "med", "max")))

### Descriptive Statistics
BMI by age.gr
Data Frame: tobacco
N: 258

|  | 18-34 | 35-50 | 51-70 | 71 + |
|---|---|---|---|---|
| Mean | 23.84 | 25.11 | 26.91 | 27.45 |
| Std.Dev | 4.23 | 4.34 | 4.26 | 4.37 |
| Min | 8.83 | 10.35 | 9.01 | 16.36 |
| Median | 24.04 | 25.11 | 26.77 | 27.52 |
| Max | 34.84 | 39.44 | 39.21 | 38.37 |

## 3.2 Using stby() With ctable()

The syntax is a little trickier for this one, so here is an example (results not shown):

stby(list(x = tobacco$smoker, y = tobacco$diseased), INDICES = tobacco$gender, FUN = ctable)

# or equivalently
with(tobacco,
stby(list(x = smoker, y = diseased),
INDICES = gender, FUN = ctable))

# 4. Grouped Statistics Using dplyr::group_by()

To create grouped statistics with freq(), descr() or dfSummary(), it is possible to use dplyr’s group_by() as an alternative to stby(). Syntactic differences aside, one key distinction is that group_by() considers NA values on the grouping variables as a valid category, albeit with a warning message suggesting the use of forcats::fct_explicit_na to make NA’s explicit in factors. Following this advice, we get:

library(dplyr)
tobacco$gender %<>% forcats::fct_explicit_na()
tobacco %>% group_by(gender) %>% descr(stats = "fivenum")

Non-numerical variable(s) ignored: age.gr, smoker, diseased, disease

### Descriptive Statistics
tobacco
Group: gender = F
N: 489

|  | BMI | age | cigs.per.day | samp.wgts |
|---|---|---|---|---|
| Min | 9.01 | 18.00 | 0.00 | 0.86 |
| Q1 | 22.98 | 34.00 | 0.00 | 0.86 |
| Median | 25.87 | 50.00 | 0.00 | 1.04 |
| Q3 | 29.48 | 66.00 | 10.50 | 1.05 |
| Max | 39.44 | 80.00 | 40.00 | 1.06 |

Group: gender = M
N: 489

|  | BMI | age | cigs.per.day | samp.wgts |
|---|---|---|---|---|
| Min | 8.83 | 18.00 | 0.00 | 0.86 |
| Q1 | 22.52 | 34.00 | 0.00 | 0.86 |
| Median | 25.14 | 49.50 | 0.00 | 1.04 |
| Q3 | 27.96 | 66.00 | 11.00 | 1.05 |
| Max | 36.76 | 80.00 | 40.00 | 1.06 |

Group: gender = (Missing)
N: 22

|  | BMI | age | cigs.per.day | samp.wgts |
|---|---|---|---|---|
| Min | 20.24 | 19.00 | 0.00 | 0.86 |
| Q1 | 24.97 | 36.00 | 0.00 | 1.04 |
| Median | 27.16 | 55.50 | 0.00 | 1.05 |
| Q3 | 30.23 | 64.00 | 10.00 | 1.05 |
| Max | 32.43 | 80.00 | 28.00 | 1.06 |

# 5. Creating Tidy Tables With tb()

When generating freq() or descr() tables, it is possible to turn the results into “tidy” tables with the use of the tb() function (think of tb as a diminutive for tibble). For example:

library(magrittr)
iris %>% descr(stats = "common") %>% tb()

# A tibble: 4 x 8
variable      mean    sd   min   med   max n.valid pct.valid
<chr>        <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
1 Petal.Length  3.76 1.77    1    4.35   6.9     150       100
2 Petal.Width   1.20 0.762   0.1  1.3    2.5     150       100
3 Sepal.Length  5.84 0.828   4.3  5.8    7.9     150       100
4 Sepal.Width   3.06 0.436   2    3      4.4     150       100

iris$Species %>% freq(cumul = FALSE, report.nas = FALSE) %>% tb()
# A tibble: 3 x 3
Species     freq   pct
<fct>      <dbl> <dbl>
1 setosa        50  33.3
2 versicolor    50  33.3
3 virginica     50  33.3

By definition, no total rows are part of tidy tables, and the row names are converted to a regular column. Note that for displaying tibbles using Rmarkdown, the knitr chunk option ‘results’ should be set to “markup” instead of “asis”.
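Because the result of tb() is a regular tibble, it can be passed straight to other tidyverse verbs. A minimal sketch, assuming dplyr is installed and relying on the column names shown in the descr() output above; the 0.5 threshold is arbitrary and only serves the illustration.

```r
library(summarytools)
library(dplyr)

# Keep only the variables whose standard deviation exceeds 0.5
high_spread <- iris %>%
  descr(stats = "common") %>%
  tb() %>%
  filter(sd > 0.5) %>%
  select(variable, mean, sd)

high_spread
```

Going by the standard deviations listed earlier, Sepal.Width (0.436) is the only variable dropped by this filter.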

## 5.1 Tidy Split-Group Statistics

Here are some examples showing how lists created using stby() or group_by() can be transformed into tidy tibbles.

grouped_descr <- stby(data = exams, INDICES = exams$gender,
                      FUN = descr, stats = "common")
grouped_descr %>% tb()

# A tibble: 12 x 9
gender variable   mean    sd   min   med   max n.valid pct.valid
<fct>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
1 Girl   economics  72.5  7.79  62.3  70.2  89.6      14      93.3
2 Girl   english    73.9  9.41  58.3  71.8  93.1      14      93.3
3 Girl   french     71.1 12.4   44.8  68.4  93.7      14      93.3
4 Girl   geography  67.3  8.26  50.4  67.3  78.9      15     100
5 Girl   history    71.2  9.17  53.9  72.9  86.4      15     100
6 Girl   math       73.8  9.03  55.6  74.8  86.3      14      93.3
7 Boy    economics  75.2  9.40  60.5  71.7  94.2      15     100
8 Boy    english    77.8  5.94  69.6  77.6  90.2      15     100
9 Boy    french     76.6  8.63  63.2  74.8  94.7      15     100
10 Boy    geography  73   12.4   47.2  71.2  96.3      14      93.3
11 Boy    history    74.4 11.2   54.4  72.6  93.5      15     100
12 Boy    math       73.3  9.68  60.5  72.2  93.2      14      93.3

The order parameter controls row ordering:

grouped_descr %>% tb(order = 2)

# A tibble: 12 x 9
gender variable   mean    sd   min   med   max n.valid pct.valid
<fct>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
1 Girl   economics  72.5  7.79  62.3  70.2  89.6      14      93.3
2 Boy    economics  75.2  9.40  60.5  71.7  94.2      15     100
3 Girl   english    73.9  9.41  58.3  71.8  93.1      14      93.3
4 Boy    english    77.8  5.94  69.6  77.6  90.2      15     100
5 Girl   french     71.1 12.4   44.8  68.4  93.7      14      93.3
6 Boy    french     76.6  8.63  63.2  74.8  94.7      15     100
7 Girl   geography  67.3  8.26  50.4  67.3  78.9      15     100
8 Boy    geography  73   12.4   47.2  71.2  96.3      14      93.3
9 Girl   history    71.2  9.17  53.9  72.9  86.4      15     100
10 Boy    history    74.4 11.2   54.4  72.6  93.5      15     100
11 Girl   math       73.8  9.03  55.6  74.8  86.3      14      93.3
12 Boy    math       73.3  9.68  60.5  72.2  93.2      14      93.3

Setting order = 3 changes the order of the sort variables exactly as with order = 2, but it also reorders the columns:

grouped_descr %>% tb(order = 3)

# A tibble: 12 x 9
variable  gender  mean    sd   min   med   max n.valid pct.valid
<chr>     <fct>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
1 economics Girl    72.5  7.79  62.3  70.2  89.6      14      93.3
2 economics Boy     75.2  9.40  60.5  71.7  94.2      15     100
3 english   Girl    73.9  9.41  58.3  71.8  93.1      14      93.3
4 english   Boy     77.8  5.94  69.6  77.6  90.2      15     100
5 french    Girl    71.1 12.4   44.8  68.4  93.7      14      93.3
6 french    Boy     76.6  8.63  63.2  74.8  94.7      15     100
7 geography Girl    67.3  8.26  50.4  67.3  78.9      15     100
8 geography Boy     73   12.4   47.2  71.2  96.3      14      93.3
9 history   Girl    71.2  9.17  53.9  72.9  86.4      15     100
10 history   Boy     74.4 11.2   54.4  72.6  93.5      15     100
11 math      Girl    73.8  9.03  55.6  74.8  86.3      14      93.3
12 math      Boy     73.3  9.68  60.5  72.2  93.2      14      93.3

For more details, see ?tb.

## 5.2 A Bridge to Other Packages

summarytools objects are not always compatible with packages focused on table formatting, such as formattable or kableExtra. However, tb() can be used as a “bridge”, an intermediary step turning freq() and descr() objects into simple tables that any package can work with. Here is an example using kableExtra:

library(kableExtra)
library(magrittr)
stby(iris, iris$Species, descr, stats = "fivenum") %>%
tb(order = 3) %>%
kable(format = "html", digits = 2) %>%
collapse_rows(columns = 1, valign = "top")
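
| variable | Species | min | q1 | med | q3 | max |
|---|---|---|---|---|---|---|
| Petal.Length | setosa | 1.0 | 1.4 | 1.50 | 1.6 | 1.9 |
|  | versicolor | 3.0 | 4.0 | 4.35 | 4.6 | 5.1 |
|  | virginica | 4.5 | 5.1 | 5.55 | 5.9 | 6.9 |
| Petal.Width | setosa | 0.1 | 0.2 | 0.20 | 0.3 | 0.6 |
|  | versicolor | 1.0 | 1.2 | 1.30 | 1.5 | 1.8 |
|  | virginica | 1.4 | 1.8 | 2.00 | 2.3 | 2.5 |
| Sepal.Length | setosa | 4.3 | 4.8 | 5.00 | 5.2 | 5.8 |
|  | versicolor | 4.9 | 5.6 | 5.90 | 6.3 | 7.0 |
|  | virginica | 4.9 | 6.2 | 6.50 | 6.9 | 7.9 |
| Sepal.Width | setosa | 2.3 | 3.2 | 3.40 | 3.7 | 4.4 |
|  | versicolor | 2.0 | 2.5 | 2.80 | 3.0 | 3.4 |
|  | virginica | 2.2 | 2.8 | 3.00 | 3.2 | 3.8 |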

# 6. Redirecting Output to Files

Using the file argument with print() or view(), we can write outputs to a file, be it html, Rmd, md, or just plain text (txt). The file extension is used to determine the type of content to write out.

view(iris_stats_by_species, file = "~/iris_stats_by_species.html")
view(iris_stats_by_species, file = "~/iris_stats_by_species.md")

A Note About PDF documents

There is no direct way to create a PDF file with summarytools. One option is to generate an html file and convert it to PDF using Pandoc or wkhtmltopdf (the latter gives better results than Pandoc with dfSummary() output). Another option is to create an Rmd document using PDF as the output format, but with a caveat: displaying graphs with dfSummary() will cause vertical misalignment (we hope to resolve this issue in a future version).

## 6.1 Appending Output Files

The append argument allows adding content to existing files generated by summarytools. This is useful if we wish to include several statistical tables in a single file. It is a quick alternative to creating an Rmd document.
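A minimal sketch of appending, assuming a markdown output file; the file name is hypothetical and a temporary directory is used so nothing is left behind in the working directory.

```r
library(summarytools)

report <- file.path(tempdir(), "iris_report.md")  # hypothetical file name

# The first call creates the file
print(freq(iris$Species), file = report)

# A later call adds a second table to the same file instead of overwriting it
print(descr(iris), file = report, append = TRUE)
```

The same pattern applies with view() in place of print().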

# 7. Global Options

The following options can be set with st_options():

## 7.1 General Options

| Option name | Default | Note |
|---|---|---|
| style | “simple” | Set to “rmarkdown” in .Rmd documents |
| plain.ascii | TRUE | Set to FALSE in .Rmd documents |
| round.digits | 2 | Number of decimals to show |
| footnote | “default” | Personalize, or set to NA to omit |
| display.labels | TRUE | Show variable / data frame labels in headings |
| bootstrap.css (*) | TRUE | Include Bootstrap 4 CSS in html output files |
| custom.css | NA | Path to your own CSS file |
| escape.pipe | FALSE | Useful for some Pandoc conversions |
| subtitle.emphasis | TRUE | Controls headings formatting |
| lang | “en” | Language (always 2-letter, lowercase) |

(*) Set to FALSE in Shiny apps

## 7.2 Function-Specific Options

| Option name | Default | Note |
|---|---|---|
| freq.totals | TRUE | Display totals row in freq() |
| freq.report.nas | TRUE | Display <NA> row and “valid” columns |
| freq.silent | FALSE | Hide console messages |
| ctable.prop | “r” | Display row proportions by default |
| ctable.totals | TRUE | Show marginal totals |
| descr.stats | “all” | “fivenum”, “common” or vector of stats |
| descr.transpose | FALSE | Display stats in columns instead of rows |
| descr.silent | FALSE | Hide console messages |
| dfSummary.varnumbers | TRUE | Show variable numbers in 1st col. |
| dfSummary.labels.col | TRUE | Show variable labels when present |
| dfSummary.graph.col | TRUE | Show graphs |
| dfSummary.valid.col | TRUE | Include the Valid column in the output |
| dfSummary.na.col | TRUE | Include the Missing column in the output |
| dfSummary.graph.magnif | 1 | Zoom factor for bar plots and histograms |
| dfSummary.silent | FALSE | Hide console messages |
| tmp.img.dir | NA | Directory to store temporary images |

Examples

st_options()                      # Display all global options values
st_options('round.digits')        # Display the value of a specific option
st_options(style = 'rmarkdown',   # Set the value of one or several options
footnote = NA)         # Turn off the footnote for all html output

# 8. Overriding Formatting Attributes

When a summarytools object is created, its formatting attributes are stored within it. However, we can override most of them when using print() or view().

## 8.1 Overriding Function-Specific Arguments

This table indicates what arguments can be used with print() or view() to override formatting attributes:

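| Argument | freq | ctable | descr | dfSummary |
|---|---|---|---|---|
| style | x | x | x | x |
| round.digits | x | x | x |  |
| plain.ascii | x | x | x | x |
| justify | x | x | x | x |
| headings | x | x | x | x |
| display.labels | x | x | x | x |
| varnumbers |  |  |  | x |
| labels.col |  |  |  | x |
| graph.col |  |  |  | x |
| valid.col |  |  |  | x |
| na.col |  |  |  | x |
| col.widths |  |  |  | x |
| totals | x | x |  |  |
| report.nas | x |  |  |  |
| display.type | x |  |  |  |
| missing | x |  |  |  |
| split.tables (*) | x | x | x | x |
| caption (*) | x | x | x | x |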

(*) These are pander options

## 8.2 Overriding Heading Contents

To change the information shown in the heading section, use the following arguments with print() or view():

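| Argument | freq | ctable | descr | dfSummary |
|---|---|---|---|---|
| Data.frame | x | x | x | x |
| Data.frame.label | x | x | x | x |
| Variable | x | x | x |  |
| Variable.label | x | x | x |  |
| Group | x | x | x | x |
| date | x | x | x | x |
| Weights | x |  | x |  |
| Data.type | x |  |  |  |
| Row.variable |  | x |  |  |
| Col.variable |  | x |  |  |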

### Example

In the following example, we will override three formatting attributes and one heading attribute:

(age_stats <- freq(tobacco$age.gr))

### Frequencies
tobacco$age.gr
Type: Factor

|  | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| 18-34 | 258 | 26.46 | 26.46 | 25.80 | 25.80 |
| 35-50 | 241 | 24.72 | 51.18 | 24.10 | 49.90 |
| 51-70 | 317 | 32.51 | 83.69 | 31.70 | 81.60 |
| 71 + | 159 | 16.31 | 100.00 | 15.90 | 97.50 |
| <NA> | 25 |  |  | 2.50 | 100.00 |
| Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 |

print(age_stats, report.nas = FALSE, totals = FALSE, display.type = FALSE,
      Variable.label = "Age Group")

iris$Species
Type: Facteur

|  | Fréq. | % Valide | % Valide cum. | % Total | % Total cum. |
|---|---|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
| versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
| virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
| <NA> | 0 |  |  | 0.00 | 100.00 |
| Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |

## 12.1 Non-UTF-8 Locales

On most Windows systems, it will be necessary to change the LC_CTYPE element of the locale settings if the character set is not included in the system’s default locale. For instance, in order to get good results with the Russian language in a “latin1” environment, we need to do the following:

Sys.setlocale("LC_CTYPE", "russian")
st_options(lang = 'ru')

Then to go back to default settings:

Sys.setlocale("LC_CTYPE", "")
st_options(lang = "en")

## 12.2 Defining and Using Custom Translations

Using the function use_custom_lang(), it is possible to add your own set of translations. To achieve this, get the csv template, customize its roughly 70 items, and call use_custom_lang(), giving it as sole argument the path to the edited csv template. Note that such custom translations will not persist across R sessions. This means that you should always have this csv file handy for future use.

## 12.3 Defining Specific Keywords

Sometimes, all you might want to do is change just a few keywords – for instance, you could prefer using “N” instead of “Freq” in the title row of freq() tables. For this, use define_keywords(). Calling this function without any arguments will bring up, on systems that support graphical devices (the vast majority, that is), an editable window in which you can modify only the desired item(s).

After closing the edit window, you will be able to export the resulting “custom language” into a csv file that you can reuse in the future by calling use_custom_lang().

It is also possible to programmatically define one or several keywords using define_keywords(). For instance:

define_keywords(freq = "N")

See ?define_keywords for more details.

# 13. This Vignette’s Setup

Knowing how this vignette is configured can help users get started with using summarytools in Rmarkdown documents.

### The yaml Section

The output element is the one that matters:

## ---
## output:
##   rmarkdown::html_vignette:
##     css:
##     - !expr system.file("rmarkdown/templates/html_vignette/resources/vignette.css",
##                         package = "rmarkdown")
## ---

### The Setup Chunk

## {r setup, include=FALSE}
## library(knitr)
## opts_chunk$set(results = 'asis',      # Can also be set at the chunk-level
##                comment = NA,
##                prompt  = FALSE,
##                cache   = FALSE)
## library(summarytools)
## st_options(plain.ascii = FALSE,        # Always use this option in Rmd documents
##            style        = "rmarkdown", # Always use this option in Rmd documents
##            footnote     = NA,          # Makes html-rendered results more concise
##            subtitle.emphasis = FALSE)  # Improves layout with some rmarkdown themes
## 

### Including summarytools’ CSS

The needed CSS is automatically added to html files created using print() or view() with the file argument. But in Rmarkdown documents, this needs to be done explicitly:

## {r, echo=FALSE}
## st_css()
## 

# 14. Conclusion

The package comes with no guarantees. It is a work in progress and feedback is always welcome. Please open an issue on GitHub if you find a bug or wish to submit a feature request.

### Stay Up-to-date

Check out the GitHub project’s page; from there you can see the latest updates and also submit feature requests.

For a preview of what’s coming in the next release, have a look at the development branch.