Introduction to summarytools

Dominic Comtois

2018-10-07

summarytools is an R package providing tools to neatly and quickly summarize data. It can also make R a little easier to learn and use. Four functions are at the core of the package: freq(), ctable(), descr() and dfSummary(), each described below.

Emphasis has been put on both what is presented and how, so that the package can serve as both a data exploration tool and a reporting tool. It can be used on its own for minimal reports, or alongside larger tool sets such as RStudio’s rmarkdown and knitr.

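Building on the strengths of pander and htmltools, the outputs produced by summarytools can be displayed as text in the R console, included in Rmarkdown documents, rendered as html in RStudio’s Viewer or a web browser, or written to files.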

Latest Improvements

Version 0.8.3 brings several improvements to summarytools, notably:

summarytools’ Core Functions

1 - freq() : Frequency Tables

The freq() function generates a table of frequencies with counts and proportions. Since this page uses markdown rendering, we’ll set style = 'rmarkdown' to take advantage of it.

library(summarytools)
freq(iris$Species, style = "rmarkdown")

Frequencies

Variable: iris$Species
Type: Factor (unordered)

             Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
setosa         50     33.33          33.33     33.33          33.33
versicolor     50     33.33          66.67     33.33          66.67
virginica      50     33.33         100.00     33.33         100.00
<NA>            0                               0.00         100.00
Total         150    100.00         100.00    100.00         100.00

If we are not concerned with missing data, we can set report.nas = FALSE:
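The call behind the table that follows is not shown; a minimal sketch would be the one below. Note that omit.headings = TRUE is an assumption here, inferred from the fact that the heading block is absent from the output.

```r
library(summarytools)

# report.nas = FALSE drops the <NA> row and the "% Total" columns;
# omit.headings = TRUE (assumed) suppresses the "Frequencies" heading block
species_freq <- freq(iris$Species, report.nas = FALSE,
                     style = "rmarkdown", omit.headings = TRUE)
species_freq
```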

             Freq        %    % Cum.
setosa         50    33.33     33.33
versicolor     50    33.33     66.67
virginica      50    33.33    100.00
Total         150   100.00    100.00

We could simplify further and omit the Totals row by setting totals = FALSE.

To get familiar with the output styles, try different values for style= and see how results look in the console.
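As a quick sketch of both suggestions: only "rmarkdown" appears as a style value in this text, so "simple" and "grid" below are assumed from the package documentation and should be checked against your installed version.

```r
library(summarytools)

# Drop both the <NA> row and the Total row
no_totals <- freq(iris$Species, report.nas = FALSE, totals = FALSE)
no_totals

# Print the same table once per style to compare them in the console
# ("simple" and "grid" are assumed style names)
for (st in c("simple", "grid", "rmarkdown")) {
  print(freq(iris$Species, report.nas = FALSE, style = st))
}
```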

2 - ctable() : Cross-Tabulations

We’ll now use a sample data frame called tobacco, which is included in the package. We want to cross-tabulate the two categorical variables smoker and diseased. By default, ctable() gives row proportions, but we’ll include the full syntax anyway.

Since markdown has no support (yet) for multi-line headings, we’ll show an image of the resulting html table.

with(tobacco, print(ctable(smoker, diseased, prop = 'r'), method = 'render'))

Cross-Tabulation / Row Proportions

Variables: smoker * diseased
Data Frame: tobacco
diseased
smoker Yes No Total
Yes 125 (41.95%) 173 (58.05%)  298 (100.00%)
No  99 (14.10%) 603 (85.90%)  702 (100.00%)
Total 224 (22.40%) 776 (77.60%) 1000 (100.00%)

Notice that instead of ctable(tobacco$smoker, tobacco$diseased, ...), we used the with() function, making the syntax less redundant.

It is possible to display column, total, or no proportions at all. We can also omit the marginal totals to have a simple “2 x 2” table.

with(tobacco, 
     print(ctable(smoker, diseased, prop = 'n', totals = FALSE), 
           omit.headings = TRUE, method = "render"))
diseased
smoker Yes No
Yes 125 173
No 99 603
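For the other two proportion types mentioned above, a sketch along the same lines might look like this. The values "c" and "t" for prop are assumed by analogy with 'n' in the example above and the row default; check ?ctable for the exact set.

```r
library(summarytools)

# prop = "c": column proportions; prop = "t": proportions of the grand total
# (both values are assumptions, see the note above)
col_props   <- with(tobacco, ctable(smoker, diseased, prop = "c"))
total_props <- with(tobacco, ctable(smoker, diseased, prop = "t"))

print(col_props)
print(total_props)
```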

3 - descr() : Descriptive Univariate Stats

The descr() function generates common central tendency statistics and measures of dispersion for numerical data. It can handle single vectors as well as data frames, in which case it just ignores non-numerical columns (and displays a message to that effect).

descr(iris, style = "rmarkdown")
Non-numerical variable(s) ignored: Species

Descriptive Statistics

Data Frame: iris
N: 150

  Sepal.Length Sepal.Width Petal.Length Petal.Width
Mean 5.84 3.06 3.76 1.20
Std.Dev 0.83 0.44 1.77 0.76
Min 4.30 2.00 1.00 0.10
Q1 5.10 2.80 1.60 0.30
Median 5.80 3.00 4.35 1.30
Q3 6.40 3.30 5.10 1.80
Max 7.90 4.40 6.90 2.50
MAD 1.04 0.44 1.85 1.04
IQR 1.30 0.50 3.50 1.50
CV 0.14 0.14 0.47 0.64
Skewness 0.31 0.31 -0.27 -0.10
SE.Skewness 0.20 0.20 0.20 0.20
Kurtosis -0.61 0.14 -1.42 -1.36
N.Valid 150.00 150.00 150.00 150.00
Pct.Valid 100.00 100.00 100.00 100.00

Transposing and selecting only the stats you need

If your eyes/brain prefer seeing things the other way around, just use transpose = TRUE. Here, we also select only the statistics we wish to see, and specify omit.headings = TRUE to avoid reprinting the same information as above.
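The call is not shown here; a plausible sketch that matches the output below is the following. The stats keywords ("mean", "sd", "min", "med", "max") are assumed from the package documentation.

```r
library(summarytools)

# Keep five statistics only and put variables in rows
iris_descr_t <- descr(iris,
                      stats = c("mean", "sd", "min", "med", "max"),
                      transpose = TRUE,
                      omit.headings = TRUE,
                      style = "rmarkdown")
iris_descr_t
```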

Non-numerical variable(s) ignored: Species
  Mean Std.Dev Min Median Max
Sepal.Length 5.84 0.83 4.30 5.80 7.90
Sepal.Width 3.06 0.44 2.00 3.00 4.40
Petal.Length 3.76 1.77 1.00 4.35 6.90
Petal.Width 1.20 0.76 0.10 1.30 2.50

4 - dfSummary() : Data Frame Summaries

dfSummary() collects information about all variables in a data frame and displays it in a single, legible table.

To generate a summary report and have it displayed in RStudio’s Viewer pane (or in your default Web Browser if working with another interface), we simply do this:

view(dfSummary(iris))

It is also possible to use dfSummary() in Rmarkdown documents. In this next example, note that due to rmarkdown compatibility issues, histograms are not shown. We’re working on this. Further down, we’ll see how to use html rendering to get around this problem.

dfSummary(tobacco, plain.ascii = FALSE, style = "grid")

Data Frame Summary

tobacco
N: 1000

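+----+---------------+--------------------------+---------------------+------------------+---------+---------+
| No | Variable      | Stats / Values           | Freqs (% of Valid)  | Text Graph       | Valid   | Missing |
+====+===============+==========================+=====================+==================+=========+=========+
| 1  | gender        | 1. F                     | 489 (50.0%)         | IIIIIIIIIIIIIIII | 978     | 22      |
|    | [factor]      | 2. M                     | 489 (50.0%)         | IIIIIIIIIIIIIIII | (97.8%) | (2.2%)  |
+----+---------------+--------------------------+---------------------+------------------+---------+---------+
| 2  | age           | mean (sd) : 49.6 (18.29) | 63 distinct values  |                  | 975     | 25      |
|    | [numeric]     | min < med < max :        |                     |                  | (97.5%) | (2.5%)  |
|    |               | 18 < 50 < 80             |                     |                  |         |         |
|    |               | IQR (CV) : 32 (0.37)     |                     |                  |         |         |
+----+---------------+--------------------------+---------------------+------------------+---------+---------+
| 3  | age.gr        | 1. 18-34                 | 258 (26.5%)         | IIIIIIIIIIIII    | 975     | 25      |
|    | [factor]      | 2. 35-50                 | 241 (24.7%)         | IIIIIIIIIIII     | (97.5%) | (2.5%)  |
|    |               | 3. 51-70                 | 317 (32.5%)         | IIIIIIIIIIIIIIII |         |         |
|    |               | 4. 71 +                  | 159 (16.3%)         | IIIIIIII         |         |         |
+----+---------------+--------------------------+---------------------+------------------+---------+---------+
| 4  | BMI           | mean (sd) : 25.73 (4.49) | 974 distinct values |                  | 974     | 26      |
|    | [numeric]     | min < med < max :        |                     |                  | (97.4%) | (2.6%)  |
|    |               | 8.83 < 25.62 < 39.44     |                     |                  |         |         |
|    |               | IQR (CV) : 5.72 (0.17)   |                     |                  |         |         |
+----+---------------+--------------------------+---------------------+------------------+---------+---------+
| 5  | smoker        | 1. Yes                   | 298 (29.8%)         | IIIIII           | 1000    | 0       |
|    | [factor]      | 2. No                    | 702 (70.2%)         | IIIIIIIIIIIIIIII | (100%)  | (0%)    |
+----+---------------+--------------------------+---------------------+------------------+---------+---------+
| 6  | cigs.per.day  | mean (sd) : 6.78 (11.88) | 37 distinct values  |                  | 965     | 35      |
|    | [numeric]     | min < med < max :        |                     |                  | (96.5%) | (3.5%)  |
|    |               | 0 < 0 < 40               |                     |                  |         |         |
|    |               | IQR (CV) : 11 (1.75)     |                     |                  |         |         |
+----+---------------+--------------------------+---------------------+------------------+---------+---------+
| 7  | diseased      | 1. Yes                   | 224 (22.4%)         | IIII             | 1000    | 0       |
|    | [factor]      | 2. No                    | 776 (77.6%)         | IIIIIIIIIIIIIIII | (100%)  | (0%)    |
+----+---------------+--------------------------+---------------------+------------------+---------+---------+
| 8  | disease       | 1. Hypertension          | 36 (16.2%)          | IIIIIIIIIIIIIIII | 222     | 778     |
|    | [character]   | 2. Cancer                | 34 (15.3%)          | IIIIIIIIIIIIIII  | (22.2%) | (77.8%) |
|    |               | 3. Cholesterol           | 21 ( 9.5%)          | IIIIIIIII        |         |         |
|    |               | 4. Heart                 | 20 ( 9.0%)          | IIIIIIII         |         |         |
|    |               | 5. Pulmonary             | 20 ( 9.0%)          | IIIIIIII         |         |         |
|    |               | 6. Musculoskeletal       | 19 ( 8.6%)          | IIIIIIII         |         |         |
|    |               | 7. Diabetes              | 14 ( 6.3%)          | IIIIII           |         |         |
|    |               | 8. Hearing               | 14 ( 6.3%)          | IIIIII           |         |         |
|    |               | 9. Digestive             | 12 ( 5.4%)          | IIIII            |         |         |
|    |               | 10. Hypotension          | 11 ( 5.0%)          | IIII             |         |         |
|    |               | [ 3 others ]             | 21 ( 9.5%)          | IIIIIIIII        |         |         |
+----+---------------+--------------------------+---------------------+------------------+---------+---------+
| 9  | samp.wgts     | mean (sd) : 1 (0.08)     | 0.86!: 267 (26.7%)  | IIIIIIIIIIIII    | 1000    | 0       |
|    | [numeric]     | min < med < max :        | 1.04!: 249 (24.9%)  | IIIIIIIIIIII     | (100%)  | (0%)    |
|    |               | 0.86 < 1.04 < 1.06       | 1.05!: 324 (32.4%)  | IIIIIIIIIIIIIIII |         |         |
|    |               | IQR (CV) : 0.19 (0.08)   | 1.06!: 160 (16.0%)  | IIIIIII          |         |         |
|    |               |                          | ! rounded           |                  |         |         |
+----+---------------+--------------------------+---------------------+------------------+---------+---------+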

The print() and view() Functions

summarytools has a generic print method, print.summarytools(). By default, its method argument is set to 'pander'. One of the ways in which view() is useful is that we can use it to easily display html outputs in RStudio’s Viewer. In this case, the view() function simply acts as a wrapper around the generic print() function, specifying the method = 'viewer' for us. When used outside RStudio, the method falls back on 'browser' and the report is fired up in the system’s default browser.
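To make the relationship concrete, here is a small sketch based on the description above. The html calls are wrapped in interactive() only so that the snippet can also be run in a batch session without opening a viewer; that guard is our addition, not something the package requires.

```r
library(summarytools)

iris_summary <- dfSummary(iris)

# Default: text output through pander, i.e. method = "pander"
print(iris_summary)

if (interactive()) {
  # As described above, these two calls should give the same html output
  view(iris_summary)
  print(iris_summary, method = "viewer")
}
```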

Using by() to Show Results By Groups

With freq() and descr(), you can use R’s base function by() to show statistics split by a grouping (categorical) variable. by() returns a list containing as many summarytools objects as there are categories in the grouping variable.

To properly display the content of that list, we use the view() function. Using print(), while technically possible, will not give results as satisfactory.

Example

Using the iris data frame, we will display descriptive statistics broken down by Species.
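The code for this example is not shown; a sketch that should produce output like the tables below follows. The object name iris_stats_by_species is the one used later in this text; the stats keywords and the method = "pander" argument to view() are assumed from the package documentation.

```r
library(summarytools)

# by() applies descr() to each Species subset and returns a list of results
iris_stats_by_species <- by(data = iris,
                            INDICES = iris$Species,
                            FUN = descr,
                            stats = c("mean", "sd", "min", "med", "max"),
                            transpose = TRUE)

# view() lays out the whole list; method = "pander" keeps the output as text
view(iris_stats_by_species, method = "pander", style = "rmarkdown")
```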

Descriptive Statistics

Data Frame: iris
Group: Species = setosa
N: 50

  Mean Std.Dev Min Median Max
Sepal.Length 5.01 0.35 4.30 5.00 5.80
Sepal.Width 3.43 0.38 2.30 3.40 4.40
Petal.Length 1.46 0.17 1.00 1.50 1.90
Petal.Width 0.25 0.11 0.10 0.20 0.60

Group: Species = versicolor
N: 50

  Mean Std.Dev Min Median Max
Sepal.Length 5.94 0.52 4.90 5.90 7.00
Sepal.Width 2.77 0.31 2.00 2.80 3.40
Petal.Length 4.26 0.47 3.00 4.35 5.10
Petal.Width 1.33 0.20 1.00 1.30 1.80

Group: Species = virginica
N: 50

  Mean Std.Dev Min Median Max
Sepal.Length 6.59 0.64 4.90 6.50 7.90
Sepal.Width 2.97 0.32 2.20 3.00 3.80
Petal.Length 5.55 0.55 4.50 5.55 6.90
Petal.Width 2.03 0.27 1.40 2.00 2.50

To see an html version of these results, we’d simply do this (results not shown):
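A minimal sketch of that call, with the grouped object rebuilt first so the snippet stands on its own (the interactive() guard is our addition, to avoid opening a viewer in batch runs):

```r
library(summarytools)

iris_stats_by_species <- by(data = iris, INDICES = iris$Species,
                            FUN = descr, transpose = TRUE)

# With its default method, view() renders html in the Viewer pane or browser
if (interactive()) {
  view(iris_stats_by_species)
}
```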

Special Case - Using descr() With by() For A Single Variable

Instead of showing several tables having only one column each, the view() function will assemble the results into a single table:
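A sketch of a call that should give the table below; the object name BMI_by_age is hypothetical and the stats keywords are assumed from the package documentation.

```r
library(summarytools)

# One numeric variable (BMI) split by one grouping variable (age.gr)
BMI_by_age <- with(tobacco,
                   by(BMI, age.gr, descr,
                      stats = c("mean", "sd", "min", "med", "max")))

# view() merges the four one-column results into a single table
view(BMI_by_age, method = "pander", style = "rmarkdown")
```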

Descriptive Statistics

Variable: tobacco$BMI by age.gr

  18-34 35-50 51-70 71 +
Mean 23.84 25.11 26.91 27.45
Std.Dev 4.23 4.34 4.26 4.37
Min 8.83 10.35 9.01 16.36
Median 24.04 25.11 26.77 27.52
Max 34.84 39.44 39.21 38.37

The transposed version looks like this:
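Presumably the only change needed is to pass transpose = TRUE through by() to descr(), as in this sketch (BMI_by_age_t is a hypothetical name):

```r
library(summarytools)

BMI_by_age_t <- with(tobacco,
                     by(BMI, age.gr, descr,
                        stats = c("mean", "sd", "min", "med", "max"),
                        transpose = TRUE))

view(BMI_by_age_t, method = "pander", style = "rmarkdown")
```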

  Mean Std.Dev Min Median Max
18-34 23.84 4.23 8.83 24.04 34.84
35-50 25.11 4.34 10.35 25.11 39.44
51-70 26.91 4.26 9.01 26.77 39.21
71 + 27.45 4.37 16.36 27.52 38.37

Using lapply() to Show Several freq() tables at once

As is the case for by(), the view() function is essential for making results nice and tidy.

tobacco_subset <- tobacco[ ,c("gender", "age.gr", "smoker")]
freq_tables <- lapply(tobacco_subset, freq)
view(freq_tables, footnote = NA, file = 'freq-tables.html')

Using summarytools in Rmarkdown documents

As we have seen, summarytools can generate both text (including rmarkdown) and html results. Both can be used in Rmarkdown, according to your preferences. There is a vignette dedicated to this, which gives several examples, but if you’re in a hurry, here are a few tips to get started:

    knitr::opts_chunk$set(echo = TRUE, results = 'asis')

          Refer to this page for more on knitr’s options.
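As an illustration of what a chunk body might contain once results = 'asis' is set, here is a sketch reusing settings seen earlier on this page (plain.ascii = FALSE from the dfSummary() example, method = 'render' from the ctable() example). The object names are hypothetical.

```r
library(summarytools)

# Text route: markdown output that knitr passes through untouched
gender_freq <- freq(tobacco$gender, style = "rmarkdown", plain.ascii = FALSE)
gender_freq

# html route: method = "render" produces html for knitr to embed
smoker_xtab <- ctable(tobacco$smoker, tobacco$diseased)
print(smoker_xtab, method = "render")
```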

Example

# ---
# title: "RMarkdown using summarytools"
# output: 
#   html_document: 
#     css: C:/R/win-library/3.4/summarytools/includes/stylesheets/summarytools.css
# ---

For more details on using summarytools in Rmarkdown documents, please refer to the corresponding vignette.

Writing Output to Files

The console will always tell you the location of the temporary html file that is created in the process. However, you can specify the name and location of that file explicitly if you need to reuse it later on:

view(iris_stats_by_species, file = "~/iris_stats_by_species.html")

Based on the file extension you provide (.html vs others), summarytools will use the appropriate method; there is no need to specify the method argument.
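For a text file, a sketch would be the following; the file name is arbitrary and tempdir() is used only to keep the example from writing into your working directory.

```r
library(summarytools)

iris_descr <- descr(iris)

# Non-html extension: the table is written as text
out_txt <- file.path(tempdir(), "iris_descr.txt")
print(iris_descr, file = out_txt)

readLines(out_txt)
```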

Appending output files

There is also an append = logical argument for adding content to existing reports, both text/Rmarkdown and html. This is useful if you want to quickly include several statistical tables in a single file. It is a fast alternative to creating an .Rmd document if you don’t need the extra content that the latter allows.
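A short sketch of building a two-table text report this way (the file name is arbitrary):

```r
library(summarytools)

report <- file.path(tempdir(), "iris_report.txt")

print(freq(iris$Species), file = report)                 # creates the file
print(descr(iris),        file = report, append = TRUE)  # adds a second table

report_lines <- readLines(report)
```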

Global options

Version 0.8.3 introduced the following set of global options:

Examples
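The examples themselves are missing here. The sketch below assumes the st_options(option, value) calling form and the option name 'round.digits'; both should be checked against ?st_options for your installed version, since the interface may differ.

```r
library(summarytools)

st_options()                  # display the current values of all global options
st_options("round.digits")    # display a single option (assumed option name)

# Assumed two-argument form for changing a value:
# st_options("round.digits", 1)
```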

Overriding formatting attributes

When a summarytools object is stored, its formatting attributes are stored with it. However, you can override most of them when using the print() and view() functions.

Example
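The code for this example is not shown; a sketch consistent with the output below and with the age_stats object mentioned afterwards would be:

```r
library(summarytools)

# age_stats stores a regular freq() result: headings, NA counts and totals
age_stats <- freq(tobacco$age.gr)

# Override formatting at print time only; age_stats itself is not modified
print(age_stats, style = "rmarkdown", report.nas = FALSE,
      totals = FALSE, omit.headings = TRUE)
```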

  Freq % % Cum.
18-34 258 26.46 26.46
35-50 241 24.72 51.18
51-70 317 32.51 83.69
71 + 159 16.31 100.00

Note that the omitted attributes are still part of the age_stats object.

Order of Priority for Options / Parameters

  1. Options overridden explicitly with print() or view() have precedence
  2. Options specified as explicit arguments to freq() / ctable() / descr() / dfSummary() come second
  3. Global options, which can be set with st_options(), come third

Customizing looks with CSS

Version 0.8 of summarytools uses RStudio’s htmltools package and version 4 of Bootstrap’s cascading stylesheets.

It is possible to include your own css if you wish to customize the look of the output tables. See the documentation for the package’s print.summarytools() function for details, but here is a quick example to give you the gist of it.

Example

Say you need to make the font size really small, smaller than by using the st-small class as seen in a previous example. For this, you would create a CSS file - let’s call it “custom.css” - containing a class such as the following:

Then, to apply it to a summarytools object and display it in your browser:
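Both steps can be sketched from R. The class name tiny-text and the 8px size are placeholders chosen for illustration, and the custom.css and table.classes arguments are assumed from the print.summarytools() documentation referred to above.

```r
library(summarytools)

# Step 1: write a small stylesheet defining a hypothetical "tiny-text" class
css_file <- file.path(tempdir(), "custom.css")
writeLines(c(".tiny-text {",
             "  font-size: 8px;",
             "}"), css_file)

# Step 2: apply the class to the output table (argument names are assumptions)
if (interactive()) {
  view(dfSummary(tobacco), custom.css = css_file, table.classes = "tiny-text")
}
```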

Working with shiny apps

To include summarytools functions into shiny apps, it is recommended that you:

Example
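The recommendations and the example are missing here, so the following is only a rough sketch of one way to embed a summarytools table in an app. It assumes the shiny package is installed and that print(..., method = "render") returns html that renderUI() can display; bootstrap.css = FALSE is an assumed print.summarytools() argument, used on the guess that shiny already supplies Bootstrap styling.

```r
library(shiny)
library(summarytools)

ui <- fluidPage(
  htmlOutput("summary_table")   # hypothetical output id
)

server <- function(input, output) {
  output$summary_table <- renderUI({
    # method = "render" returns html instead of opening a viewer
    print(dfSummary(tobacco), method = "render",
          omit.headings = TRUE, bootstrap.css = FALSE)
  })
}

if (interactive()) {
  shinyApp(ui, server)
}
```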

Getting Most Properties of an Object With what.is()

When developing, we often use a number of functions to obtain an object’s properties. what.is() lumps together the results of most of these functions (class(), typeof(), attributes() and others).

what.is(iris)
$properties
      property      value
1        class data.frame
2       typeof       list
3         mode       list
4 storage.mode       list
5          dim    150 x 5
6       length          5
7    is.object       TRUE
8  object.type         S3
9  object.size 7256 Bytes

$attributes.lengths
    names     class row.names 
        5         1       150 

$extensive.is
[1] "is.data.frame" "is.list"       "is.object"     "is.recursive" 
[5] "is.unsorted"  

Limitations

Stay Up-to-date

Check out the project’s page - from there you can see the latest updates and also submit feature requests.

To install the most recent stable version from GitHub (the same as the CRAN release, plus any quick fixes made since):

install.packages('devtools')
library(devtools)
install_github('dcomtois/summarytools')

To install the package in its development version, use

install_github('dcomtois/summarytools', ref='dev-current')

Final notes

The package comes with no guarantees. It is a work in progress and feedback / feature requests are welcome. Just send me an email (dominic.comtois (at) gmail.com), or open an Issue on GitHub if you find a bug.