Introduction to summarytools

Dominic Comtois

2016-12-04

summarytools is an R package providing tools to neatly and quickly summarize data. Its main purpose is to provide functions that many R programmers once wished were included in base R. It also aims at making R a little easier to use for newcomeRs. With a few lines of very simple code, one can get a pretty good first look at the data at hand.

An emphasis has been put on both what is presented and how, so that the package can serve both as a data exploration tool and as a reporting tool that can be used either on its own for minimal reports, or integrated into a larger set of tools such as RStudio’s very good rmarkdown support.

The Three Main Functions

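The package is built around three key functions: freq() for frequency tables, descr() for descriptive (univariate) statistics, and dfSummary() for dataframe summaries.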

Output Options

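All summarytools objects returned by the main functions can be displayed in the R console, rendered in markdown documents, written to text files, or turned into HTML documents shown in a Web browser or in RStudio’s Viewer.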

Additional Features

Bare-Bones Example

To show what default (console) outputs look like, we first generate a frequency table for iris$Species.

> freq(iris$Species)

Frequencies

Dataframe: iris
Variable: Species

                   N   % Valid   % Cum.Valid   % Total   % Cum.Total
---------------- --- --------- ------------- --------- -------------
          setosa  50     33.33         33.33     33.33         33.33
      versicolor  50     33.33         66.67     33.33         66.67
       virginica  50     33.33           100     33.33           100
            <NA>   0        NA            NA         0           100
           Total 150       100           100       100           100
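For readers curious about where these columns come from, here is a minimal base R sketch that rebuilds the same figures for iris$Species (without the Total row). It is an illustration only, not the package's implementation; the names n, valid and freq_sketch are made up for this example.

```r
# Base R sketch of the columns displayed by freq(); illustration only.
x <- iris$Species

# Counts per level, always keeping a slot for missing values
n <- table(x, useNA = "always")
valid <- !is.na(names(n))                  # FALSE only for the <NA> slot

# Percentages over valid (non-missing) values; NA for the <NA> row
pct_valid <- rep(NA_real_, length(n))
pct_valid[valid] <- 100 * as.vector(n[valid]) / sum(n[valid])

# Percentages over all values, missing included
pct_total <- 100 * as.vector(n) / sum(n)

freq_sketch <- data.frame(
  value         = ifelse(valid, names(n), "<NA>"),
  N             = as.vector(n),
  pct_valid     = round(pct_valid, 2),
  pct_cum_valid = round(cumsum(pct_valid), 2),
  pct_total     = round(pct_total, 2),
  pct_cum_total = round(cumsum(pct_total), 2)
)
print(freq_sketch)
```

The two "valid" columns leave missing values out of the denominator, while the two "total" columns count every observation, which is why they only differ when the variable contains NAs.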

To get familiar with the output styles, one can try different values for style= and see how results look in the console.
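One possible way to compare styles side by side is a small loop such as the one below. Only 'rmarkdown' is named in this document; "simple" and "grid" are assumed to be accepted too, since they are standard pander table styles. The guard around the call is there only so the snippet runs even when the package is not installed.

```r
# Try freq() with a few style= values and compare the console output.
# "simple" and "grid" are assumptions; "rmarkdown" is named in this document.
styles <- c("simple", "grid", "rmarkdown")
has_pkg <- requireNamespace("summarytools", quietly = TRUE)

if (has_pkg) {
  for (s in styles) {
    cat("\n--- style = '", s, "' ---\n", sep = "")
    # Explicit print() is needed inside a loop
    print(summarytools::freq(iris$Species, style = s))
  }
} else {
  message("summarytools is not installed; skipping the style comparison")
}
```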

Markdown-Powered Outputs

When using style='rmarkdown' with freq() or descr(), the generated outputs are ready for markdown rendering. With dfSummary(), options for style are “multiline” (default) and “grid”, and plain.ascii=FALSE must be used to have proper line feeds in multiline cells.

Note: When building a document from an .Rmd file with knitr, always set the chunk option results to “asis”:

```{r example, results='asis'}
library(summarytools)
data(tobacco)
dfSummary(tobacco, plain.ascii=FALSE)
```

Descriptive (Univariate) Statistics With descr()

This function accepts both vectors and dataframes; given a dataframe, it shows statistics for all numerical variables and ignores the others. We’ll use one of the datasets included with the package.

> data(exams)
> descr(exams, style='rmarkdown')
Non-numerical variable(s) ignored: gender

Descriptive Statistics

Dataframe: exams

            english  math geography history economics french
----------- ------- ----- --------- ------- --------- ------
       Mean   78.73 77.92     73.48   75.63     77.05  76.69
    Std.Dev    8.53  7.31     10.51     9.3      8.41  11.47
        Min    55.9  60.8      54.2    52.5      60.4     47
        Max    95.7  90.9      94.9    97.6      92.2   98.1
     Median    77.6 77.75     73.75    75.1      77.8  76.75
        mad    6.45  5.86      6.15    6.52      6.97   8.15
        IQR    9.37  7.63      8.12     8.6      10.6   9.92
         CV    9.23 10.66      6.99    8.13      9.16   6.68
   Skewness   -0.15 -0.12      0.03   -0.38         0   0.05
SE.Skewness    0.43  0.43      0.44    0.44      0.43   0.46
   Kurtosis    0.19 -0.43     -0.55    0.91     -0.75   0.27

Observations

        english      math   geography     history   economics      french
----- --------- --------- ----------- ----------- ----------- -----------
Valid 30 (100%) 30 (100%) 28 (93.33%) 28 (93.33%) 29 (96.67%) 26 (86.67%)
 <NA>    0 (0%)    0 (0%)   2 (6.67%)   2 (6.67%)   1 (3.33%)  4 (13.33%)
Total 30 (100%) 30 (100%)   30 (100%)   30 (100%)   30 (100%)   30 (100%)

To display variables in rows and statistics in columns instead, we use transpose=TRUE:

> descr(exams, style = 'rmarkdown', transpose = TRUE)
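To make the two layouts concrete, here is a minimal base R sketch of a descr()-like table built on iris, which ships with R. It is not the package's code: one_var() is a made-up helper, only a subset of the statistics is computed, and CV, skewness and kurtosis are left out on purpose.

```r
# Base R sketch of a descr()-like table; illustration only.
df <- iris

# Keep numeric columns only and report the ones left out
is_num <- vapply(df, is.numeric, logical(1))
message("Non-numerical variable(s) ignored: ",
        paste(names(df)[!is_num], collapse = ", "))
num <- df[is_num]

# Statistics for a single numeric vector (hypothetical helper)
one_var <- function(v) {
  c(Mean    = mean(v, na.rm = TRUE),
    Std.Dev = sd(v, na.rm = TRUE),
    Min     = min(v, na.rm = TRUE),
    Max     = max(v, na.rm = TRUE),
    Median  = median(v, na.rm = TRUE),
    mad     = mad(v, na.rm = TRUE),
    IQR     = IQR(v, na.rm = TRUE))
}

stats <- sapply(num, one_var)     # statistics in rows, variables in columns
print(round(stats, 2))
print(round(t(stats), 2))         # transposed: variables in rows

# Valid (non-missing) observations per variable
n_valid <- colSums(!is.na(num))
print(n_valid)
```

The transposed layout tends to read better when there are many variables and only a handful of statistics.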

Dataframe Summaries With dfSummary()

This is probably the most time-saving function I have ever written… For this one, we can use styles “multiline” (default) or “grid”. We must however specify plain.ascii=FALSE when using markdown; otherwise, line feeds inside multiline cells will not render properly.

> data(tobacco)
> dfSummary(tobacco, plain.ascii = FALSE)

Dataframe Summary

tobacco

---------------------------------------------------------------------------------------------------------------
Variable Properties                 Stats / Values                          Freqs, % Valid      N Valid
-------- -------------------------- --------------------------------------- ------------------- ---------------
gender   type:integer class:factor  1. Man                                  1: 151 (53.7%)      281/300 (93.7%)
                                    2. Woman                                2: 130 (46.3%)

age      type:integer class:integer mean (sd) = 45.2 (16.27)                59 distinct values  281/300 (93.7%)
                                    min < med < max = 18 < 45 < 75
                                    IQR (CV) = 29 (0.36)

age.gr   type:integer class:factor  1. 18-34                                1: 90 (32%)         281/300 (93.7%)
                                    2. 35-50                                2: 79 (28.1%)
                                    3. 51-70                                3: 98 (34.9%)
                                    4. 71 +                                 4: 14 (5%)

BMI      type:double class:numeric  mean (sd) = 25.24 (4.38)                282 distinct values 281/300 (93.7%)
                                    min < med < max = 15.13 < 25.29 < 38.39
                                    IQR (CV) = 5.8 (0.17)

smoker   type:integer class:factor  1. Non Smoker                           1: 215 (73.4%)      293/300 (97.7%)
                                    2. Smoker                               2: 78 (26.6%)

diseased type:integer class:factor  1. Diseased                             1: 36 (13.1%)       274/300 (91.3%)
                                    2. Healthy                              2: 238 (86.9%)
---------------------------------------------------------------------------------------------------------------
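The Properties and N Valid columns can be approximated with a few base R calls, as in the sketch below. It is an illustration only: survey is a made-up data frame standing in for tobacco, and props is a hypothetical name. It also shows why a factor is reported as type:integer, since R stores factors as integer codes.

```r
# Base R sketch of the "Properties" and "N Valid" columns; illustration only.
# `survey` is a made-up data frame standing in for tobacco.
survey <- data.frame(
  gender = factor(c("Man", "Woman", NA, "Woman")),
  age    = c(34L, 51L, 47L, NA),
  BMI    = c(22.5, 27.1, NA, 24.8)
)

props <- data.frame(
  variable  = names(survey),
  type      = vapply(survey, typeof, character(1)),
  class     = vapply(survey, function(v) class(v)[1], character(1)),
  n_valid   = colSums(!is.na(survey)),
  row.names = NULL
)

# Share of non-missing values per variable, in percent
props$pct_valid <- round(100 * props$n_valid / nrow(survey), 1)
print(props)
```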

Using “grid” adds space between cells…

> dfSummary(tobacco, style = 'grid', plain.ascii = FALSE)

Dataframe Summary

tobacco

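+----------+----------------------------+-----------------------------------------+---------------------+-----------------+
| Variable | Properties                 | Stats / Values                          | Freqs, % Valid      | N Valid         |
+==========+============================+=========================================+=====================+=================+
| gender   | type:integer class:factor  | 1. Man                                  | 1: 151 (53.7%)      | 281/300 (93.7%) |
|          |                            | 2. Woman                                | 2: 130 (46.3%)      |                 |
+----------+----------------------------+-----------------------------------------+---------------------+-----------------+
| age      | type:integer class:integer | mean (sd) = 45.2 (16.27)                | 59 distinct values  | 281/300 (93.7%) |
|          |                            | min < med < max = 18 < 45 < 75          |                     |                 |
|          |                            | IQR (CV) = 29 (0.36)                    |                     |                 |
+----------+----------------------------+-----------------------------------------+---------------------+-----------------+
| age.gr   | type:integer class:factor  | 1. 18-34                                | 1: 90 (32%)         | 281/300 (93.7%) |
|          |                            | 2. 35-50                                | 2: 79 (28.1%)       |                 |
|          |                            | 3. 51-70                                | 3: 98 (34.9%)       |                 |
|          |                            | 4. 71 +                                 | 4: 14 (5%)          |                 |
+----------+----------------------------+-----------------------------------------+---------------------+-----------------+
| BMI      | type:double class:numeric  | mean (sd) = 25.24 (4.38)                | 282 distinct values | 281/300 (93.7%) |
|          |                            | min < med < max = 15.13 < 25.29 < 38.39 |                     |                 |
|          |                            | IQR (CV) = 5.8 (0.17)                   |                     |                 |
+----------+----------------------------+-----------------------------------------+---------------------+-----------------+
| smoker   | type:integer class:factor  | 1. Non Smoker                           | 1: 215 (73.4%)      | 293/300 (97.7%) |
|          |                            | 2. Smoker                               | 2: 78 (26.6%)       |                 |
+----------+----------------------------+-----------------------------------------+---------------------+-----------------+
| diseased | type:integer class:factor  | 1. Diseased                             | 1: 36 (13.1%)       | 274/300 (91.3%) |
|          |                            | 2. Healthy                              | 2: 238 (86.9%)      |                 |
+----------+----------------------------+-----------------------------------------+---------------------+-----------------+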

Redirecting Output

Text/Markdown Documents

With the file= parameter, we can redirect output to a text file. Setting append=TRUE appends the results to an existing text file:

> dfSummary(tobacco, file="tobacco.txt", style = "grid")  # Creates tobacco.txt
> descr(tobacco, file="tobacco.txt", append = TRUE)  # Appends results to tobacco.txt
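The create-then-append pattern can be tried without touching any real file. The sketch below is a base R analogy that uses capture.output(), whose file= and append= arguments happen to carry the same names. It does not call summarytools, and out_file is a hypothetical temporary path.

```r
# Base R analogy for file= and append=; this does not call summarytools.
out_file <- tempfile(fileext = ".txt")   # hypothetical output path

# First write creates the file
capture.output(summary(iris), file = out_file)
n_first <- length(readLines(out_file))

# Second write with append = TRUE adds to the end of the existing file
capture.output(str(iris), file = out_file, append = TRUE)
n_both <- length(readLines(out_file))

cat("Lines after first write:", n_first, "\n")
cat("Lines after appending:  ", n_both, "\n")
```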

HTML Documents

summarytools uses Bootstrap’s stylesheets to generate standalone HTML documents that can be displayed in a Web Browser or in RStudio’s Viewer using the generic print() function:

> print(dfSummary(tobacco), method = "browser")  # Displays results in default Web Browser
> print(dfSummary(tobacco), method = "viewer")   # Displays results in RStudio's Viewer
> view(dfSummary(tobacco))                       # Same as line above -- view() is a wrapper function

Using the file= argument with an .html extension simply generates an HTML document, without opening it:

> dfSummary(tobacco, file = "~/Documents/tobacco_summary.html")

Resulting document:

(Image: dfSummary in HTML format)

Removing Temporary Files With cleartmp()

When calling print() or view(), a temporary HTML file is created in R’s temporary directory. To delete the most recent such file, use cleartmp(); to remove all temporary files generated during the session, use cleartmp('all').
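For readers unfamiliar with R's temporary directory, the following base R sketch mimics the idea of creating HTML files there and cleaning them up afterwards. It is not how cleartmp() is implemented; the "st_demo_" prefix and the variable names are made up for this example.

```r
# Base R sketch of temporary HTML files and their cleanup; illustration only.
# The "st_demo_" prefix is made up for this example.
tmp1 <- tempfile(pattern = "st_demo_", fileext = ".html")
tmp2 <- tempfile(pattern = "st_demo_", fileext = ".html")
writeLines("<html><body>first report</body></html>", tmp1)
writeLines("<html><body>second report</body></html>", tmp2)

# Remove only the most recent file (in the spirit of cleartmp())
file.remove(tmp2)

# Remove every remaining file from this example (in the spirit of cleartmp('all'))
leftover <- list.files(tempdir(), pattern = "^st_demo_.*\\.html$", full.names = TRUE)
file.remove(leftover)

remaining <- list.files(tempdir(), pattern = "^st_demo_.*\\.html$")
print(length(remaining))
```

Files under tempdir() are normally discarded when the R session ends, so explicit cleanup mostly matters in long-running sessions.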

Getting Most Properties of an Object With what.is()

When developing, we often use a number of functions to obtain an object’s properties. what.is() lumps together the results of such functions (class(), typeof(), attributes() and others).

> what.is(iris)
$properties
      property      value
1        class data.frame
2       typeof       list
3         mode       list
4 storage.mode       list
5          dim    150 x 5
6       length          5
7    is.object       TRUE
8  object.type         S3
9  object.size 7088 Bytes

$attributes.lengths
    names row.names     class 
        5       150         1 

$extensive.is
[1] "is.data.frame" "is.list"       "is.object"     "is.recursive" 
[5] "is.unsorted"  
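The general idea can be sketched in a few lines of base R. The helper below, describe_object(), is made up for this example and is not part of summarytools. It covers only some of the properties shown above and tests a small, hand-picked set of is.*() functions, whereas what.is() checks many more.

```r
# Base R sketch in the spirit of what.is(); illustration only.
# describe_object() is a made-up helper, not part of summarytools.
describe_object <- function(x) {
  properties <- data.frame(
    property = c("class", "typeof", "mode", "storage.mode",
                 "dim", "length", "is.object"),
    value = c(paste(class(x), collapse = " "), typeof(x), mode(x),
              storage.mode(x), paste(dim(x), collapse = " x "),
              length(x), is.object(x)),
    stringsAsFactors = FALSE
  )

  # A small, hand-picked set of is.*() tests
  tests <- c("is.data.frame", "is.list", "is.recursive",
             "is.numeric", "is.function")
  passed <- vapply(tests, function(f) isTRUE(do.call(f, list(x))), logical(1))

  list(properties         = properties,
       attributes.lengths = lengths(attributes(x)),
       is.tests           = tests[passed])
}

res <- describe_object(iris)
print(res)
```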

Learn more and stay up-to-date

Check the project’s page for more examples; from there you can also submit feature requests or report any problems you encounter.

To install the package in its development version, use devtools::install_github('dcomtois/summarytools').

Final note

The source of this document is an .Rmd file; knitr’s chunk option results has been set to 'asis', to make sure formatting is not coming from knitr itself.