summarytools is an R package providing tools to neatly and quickly summarize data. Its main purpose is to provide functions that many R programmers once wished were included in base R. It also aims at making R a little easier to use for newcomeRs. With a few lines of very simple code, one can get a pretty good first look at the data at hand.
Emphasis has been put on both what is presented and how it is presented, so that the package can serve both as a data exploration tool and as a reporting tool that can be used either on its own for minimal reports, or integrated into a larger set of tools such as RStudio’s very good rmarkdown support.
The package is built around three key functions:

- freq(): Frequency tables with proportions, cumulative proportions and missing data information
- descr(): Descriptive (univariate) statistics for numerical vectors, more thorough than fivenum() and other similar functions
- dfSummary(): Dataframe summaries that facilitate data cleaning and firsthand evaluation

All summarytools objects returned by the main functions can be:

- Written to files using the file= argument
- Displayed using the print() or view() functions

In addition, freq() and descr() support sampling weights, and variable labels are supported (Hmisc::label).
There is also what.is(), which combines results from functions such as class(), attributes(), typeof() and others to give an extensive description of an object’s properties. It also checks the object against most is.*() functions and returns the list of matching elements.
To show what default (console) outputs look like, we first generate a frequency table for iris$Species.
> freq(iris$Species)
Frequencies
Dataframe: iris
Variable: Species
                   N   % Valid   % Cum.Valid   % Total   % Cum.Total
---------------- --- --------- ------------- --------- -------------
          setosa  50     33.33         33.33     33.33         33.33
      versicolor  50     33.33         66.67     33.33         66.67
       virginica  50     33.33           100     33.33           100
            <NA>   0        NA            NA         0           100
           Total 150       100           100       100           100
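To make the meaning of each column concrete, here is a minimal base R sketch that recomputes the same figures by hand. It is not how freq() is implemented; the helper name freq_sketch and its column names are made up for this illustration.

```r
# Hypothetical helper (not part of summarytools): rebuilds the columns of a
# frequency table with base R only.
freq_sketch <- function(x) {
  counts    <- table(x, useNA = "always")   # counts per level, plus an <NA> row
  n         <- as.vector(counts)
  lev       <- names(counts)
  is_na_row <- is.na(lev)

  # "Valid" percentages ignore missing values entirely
  valid_n   <- ifelse(is_na_row, 0, n)
  pct_valid <- 100 * valid_n / sum(valid_n)
  cum_valid <- cumsum(pct_valid)
  pct_valid[is_na_row] <- NA
  cum_valid[is_na_row] <- NA

  # "Total" percentages use every observation, missing or not
  pct_total <- 100 * n / length(x)

  data.frame(
    level     = ifelse(is_na_row, "<NA>", lev),
    N         = n,
    pct_valid = round(pct_valid, 2),
    cum_valid = round(cum_valid, 2),
    pct_total = round(pct_total, 2),
    cum_total = round(cumsum(pct_total), 2),
    stringsAsFactors = FALSE
  )
}

res <- freq_sketch(iris$Species)
print(res)
```

The valid and total percentages coincide here only because iris$Species has no missing values; with NAs present, the two sets of columns would diverge.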
To get familiar with the output styles, try different values for style= and see how the results look in the console.
When using style='rmarkdown' with freq() or descr(), the generated outputs are ready for markdown rendering. With dfSummary(), the options for style are “multiline” (default) and “grid”, and plain.ascii=FALSE must be used to get proper line feeds in multiline cells.
Note: When building a document from an .Rmd file with knitr, always set the chunk option results to 'asis':
```{r example, results='asis'}
library(summarytools)
data(tobacco)
dfSummary(tobacco, plain.ascii=FALSE)
```
descr()
This function accepts both vectors and dataframes. Given a dataframe, it shows statistics for all of its numerical variables. We’ll use one of the datasets included with the package.
> data(exams)
> descr(exams, style='rmarkdown')
Non-numerical variable(s) ignored: gender
| | english | math | geography | history | economics | french |
|---|---|---|---|---|---|---|
| Mean | 78.73 | 77.92 | 73.48 | 75.63 | 77.05 | 76.69 |
| Std.Dev | 8.53 | 7.31 | 10.51 | 9.3 | 8.41 | 11.47 |
| Min | 55.9 | 60.8 | 54.2 | 52.5 | 60.4 | 47 |
| Max | 95.7 | 90.9 | 94.9 | 97.6 | 92.2 | 98.1 |
| Median | 77.6 | 77.75 | 73.75 | 75.1 | 77.8 | 76.75 |
| mad | 6.45 | 5.86 | 6.15 | 6.52 | 6.97 | 8.15 |
| IQR | 9.37 | 7.63 | 8.12 | 8.6 | 10.6 | 9.92 |
| CV | 9.23 | 10.66 | 6.99 | 8.13 | 9.16 | 6.68 |
| Skewness | -0.15 | -0.12 | 0.03 | -0.38 | 0 | 0.05 |
| SE.Skewness | 0.43 | 0.43 | 0.44 | 0.44 | 0.43 | 0.46 |
| Kurtosis | 0.19 | -0.43 | -0.55 | 0.91 | -0.75 | 0.27 |

| | english | math | geography | history | economics | french |
|---|---|---|---|---|---|---|
| Valid | 30 (100%) | 30 (100%) | 28 (93.33%) | 28 (93.33%) | 29 (96.67%) | 26 (86.67%) |
| <NA> | 0 (0%) | 0 (0%) | 2 (6.67%) | 2 (6.67%) | 1 (3.33%) | 4 (13.33%) |
| Total | 30 (100%) | 30 (100%) | 30 (100%) | 30 (100%) | 30 (100%) | 30 (100%) |
To see variables in rows and statistics in columns instead, use transpose=TRUE:
> descr(exams, style = 'rmarkdown', transpose = TRUE)
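As a rough illustration of the two layouts, the base R sketch below computes a handful of the same statistics for every numeric column of a dataframe and then flips the result with t(). It only mirrors the idea behind descr() and transpose=TRUE; the helper name descr_sketch is hypothetical, and the built-in iris data stands in for exams so the code runs without the package.

```r
# Hypothetical helper (not part of summarytools): a few univariate statistics
# for each numeric column, with missing values removed first.
descr_sketch <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]   # non-numeric columns are ignored
  sapply(num, function(v) {
    v <- v[!is.na(v)]
    c(Mean    = mean(v),
      Std.Dev = sd(v),
      Min     = min(v),
      Max     = max(v),
      Median  = median(v),
      mad     = mad(v),
      IQR     = IQR(v))
  })
}

stats <- descr_sketch(iris)   # statistics in rows, variables in columns
print(round(stats, 2))

stats_t <- t(stats)           # variables in rows, statistics in columns
print(round(stats_t, 2))
```

The Species column is dropped silently here, whereas descr() reports ignored non-numerical variables, as the "Non-numerical variable(s) ignored" message above shows.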
dfSummary()
This is probably the most time-saving function I have ever written. For this one, we can use the styles “multiline” (default) or “grid”. We must, however, specify plain.ascii=FALSE when using markdown; otherwise line feeds in multiline cells will not render properly.
> data(tobacco)
> dfSummary(tobacco, plain.ascii = FALSE)
Variable | Properties | Stats / Values | Freqs, % Valid | N Valid |
---|---|---|---|---|
gender | type:integer class:factor | 1. Man 2. Woman | 1: 151 (53.7%) 2: 130 (46.3%) | 281/300 (93.7%) |
age | type:integer class:integer | mean (sd) = 45.2 (16.27) min < med < max = 18 < 45 < 75 IQR (CV) = 29 (0.36) | 59 distinct values | 281/300 (93.7%) |
age.gr | type:integer class:factor | 1. 18-34 2. 35-50 3. 51-70 4. 71 + | 1: 90 (32%) 2: 79 (28.1%) 3: 98 (34.9%) 4: 14 (5%) | 281/300 (93.7%) |
BMI | type:double class:numeric | mean (sd) = 25.24 (4.38) min < med < max = 15.13 < 25.29 < 38.39 IQR (CV) = 5.8 (0.17) | 282 distinct values | 281/300 (93.7%) |
smoker | type:integer class:factor | 1. Non Smoker 2. Smoker | 1: 215 (73.4%) 2: 78 (26.6%) | 293/300 (97.7%) |
diseased | type:integer class:factor | 1. Diseased 2. Healthy | 1: 36 (13.1%) 2: 238 (86.9%) | 274/300 (91.3%) |
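The Properties and N Valid columns can be approximated with a few base R calls. The sketch below is only meant to show where values such as "type:integer class:factor" come from; df_overview and the toy dataframe are hypothetical and not part of summarytools.

```r
# Hypothetical helper (not part of summarytools): one row per column, with its
# storage type, class, number of distinct values and number of valid values.
df_overview <- function(df) {
  data.frame(
    Variable = names(df),
    Type     = vapply(df, typeof, character(1)),
    Class    = vapply(df, function(v) paste(class(v), collapse = ", "), character(1)),
    Distinct = vapply(df, function(v) length(unique(v[!is.na(v)])), integer(1)),
    N.Valid  = vapply(df, function(v) sum(!is.na(v)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data, made up for this example
toy <- data.frame(
  g = factor(c("a", "b", NA, "a")),
  x = c(1.5, NA, 3, 4)
)

overview <- df_overview(toy)
print(overview)
```

A factor is stored as integer codes, which is why typeof() and class() disagree for it, just as they do for gender, age.gr, smoker and diseased in the table above.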
Using “grid” adds space between cells…
> dfSummary(tobacco, style = 'grid', plain.ascii = FALSE)
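Variable | Properties | Stats / Values | Freqs, % Valid | N Valid |
---|---|---|---|---|
gender | type:integer class:factor | 1. Man 2. Woman | 1: 151 (53.7%) 2: 130 (46.3%) | 281/300 (93.7%) |
age | type:integer class:integer | mean (sd) = 45.2 (16.27) min < med < max = 18 < 45 < 75 IQR (CV) = 29 (0.36) | 59 distinct values | 281/300 (93.7%) |
age.gr | type:integer class:factor | 1. 18-34 2. 35-50 3. 51-70 4. 71 + | 1: 90 (32%) 2: 79 (28.1%) 3: 98 (34.9%) 4: 14 (5%) | 281/300 (93.7%) |
BMI | type:double class:numeric | mean (sd) = 25.24 (4.38) min < med < max = 15.13 < 25.29 < 38.39 IQR (CV) = 5.8 (0.17) | 282 distinct values | 281/300 (93.7%) |
smoker | type:integer class:factor | 1. Non Smoker 2. Smoker | 1: 215 (73.4%) 2: 78 (26.6%) | 293/300 (97.7%) |
diseased | type:integer class:factor | 1. Diseased 2. Healthy | 1: 36 (13.1%) 2: 238 (86.9%) | 274/300 (91.3%) |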
With the file= parameter, we can redirect output into text files. Setting append=TRUE appends the results to an existing text file:
> dfSummary(tobacco, file="tobacco.txt", style = "grid") # Creates tobacco.txt
> descr(tobacco, file="tobacco.txt", append = TRUE) # Appends results to tobacco.txt
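For readers unfamiliar with the create-then-append pattern, the following base R sketch reproduces it with write() and capture.output(). It is an analogy under stated assumptions, not the package's own mechanism: the temporary file stands in for tobacco.txt, and summary() and table() stand in for the summarytools functions.

```r
# A throwaway path used for illustration only
out_file <- tempfile(fileext = ".txt")

# First call creates the file (append = FALSE is the default, so any existing
# content would be overwritten)
write(capture.output(summary(iris$Sepal.Length)), file = out_file)

# Second call adds its output at the end of the same file
write(capture.output(table(iris$Species)), file = out_file, append = TRUE)

lines <- readLines(out_file)
cat(lines, sep = "\n")
```

Forgetting append = TRUE on the second call would leave only the frequency counts in the file, which is the usual pitfall with this pattern.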
summarytools uses Bootstrap’s stylesheets to generate standalone HTML documents that can be displayed in a web browser or in RStudio’s Viewer using the generic print() function:
> print(dfSummary(tobacco), method = "browser") # Displays results in default Web Browser
> print(dfSummary(tobacco), method = "viewer") # Displays results in RStudio's Viewer
> view(dfSummary(tobacco)) # Same as line above -- view() is a wrapper function
Using the file= argument with an .html extension simply generates an HTML document, without opening it.
> dfSummary(tobacco, file = "~/Documents/tobacco_summary.html")
cleartmp()
When calling print() or view(), a temporary HTML file is created in R’s temporary directory. To delete the last such file, use cleartmp(); to remove all temporary files generated in the session, use cleartmp('all').
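The idea of removing either the latest temporary file or all of them can be sketched with base R alone. This is only an illustration of the housekeeping involved; the "sketch_" file prefix and the variable names are assumptions made for the example and say nothing about how cleartmp() tracks its files.

```r
# Create two placeholder HTML files in R's temporary directory
made <- character(0)
for (i in 1:2) {
  f <- tempfile(pattern = "sketch_", fileext = ".html")
  writeLines("<html><body><p>placeholder</p></body></html>", f)
  made <- c(made, f)
}

# Remove only the most recently created file
last_file <- made[length(made)]
unlink(last_file)
made <- made[-length(made)]

# Remove everything that is left with the same prefix
leftovers <- list.files(tempdir(), pattern = "^sketch_.*\\.html$", full.names = TRUE)
unlink(leftovers)

remaining <- list.files(tempdir(), pattern = "^sketch_.*\\.html$")
print(length(remaining))
```

R deletes its temporary directory when the session ends, so explicit cleanup mostly matters in long-running sessions that generate many reports.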
what.is()
When developing, we often use a number of functions to obtain an object’s properties. what.is() lumps together the results of such functions (class(), typeof(), attributes() and others).
> what.is(iris)
$properties
      property      value
1        class data.frame
2       typeof       list
3         mode       list
4 storage.mode       list
5          dim    150 x 5
6       length          5
7    is.object       TRUE
8  object.type         S3
9  object.size 7088 Bytes

$attributes.lengths
    names row.names     class
        5       150         1

$extensive.is
[1] "is.data.frame" "is.list"       "is.object"     "is.recursive"
[5] "is.unsorted"
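A much-reduced version of the same idea fits in a few lines of base R. The sketch below gathers some properties and tests the object against a short, hand-picked list of is.* predicates; describe_sketch and that list are hypothetical, and what.is() itself covers far more ground.

```r
# Hypothetical helper (not part of summarytools): collects a few properties
# and reports which of a fixed set of is.* predicates return TRUE.
describe_sketch <- function(x) {
  predicates <- c("is.data.frame", "is.list", "is.numeric", "is.character",
                  "is.function", "is.atomic", "is.recursive", "is.object")
  hits <- vapply(predicates,
                 function(f) isTRUE(do.call(f, list(x))),
                 logical(1))
  list(
    properties = c(class        = paste(class(x), collapse = ", "),
                   typeof       = typeof(x),
                   mode         = mode(x),
                   storage.mode = storage.mode(x),
                   length       = as.character(length(x))),
    matching.is = names(hits)[hits]
  )
}

info <- describe_sketch(iris)
print(info)
```

Using a fixed list avoids the complications of calling every is.* function blindly, since some of them take more than one argument or are methods for specific classes.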
Check the project’s page for more examples; from there you can also submit feature requests or report problems you encounter.
To install the development version of the package, use devtools::install_github('dcomtois/summarytools').
The source of this document is an .Rmd file; knitr’s chunk option results has been set to 'asis', so that the formatting shown comes from summarytools rather than from knitr itself.