skimr is designed to provide summary statistics about variables. It is opinionated in its defaults, but easy to modify.
In base R, the most similar functions are summary() for vectors and data frames and fivenum() for numeric vectors:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
summary(iris$Sepal.Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
fivenum(iris$Sepal.Length)
## [1] 4.3 5.1 5.8 6.4 7.9
summary(iris$Species)
## setosa versicolor virginica
## 50 50 50
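For iris$Sepal.Length the two functions agree, but this is not guaranteed: fivenum() reports Tukey's hinges, while summary() reports quartiles computed by quantile(). A small sketch, using a four-element vector chosen purely for illustration, shows where they differ:

```r
# Illustrative vector, not taken from the examples above.
x <- c(1, 2, 3, 4)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum.
hinges <- fivenum(x)

# Quartiles as computed by quantile() with its default settings,
# which is what summary() reports for numeric vectors.
quartiles <- unname(quantile(x, probs = c(0.25, 0.75)))

print(hinges)     # 1.0 1.5 2.5 3.5 4.0
print(quartiles)  # 1.75 3.25
```

The quartile values match the p25 and p75 columns that skim() prints for a column holding 1 to 4 later in this document.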
skim function
The core function of skimr is skim(). skim() is an S3 generic function, with methods for data frames, grouped data frames and vectors. Like summary(), skim()’s method for data frames presents results for every column; the statistics it provides depend on the class of the variable.
By design, the main focus of skimr is on data frames; it is intended to fit well within a data pipeline and relies extensively on tidyverse vocabulary, which focuses on data frames.
Results of skim() are printed horizontally, with one section per variable type and one row per variable. Results are returned from skim() as a long tibble of class skim_df, with one row per combination of variable and summary statistic.
library(skimr)
skim(iris)
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:factor ─────────────────────────────────────────────
## variable missing complete n n_unique top_counts
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0
## ordered
## FALSE
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4
## hist
## ▇▁▁▂▅▅▃▁
## ▇▁▁▅▃▃▂▂
## ▂▇▅▇▆▅▂▂
## ▁▂▅▇▃▂▁▁
This is in contrast to summary.data.frame(), which stores statistics in a table. The distinction is important, because the skim_df object is pipeable and easy to use for additional manipulation: for example, the user could select all of the variable means, or all summary statistics for a specific variable.
skim(iris) %>%
dplyr::filter(stat == "mean")
## # A tibble: 4 x 6
## variable type stat level value formatted
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Sepal.Length numeric mean .all 5.84 5.84
## 2 Sepal.Width numeric mean .all 3.06 3.06
## 3 Petal.Length numeric mean .all 3.76 3.76
## 4 Petal.Width numeric mean .all 1.20 1.2
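To make the long layout concrete, here is a minimal base R sketch that builds a data frame with one row per combination of variable and statistic and then subsets it. The helper long_summary() is hypothetical and is not how skimr is implemented; it only mimics the variable, stat and value columns for numeric columns.

```r
# Hypothetical helper: one row per (variable, statistic) pair for numeric columns.
long_summary <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  stats <- list(mean = mean, sd = sd)
  rows <- lapply(names(num), function(v) {
    data.frame(
      variable = v,
      stat = names(stats),
      value = unname(vapply(stats, function(f) f(num[[v]]), numeric(1))),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

long <- long_summary(iris)

# Because each statistic sits in its own row, ordinary subsetting picks out
# every mean, or every statistic for a single variable.
means <- long[long$stat == "mean", ]
sepal_length <- long[long$variable == "Sepal.Length", ]
print(means)
```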
The skim_df object always contains 6 columns:
variable: name of the original variable
type: class of the variable
stat: name of the summary statistic (becomes the column name when the object is printed)
level: used when a summary function returns multiple values when skimming; for example, counts of levels for factor variables, or when setting multiple values to the probs argument of the quantile() function
value: actual calculated value of the statistic; always numeric and should be used for further calculations
formatted: formatted character version of value; attempts to use a reasonable number of digits (decimal aligned) and puts values like dates into human readable format
s <- skim(iris)
head(s, 15)
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Sepal.Length 0 150 150 5.84 NA.83 NA.3 NA.1 NA.8 NA.4 NA.9
## Sepal.Width 0 150 150 3.06 NA NA NA NA NA NA
## hist
## ▂▇▅▇▆▅▂▂
## <NA>
skim() also supports grouped data. In this case, one additional column for each grouping variable is added to the skim_df object.
mtcars %>%
dplyr::group_by(gear) %>%
skim()
## Skim summary statistics
## n obs: 32
## n variables: 11
## group variables: gear
##
## ── Variable type:numeric ────────────────────────────────────────────
## gear variable missing complete n mean sd p0 p25 p50
## 3 am 0 15 15 0 0 0 0 0
## 3 carb 0 15 15 2.67 1.18 1 2 3
## 3 cyl 0 15 15 7.47 1.19 4 8 8
## 3 disp 0 15 15 326.3 94.85 120.1 275.8 318
## 3 drat 0 15 15 3.13 0.27 2.76 3.04 3.08
## 3 hp 0 15 15 176.13 47.69 97 150 180
## 3 mpg 0 15 15 16.11 3.37 10.4 14.5 15.5
## 3 qsec 0 15 15 17.69 1.35 15.41 17.04 17.42
## 3 vs 0 15 15 0.2 0.41 0 0 0
## 3 wt 0 15 15 3.89 0.83 2.46 3.45 3.73
## 4 am 0 12 12 0.67 0.49 0 0 1
## 4 carb 0 12 12 2.33 1.3 1 1 2
## 4 cyl 0 12 12 4.67 0.98 4 4 4
## 4 disp 0 12 12 123.02 38.91 71.1 78.92 130.9
## 4 drat 0 12 12 4.04 0.31 3.69 3.9 3.92
## 4 hp 0 12 12 89.5 25.89 52 65.75 94
## 4 mpg 0 12 12 24.53 5.28 17.8 21 22.8
## 4 qsec 0 12 12 18.96 1.61 16.46 18.46 18.75
## 4 vs 0 12 12 0.83 0.39 0 1 1
## 4 wt 0 12 12 2.62 0.63 1.61 2.13 2.7
## 5 am 0 5 5 1 0 1 1 1
## 5 carb 0 5 5 4.4 2.61 2 2 4
## 5 cyl 0 5 5 6 2 4 4 6
## 5 disp 0 5 5 202.48 115.49 95.1 120.3 145
## 5 drat 0 5 5 3.92 0.39 3.54 3.62 3.77
## 5 hp 0 5 5 195.6 102.83 91 113 175
## 5 mpg 0 5 5 21.38 6.66 15 15.8 19.7
## 5 qsec 0 5 5 15.64 1.13 14.5 14.6 15.5
## 5 vs 0 5 5 0.2 0.45 0 0 0
## 5 wt 0 5 5 2.63 0.82 1.51 2.14 2.77
## p75 p100 hist
## 0 0 ▁▁▁▇▁▁▁▁
## 4 4 ▅▁▆▁▁▅▁▇
## 8 8 ▁▁▁▁▁▁▁▇
## 380 472 ▂▁▂▇▃▆▂▆
## 3.18 3.73 ▃▃▇▆▁▁▁▃
## 210 245 ▅▁▃▁▇▂▂▅
## 18.4 21.5 ▃▁▃▇▃▃▂▃
## 17.99 20.22 ▃▁▆▇▆▁▂▃
## 0 1 ▇▁▁▁▁▁▁▂
## 3.96 5.42 ▁▁▇▅▁▁▁▃
## 1 1 ▃▁▁▁▁▁▁▇
## 4 4 ▇▁▇▁▁▁▁▇
## 6 6 ▇▁▁▁▁▁▁▃
## 160 167.6 ▇▁▁▂▂▂▂▇
## 4.09 4.93 ▁▇▃▁▁▁▁▁
## 110 123 ▂▇▁▁▃▁▆▃
## 28.08 33.9 ▅▇▅▂▂▁▂▅
## 19.58 22.9 ▃▁▇▆▃▁▁▂
## 1 1 ▂▁▁▁▁▁▁▇
## 3.16 3.44 ▇▃▃▃▃▇▇▇
## 1 1 ▁▁▁▇▁▁▁▁
## 6 8 ▇▁▃▁▁▃▁▃
## 8 8 ▇▁▁▃▁▁▁▇
## 301 351 ▇▃▁▁▁▁▃▃
## 4.22 4.43 ▇▁▃▁▁▁▃▃
## 264 335 ▇▁▃▁▁▃▁▃
## 26 30.4 ▇▁▃▁▁▃▁▃
## 16.7 16.9 ▇▁▁▃▁▁▁▇
## 0 1 ▇▁▁▁▁▁▁▂
## 3.17 3.57 ▇▁▇▁▇▁▇▇
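One way to sanity-check a grouped summary is to recompute a single statistic with base R. The sketch below uses aggregate() to compute the mean of mpg within each level of gear; it is a cross-check only and says nothing about how skim() computes its grouped results.

```r
# Mean mpg per gear group, rounded to two decimals for comparison
# with the mean column of the mpg rows printed above.
by_gear <- aggregate(mpg ~ gear, data = mtcars, FUN = mean)
by_gear$mpg <- round(by_gear$mpg, 2)
print(by_gear)
```

The three values should agree with the mean column of the mpg rows for gear 3, 4 and 5 in the grouped output.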
Individual columns from a data frame may be selected using tidyverse-style selectors.
skim(iris, Sepal.Length, Species)
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:factor ─────────────────────────────────────────────
## variable missing complete n n_unique top_counts
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0
## ordered
## FALSE
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▂▇▅▇▆▅▂▂
skim(iris, starts_with("Sepal"))
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▂▇▅▇▆▅▂▂
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4 ▁▂▅▇▃▂▁▁
If an individual column is of an unsupported class, it is treated as a character variable with a warning.
skim() also handles individual vectors that are not part of a data frame. For example, the lynx data set is class ts.
skim(datasets::lynx)
##
## Skim summary statistics
##
## ── Variable type:ts ─────────────────────────────────────────────────
## variable missing complete n start end frequency deltat mean
## datasets::lynx 0 114 114 1821 1934 1 1 1538.02
## sd min max median line_graph
## 1585.84 39 6991 771 ⡈⢄⡠⢁⣀⠒⣀⠔
If you attempt to use skim() on a class that does not have support, it will coerce it to character (with a warning) and report the number of NAs; the number complete (non-missing); the number of rows; the number of empty strings (i.e. ""); the minimum and maximum lengths of non-empty strings; and the number of unique values.
lynx <- datasets::lynx
class(lynx) <- "unknown_class"
skim(lynx)
## Warning: No summary functions for vectors of class: unknown_class.
## Coercing to character
##
## Skim summary statistics
##
## ── Variable type:character ──────────────────────────────────────────
## variable missing complete n min max empty n_unique
## lynx 0 114 114 2 4 0 110
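The character statistics listed above can be approximated by hand in base R, which may help when interpreting the fallback output. This is a sketch of the idea, assuming a plain as.character() coercion; the vector name char_stats is made up for illustration and skimr's own functions may differ in detail.

```r
# Coerce the time series to character, as the fallback is described as doing.
x <- as.character(datasets::lynx)
non_empty <- x[!is.na(x) & x != ""]

char_stats <- c(
  missing  = sum(is.na(x)),               # number of NAs
  complete = sum(!is.na(x)),              # number of non-missing values
  n        = length(x),                   # number of rows
  empty    = sum(x == "", na.rm = TRUE),  # number of empty strings
  min      = min(nchar(non_empty)),       # shortest non-empty string
  max      = max(nchar(non_empty)),       # longest non-empty string
  n_unique = length(unique(x))            # number of distinct values
)
print(char_stats)
```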
skimr does not include a skim.matrix function in order to preserve the ability to handle matrices in flexible ways (in contrast to summary.matrix()). Three possible ways to handle matrices with skim() parallel the three variations of the mean function for matrices.
m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 4, ncol = 3)
m
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
colMeans(m)
## [1] 2.5 6.5 10.5
rowMeans(m)
## [1] 5 6 7 8
mean(m)
## [1] 6.5
skim(as.data.frame(m)) # Similar to summary.matrix and colMeans()
## Skim summary statistics
## n obs: 4
## n variables: 3
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## V1 0 4 4 2.5 1.29 1 1.75 2.5 3.25 4 ▇▁▇▁▁▇▁▇
## V2 0 4 4 6.5 1.29 5 5.75 6.5 7.25 8 ▇▁▇▁▁▇▁▇
## V3 0 4 4 10.5 1.29 9 9.75 10.5 11.25 12 ▇▁▇▁▁▇▁▇
skim(as.data.frame(t(m))) # Similar to rowMeans()
## Skim summary statistics
## n obs: 3
## n variables: 4
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## V1 0 3 3 5 4 1 3 5 7 9 ▇▁▁▇▁▁▁▇
## V2 0 3 3 6 4 2 4 6 8 10 ▇▁▁▇▁▁▁▇
## V3 0 3 3 7 4 3 5 7 9 11 ▇▁▁▇▁▁▁▇
## V4 0 3 3 8 4 4 6 8 10 12 ▇▁▁▇▁▁▁▇
skim(c(m)) # Similar to mean()
##
## Skim summary statistics
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## c(m) 0 12 12 6.5 3.61 1 3.75 6.5 9.25 12 ▇▃▇▃▃▇▃▇
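The same three views of a matrix (by column, by row, and as one vector) can be reproduced for any single statistic with apply(). The sketch below recomputes the standard deviation in base R as a cross-check of the sd column in the three outputs above.

```r
m <- matrix(1:12, nrow = 4, ncol = 3)

col_sd <- apply(m, 2, sd)  # one value per column, like skim(as.data.frame(m))
row_sd <- apply(m, 1, sd)  # one value per row, like skim(as.data.frame(t(m)))
all_sd <- sd(c(m))         # a single value, like skim(c(m))

print(round(col_sd, 2))  # 1.29 1.29 1.29
print(round(row_sd, 2))  # 4 4 4 4
print(round(all_sd, 2))  # 3.61
```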
You can skim a single row or column in the same way as any vector.
skim(m[,1])
##
## Skim summary statistics
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## m[, 1] 0 4 4 2.5 1.29 1 1.75 2.5 3.25 4 ▇▁▇▁▁▇▁▇
skim(m[1,])
##
## Skim summary statistics
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## m[1, ] 0 3 3 5 4 1 3 5 7 9 ▇▁▁▇▁▁▁▇
skim()
skim() for a data frame returns a long, six-column data frame. This long data frame is printed horizontally as a separate summary for each data type found in the data frame, but the object itself is not transformed during the print.
Three other functions are available that may prove useful as part of skim() workflows:
skim_tee() produces the same printed version as skim() but returns the original, unmodified data frame. This allows for continued piping of the original data.
iris_setosa <- iris %>%
  skim_tee() %>%
  dplyr::filter(Species == "setosa")
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:factor ─────────────────────────────────────────────
## variable missing complete n n_unique top_counts
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0
## ordered
## FALSE
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4
## hist
## ▇▁▁▂▅▅▃▁
## ▇▁▁▅▃▃▂▂
## ▂▇▅▇▆▅▂▂
## ▁▂▅▇▃▂▁▁
skim_to_list() returns a named list of the wide data frames for each data type. These data frames contain the formatted, character values, meaning that they are most useful for display. In general, users will want to store the results in an object for further handling.
iris %>% skim_to_list()
## $factor
## # A tibble: 1 x 7
## variable missing complete n n_unique top_counts ordered
## * <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Species 0 150 150 3 set: 50, ver: 50, vir: … FALSE
##
## $numeric
## # A tibble: 4 x 12
## variable missing complete n mean sd p0 p25 p50 p75 p100
## * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Petal.L… 0 150 150 3.76 1.77 "1 " 1.6 4.35 5.1 6.9
## 2 Petal.W… 0 150 150 "1.2… 0.76 0.1 0.3 "1.3… 1.8 2.5
## 3 Sepal.L… 0 150 150 5.84 0.83 4.3 5.1 "5.8… 6.4 7.9
## 4 Sepal.W… 0 150 150 3.06 0.44 "2 " 2.8 "3 … 3.3 4.4
## # … with 1 more variable: hist <chr>
iris_skimmed <- iris %>% skim_to_list()
iris_skimmed[["numeric"]] %>% dplyr::select(mean, sd)
## # A tibble: 4 x 2
## mean sd
## * <chr> <chr>
## 1 3.76 1.77
## 2 "1.2 " 0.76
## 3 5.84 0.83
## 4 3.06 0.44
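Because these columns are character, they need converting before any arithmetic. The sketch below uses the four mean strings from the output above, typed in by hand, and converts them with base R. The formatted strings are already rounded, so for real calculations the numeric value column of the skim_df object is the better source.

```r
# Formatted means copied from the printed output above (note the padded "1.2 ").
formatted_mean <- c("3.76", "1.2 ", "5.84", "3.06")

# Strip the alignment padding, then convert to numeric.
numeric_mean <- as.numeric(trimws(formatted_mean))

print(numeric_mean)
print(mean(numeric_mean))
```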
skim_to_wide() returns a single data frame with each variable in a row, again using formatted, character values. Variables that do not report a given statistic are assigned NA for that statistic. The results may be sparse, and users should be aware that statistics such as mean that apply over many types of data (such as dates) should be analyzed carefully.
iris %>% skim_to_wide()
## # A tibble: 5 x 16
## type variable missing complete n n_unique top_counts ordered mean
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 fact… Species 0 150 150 3 set: 50, … FALSE <NA>
## 2 nume… Petal.L… 0 150 150 <NA> <NA> <NA> 3.76
## 3 nume… Petal.W… 0 150 150 <NA> <NA> <NA> "1.2…
## 4 nume… Sepal.L… 0 150 150 <NA> <NA> <NA> 5.84
## 5 nume… Sepal.W… 0 150 150 <NA> <NA> <NA> 3.06
## # … with 7 more variables: sd <chr>, p0 <chr>, p25 <chr>, p50 <chr>,
## # p75 <chr>, p100 <chr>, hist <chr>
skimr is opinionated in its choice of defaults, but users can easily add to, replace, or remove the statistics for a class.
To add a statistic, create a named list for each class using the format below:
classname = list(mad_name = mad)
skim_with(numeric = list(mad_name = mad))
skim(datasets::chickwts)
## Skim summary statistics
## n obs: 71
## n variables: 2
##
## ── Variable type:factor ─────────────────────────────────────────────
## variable missing complete n n_unique top_counts
## feed 0 71 71 6 soy: 14, cas: 12, lin: 12, sun: 12
## ordered
## FALSE
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## weight 0 71 71 261.31 78.07 108 204.5 258 323.5 423
## hist mad_name
## ▃▅▅▇▃▇▂▂ 91.92
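A statistic supplied this way is an ordinary R function that takes a vector, so it can be tried on a column directly before it is registered. The sketch below checks mad() against the value printed above and defines a coefficient-of-variation function, cv(), which is a hypothetical example and not part of skimr.

```r
weight <- datasets::chickwts$weight

# The same function that was registered as mad_name above.
mad_weight <- mad(weight)
print(round(mad_weight, 2))  # 91.92, matching the mad_name column

# Hypothetical custom statistic: coefficient of variation.
cv <- function(x) sd(x) / mean(x)
cv_weight <- cv(weight)
print(round(cv_weight, 2))
```

Following the format shown earlier, such a function would presumably be supplied as numeric = list(cv = cv).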
When skim_with() is used to modify the statistics, the new lists of statistics remain in place until they are reset using skim_with_defaults().
By default skim_with() appends the new statistics, but setting append = FALSE replaces the defaults.
skim_with_defaults()
skim_with(numeric = list(mad_name = mad), append = FALSE)
skim(datasets::chickwts)
## Skim summary statistics
## n obs: 71
## n variables: 2
##
## ── Variable type:factor ─────────────────────────────────────────────
## variable missing complete n n_unique top_counts
## feed 0 71 71 6 soy: 14, cas: 12, lin: 12, sun: 12
## ordered
## FALSE
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable mad_name
## weight 91.92
skim_with_defaults() # Reset to defaults
You can also use skim_with() to remove specific statistics by setting them to NULL.
skim_with(numeric = list(hist = NULL))
skim(datasets::chickwts)
## Skim summary statistics
## n obs: 71
## n variables: 2
##
## ── Variable type:factor ─────────────────────────────────────────────
## variable missing complete n n_unique top_counts
## feed 0 71 71 6 soy: 14, cas: 12, lin: 12, sun: 12
## ordered
## FALSE
##
## ── Variable type:numeric ────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## weight 0 71 71 261.31 78.07 108 204.5 258 323.5 423
skim_with_defaults()
When printing, skimr formats displayed statistics in an opinionated way; these values are stored in the formatted column of the skim_df object and are always character. skim() attempts to use a reasonable number of decimal places for calculated values based on the data type (integer or numeric) and number of stored decimals. For statistics such as p0 and p100, the actual stored values are displayed. Decimals in a column are aligned. Date formats are used for date statistics.
Users can override these opinionated formats using skim_format(). show_formats() will display the current options in use for each data type. Using skim_format_defaults() will reset the formats to their default settings.
skim()
The skim_df object is a long data frame with one row for each combination of variable and statistic (and optionally for group). The horizontal display is created by default using print.skim_df(); users can specify additional options by explicitly calling print([skim_df object], ...).
skim_df objects can also be rendered using kable() and pander(). These both provide more control over the rendered results, particularly when used in conjunction with knitr. These options are documented in more detail in the knitr package for kable() and the pander package for pander(). Using either of these may require use of document or chunk options and fonts, including a chunk option of results = 'asis'. This topic is addressed in more detail in the Using Fonts vignette. Because of this complexity, the samples below are shown as they would be in the console.
skim(iris) %>% skimr::kable()
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## Variable type: factor
##
## variable missing complete n n_unique top_counts ordered
## ---------- --------- ---------- ----- ---------- ---------------------------------- ---------
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0 FALSE
##
## Variable type: numeric
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## -------------- --------- ---------- ----- ------ ------ ----- ----- ------ ----- ------ ----------
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▁▂▅▅▃▁
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5 ▇▁▁▅▃▃▂▂
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▂▇▅▇▆▅▂▂
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4 ▁▂▅▇▃▂▁▁
library(pander)
##
## Attaching package: 'pander'
## The following object is masked from 'package:skimr':
##
## pander
panderOptions('knitr.auto.asis', FALSE)
skim(iris) %>% pander()
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ------------------------------------------------
## variable missing complete n n_unique
## ---------- --------- ---------- ----- ----------
## Species 0 150 150 3
## ------------------------------------------------
##
## Table: Table continues below
##
##
## ------------------------------------------
## top_counts ordered
## -------------------------------- ---------
## set: 50, ver: 50, vir: 50, NA: FALSE
## 0
## ------------------------------------------
##
##
## --------------------------------------------------------------------------------
## variable missing complete n mean sd p0 p25 p50 p75
## -------------- --------- ---------- ----- ------ ------ ----- ----- ------ -----
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1
##
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8
##
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4
##
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3
## --------------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------
## p100 hist
## ------ ----------
## 6.9 ▇▁▁▂▅▅▃▁
##
## 2.5 ▇▁▁▅▃▃▂▂
##
## 7.9 ▂▇▅▇▆▅▂▂
##
## 4.4 ▁▂▅▇▃▂▁▁
## -----------------
The details of rendering are dependent on the operating system R is running on, the locale of the installation, and the fonts installed. Rendering may also differ based on whether it occurs in the console or when knitting to specific types of documents such as HTML and PDF.
The most commonly reported problems involve rendering the spark graphs (inline histograms). Currently pander() does not support inline_histograms on Windows. Also, Windows does not support sparkline graphs.
To render the spark graphs in HTML or PDF output, you may need to change the font to one that supports block or Braille characters (depending on which you need). Please review the separate vignette and associated template for details.