Descriptive statistics are used to summarize data. They enable us to present the data in a more meaningful way and to discern any patterns in it. They can be useful for two purposes:
This document introduces you to a basic set of functions that describe data. A second vignette covers the functions that help visualize statistical distributions.
We have modified the mtcars data to create a new data set, mtcarz. The only difference between the two data sets is in the variable types.
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
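The structure above suggests that the categorical columns of mtcars were converted to factors. A minimal base R sketch of that conversion follows. It assumes that converting cyl, vs, am, gear and carb is the only change, and it is not necessarily how the packaged data set was built.

```r
# Assumption: mtcarz is mtcars with these five columns stored as factors,
# as the structure printed above suggests.
mtcarz <- mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcarz[factor_cols] <- lapply(mtcarz[factor_cols], factor)

# Inspect the variable types
str(mtcarz)
```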
The ds_screener() function will screen a data set and return the following:

- Column/Variable Names
- Data Type
- Levels (in case of categorical data)
- Number of missing observations
- % of missing observations
## -----------------------------------------------------------------------
## | Column Name | Data Type | Levels | Missing | Missing (%) |
## -----------------------------------------------------------------------
## | mpg | numeric | NA | 0 | 0 |
## | cyl | factor | 4 6 8 | 0 | 0 |
## | disp | numeric | NA | 0 | 0 |
## | hp | numeric | NA | 0 | 0 |
## | drat | numeric | NA | 0 | 0 |
## | wt | numeric | NA | 0 | 0 |
## | qsec | numeric | NA | 0 | 0 |
## | vs | factor | 0 1 | 0 | 0 |
## | am | factor | 0 1 | 0 | 0 |
## | gear | factor | 3 4 5 | 0 | 0 |
## | carb | factor |1 2 3 4 6 8| 0 | 0 |
## -----------------------------------------------------------------------
##
## Overall Missing Values 0
## Percentage of Missing Values 0 %
## Rows with Missing Values 0
## Columns With Missing Values 0
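To make the screening report concrete, the same items can be collected with base R. This is only an illustrative sketch, not the implementation of ds_screener(). The `screen` data frame and its column names are hypothetical, and mtcarz is rebuilt under the assumption that it is mtcars with five factor columns.

```r
# Assumption: mtcarz is mtcars with five columns converted to factors.
mtcarz <- mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcarz[factor_cols] <- lapply(mtcarz[factor_cols], factor)

# One row per column: type, factor levels (NA for non-factors), missing count
screen <- data.frame(
  column  = names(mtcarz),
  type    = vapply(mtcarz, function(x) class(x)[1], character(1)),
  levels  = vapply(
    mtcarz,
    function(x) if (is.factor(x)) paste(levels(x), collapse = " ") else NA_character_,
    character(1)
  ),
  missing = vapply(mtcarz, function(x) sum(is.na(x)), integer(1)),
  row.names = NULL
)

# Share of missing observations per column, in percent
screen$missing_pct <- 100 * screen$missing / nrow(mtcarz)
screen
```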
The ds_summary_stats() function returns a comprehensive set of statistics for continuous data.
## Univariate Analysis
##
## N 32.00 Variance 36.32
## Missing 0.00 Std Deviation 6.03
## Mean 20.09 Range 23.50
## Median 19.20 Interquartile Range 7.38
## Mode 10.40 Uncorrected SS 14042.31
## Trimmed Mean 19.95 Corrected SS 1126.05
## Skewness 0.67 Coeff Variation 30.00
## Kurtosis -0.02 Std Error Mean 1.07
##
## Quantiles
##
## Quantile Value
##
## Max 33.90
## 99% 33.44
## 95% 31.30
## 90% 30.09
## Q3 22.80
## Median 19.20
## Q1 15.43
## 10% 14.34
## 5% 12.00
## 1% 10.40
## Min 10.40
##
## Extreme Values
##
## Low High
##
## Obs Value Obs Value
## 15 10.4 20 33.9
## 16 10.4 18 32.4
## 24 13.3 19 30.4
## 7 14.3 28 30.4
## 17 14.7 26 27.3
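Several of the univariate statistics in the report can be checked by hand with base R. The sketch below computes them for mpg. The names in `stats` are chosen for illustration, and measures whose definition varies between packages (skewness, kurtosis, trimmed mean) are left out on purpose.

```r
x <- mtcars$mpg
n <- length(x)

stats <- c(
  n               = n,
  mean            = mean(x),
  median          = median(x),
  variance        = var(x),
  std_deviation   = sd(x),
  range           = max(x) - min(x),
  iqr             = IQR(x),
  uncorrected_ss  = sum(x^2),              # sum of squared values
  corrected_ss    = sum((x - mean(x))^2),  # sum of squared deviations from the mean
  coeff_variation = 100 * sd(x) / mean(x),
  std_error_mean  = sd(x) / sqrt(n)
)

round(stats, 2)
```

The corrected sum of squares equals the variance multiplied by n - 1, which is a quick consistency check on the report.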
The ds_cross_table() function creates two-way tables of categorical variables.
## Cell Contents
## |---------------|
## | Frequency |
## | Percent |
## | Row Pct |
## | Col Pct |
## |---------------|
##
## Total Observations: 32
##
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | cyl | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 4 | 1 | 8 | 2 | 11 |
## | | 0.031 | 0.25 | 0.062 | |
## | | 0.09 | 0.73 | 0.18 | 0.34 |
## | | 0.07 | 0.67 | 0.4 | |
## ----------------------------------------------------------------------------
## | 6 | 2 | 4 | 1 | 7 |
## | | 0.062 | 0.125 | 0.031 | |
## | | 0.29 | 0.57 | 0.14 | 0.22 |
## | | 0.13 | 0.33 | 0.2 | |
## ----------------------------------------------------------------------------
## | 8 | 12 | 0 | 2 | 14 |
## | | 0.375 | 0 | 0.062 | |
## | | 0.86 | 0 | 0.14 | 0.44 |
## | | 0.8 | 0 | 0.4 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.468 | 0.375 | 0.155 | |
## ----------------------------------------------------------------------------
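Each cell of the cross table holds four numbers: the frequency, the share of all observations, the share within its row and the share within its column. A base R sketch of those four quantities, independent of the package, is shown here for cyl and gear.

```r
# Frequency of each cyl/gear combination
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)

cell_pct <- prop.table(counts)              # share of all 32 observations
row_pct  <- prop.table(counts, margin = 1)  # share within each cyl level
col_pct  <- prop.table(counts, margin = 2)  # share within each gear level

# Frequencies with row and column totals
addmargins(counts)
round(row_pct, 2)
```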
ds_twoway_table() will return a tibble.
## Joining, by = c("cyl", "gear", "count")
## # A tibble: 8 x 6
## cyl gear count percent row_percent col_percent
## <fct> <fct> <int> <dbl> <dbl> <dbl>
## 1 4 3 1 0.0312 0.0909 0.0667
## 2 4 4 8 0.250 0.727 0.667
## 3 4 5 2 0.0625 0.182 0.400
## 4 6 3 2 0.0625 0.286 0.133
## 5 6 4 4 0.125 0.571 0.333
## 6 6 5 1 0.0312 0.143 0.200
## 7 8 3 12 0.375 0.857 0.800
## 8 8 5 2 0.0625 0.143 0.400
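The long layout above, with one row per observed combination, can be approximated in base R. The following is a sketch that returns a plain data frame rather than a tibble. The name `long` is hypothetical, and empty combinations are dropped to match the eight rows shown.

```r
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# One row per cyl/gear combination, keeping only combinations that occur
long <- as.data.frame(counts, responseName = "count")
long <- long[long$count > 0, ]

# Shares of the grand total, of the row (cyl) total and of the column (gear) total
long$percent     <- long$count / sum(long$count)
long$row_percent <- long$count / ave(long$count, long$cyl, FUN = sum)
long$col_percent <- long$count / ave(long$count, long$gear, FUN = sum)

long <- long[order(long$cyl, long$gear), ]
rownames(long) <- NULL
long
```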
A plot method has been defined which will generate:
The ds_freq_table() function creates frequency tables for categorical variables.
## Variable: cyl
## -----------------------------------------------------------------------
## Levels Frequency Cum Frequency Percent Cum Percent
## -----------------------------------------------------------------------
## 4 11 11 34.38 34.38
## -----------------------------------------------------------------------
## 6 7 18 21.88 56.25
## -----------------------------------------------------------------------
## 8 14 32 43.75 100
## -----------------------------------------------------------------------
## Total 32 - 100.00 -
## -----------------------------------------------------------------------
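The columns of the frequency table are simple to derive: counts per level, their running total and both expressed as percentages. A base R sketch for cyl, with hypothetical column names, looks like this.

```r
# Count of observations per level of cyl
freq <- table(mtcars$cyl)
n    <- as.vector(freq)

freq_table <- data.frame(
  level         = names(freq),
  frequency     = n,
  cum_frequency = cumsum(n),
  percent       = round(100 * n / sum(n), 2),
  cum_percent   = round(100 * cumsum(n) / sum(n), 2)
)
freq_table
```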
The ds_freq_cont() function creates frequency tables for continuous variables. The default number of intervals is 5; the output below uses 4.
## Variable: mpg
## |---------------------------------------------------------------------------|
## | Bins | Frequency | Cum Frequency | Percent | Cum Percent |
## |---------------------------------------------------------------------------|
## | 10.4 - 16.3 | 10 | 10 | 31.25 | 31.25 |
## |---------------------------------------------------------------------------|
## | 16.3 - 22.1 | 13 | 23 | 40.62 | 71.88 |
## |---------------------------------------------------------------------------|
## | 22.1 - 28 | 5 | 28 | 15.62 | 87.5 |
## |---------------------------------------------------------------------------|
## | 28 - 33.9 | 4 | 32 | 12.5 | 100 |
## |---------------------------------------------------------------------------|
## | Total | 32 | - | 100.00 | - |
## |---------------------------------------------------------------------------|
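The table above splits the range of mpg into four equal-width intervals. One way to reproduce that binning in base R is with cut(). This is a sketch under the assumption that intervals are closed on the right, and the package may treat the bin edges differently.

```r
x      <- mtcars$mpg
n_bins <- 4

# Equal-width break points from the minimum to the maximum
width  <- (max(x) - min(x)) / n_bins
breaks <- min(x) + (0:n_bins) * width
breaks[n_bins + 1] <- max(x)  # guard against floating-point drift at the upper edge

# Right-closed intervals; include.lowest keeps the minimum in the first bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)
freq <- as.vector(table(bins))

data.frame(
  bin           = levels(bins),
  frequency     = freq,
  cum_frequency = cumsum(freq),
  percent       = round(100 * freq / length(x), 2)
)
```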
The ds_group_summary() function returns descriptive statistics of a continuous variable for the different levels of a categorical variable.
## mpg by cyl
## -----------------------------------------------------------------------------------------
## | Statistic/Levels| 4| 6| 8|
## -----------------------------------------------------------------------------------------
## | Obs| 11| 7| 14|
## | Minimum| 21.4| 17.8| 10.4|
## | Maximum| 33.9| 21.4| 19.2|
## | Mean| 26.66| 19.74| 15.1|
## | Median| 26| 19.7| 15.2|
## | Mode| 22.8| 21| 10.4|
## | Std. Deviation| 4.51| 1.45| 2.56|
## | Variance| 20.34| 2.11| 6.55|
## | Skewness| 0.35| -0.26| -0.46|
## | Kurtosis| -1.43| -1.83| 0.33|
## | Uncorrected SS| 8023.83| 2741.14| 3277.34|
## | Corrected SS| 203.39| 12.68| 85.2|
## | Coeff Variation| 16.91| 7.36| 16.95|
## | Std. Error Mean| 1.36| 0.55| 0.68|
## | Range| 12.5| 3.6| 8.8|
## | Interquartile Range| 7.6| 2.35| 1.85|
## -----------------------------------------------------------------------------------------
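A grouped summary applies the same set of statistics to each subset of the data. The base R sketch below does this for mpg by cyl using split() and sapply(). The helper `describe()` is hypothetical and covers only statistics with an unambiguous definition.

```r
# Hypothetical helper: a few descriptive statistics for one numeric vector
describe <- function(x) {
  c(
    obs           = length(x),
    minimum       = min(x),
    maximum       = max(x),
    mean          = mean(x),
    median        = median(x),
    std_deviation = sd(x),
    variance      = var(x),
    range         = max(x) - min(x),
    iqr           = IQR(x)
  )
}

# One column of statistics per level of cyl
by_cyl <- sapply(split(mtcars$mpg, mtcars$cyl), describe)
round(by_cyl, 2)
```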
ds_group_summary() returns a tibble which can be used for further analysis.
## # A tibble: 3 x 15
## cyl length min max mean median mode sd variance skewness
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 11 21.4 33.9 26.7 26.0 22.8 4.51 20.3 0.348
## 2 6 7 17.8 21.4 19.7 19.7 21.0 1.45 2.11 -0.259
## 3 8 14 10.4 19.2 15.1 15.2 10.4 2.56 6.55 -0.456
## # ... with 5 more variables: kurtosis <dbl>, coeff_var <dbl>,
## # std_error <dbl>, range <dbl>, iqr <dbl>
The ds_multi_stats() function generates summary/descriptive statistics for variables in a data frame/tibble.
## # A tibble: 3 x 16
## vars min max mean t_mean median mode range variance stdev skew
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 disp 71.1 472 231 228 196 276 401 15361 124 0.420
## 2 hp 52.0 335 147 144 123 110 283 4701 68.6 0.799
## 3 mpg 10.4 33.9 20.1 20.0 19.2 10.4 23.5 36.3 6.03 0.672
## # ... with 5 more variables: kurtosis <dbl>, coeff_var <dbl>, q1 <dbl>,
## # q3 <dbl>, iqrange <dbl>
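Summarizing several variables at once amounts to applying one function to each selected column. A base R sketch for disp, hp and mpg follows. It returns a matrix rather than a tibble, and the statistic names are chosen for illustration.

```r
vars <- c("disp", "hp", "mpg")

# Rows are variables, columns are statistics
multi_stats <- t(sapply(mtcars[vars], function(x) {
  c(
    min      = min(x),
    max      = max(x),
    mean     = mean(x),
    median   = median(x),
    range    = max(x) - min(x),
    variance = var(x),
    stdev    = sd(x)
  )
}))

round(multi_stats, 2)
```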
The ds_oway_tables() function creates multiple one-way tables by creating a frequency table for each categorical variable in a data frame.
## Variable: cyl
## -----------------------------------------------------------------------
## Levels Frequency Cum Frequency Percent Cum Percent
## -----------------------------------------------------------------------
## 4 11 11 34.38 34.38
## -----------------------------------------------------------------------
## 6 7 18 21.88 56.25
## -----------------------------------------------------------------------
## 8 14 32 43.75 100
## -----------------------------------------------------------------------
## Total 32 - 100.00 -
## -----------------------------------------------------------------------
##
## Variable: vs
## -----------------------------------------------------------------------
## Levels Frequency Cum Frequency Percent Cum Percent
## -----------------------------------------------------------------------
## 0 18 18 56.25 56.25
## -----------------------------------------------------------------------
## 1 14 32 43.75 100
## -----------------------------------------------------------------------
## Total 32 - 100.00 -
## -----------------------------------------------------------------------
##
## Variable: am
## -----------------------------------------------------------------------
## Levels Frequency Cum Frequency Percent Cum Percent
## -----------------------------------------------------------------------
## 0 19 19 59.38 59.38
## -----------------------------------------------------------------------
## 1 13 32 40.62 100
## -----------------------------------------------------------------------
## Total 32 - 100.00 -
## -----------------------------------------------------------------------
##
## Variable: gear
## -----------------------------------------------------------------------
## Levels Frequency Cum Frequency Percent Cum Percent
## -----------------------------------------------------------------------
## 3 15 15 46.88 46.88
## -----------------------------------------------------------------------
## 4 12 27 37.5 84.38
## -----------------------------------------------------------------------
## 5 5 32 15.62 100
## -----------------------------------------------------------------------
## Total 32 - 100.00 -
## -----------------------------------------------------------------------
##
## Variable: carb
## -----------------------------------------------------------------------
## Levels Frequency Cum Frequency Percent Cum Percent
## -----------------------------------------------------------------------
## 1 7 7 21.88 21.88
## -----------------------------------------------------------------------
## 2 10 17 31.25 53.12
## -----------------------------------------------------------------------
## 3 3 20 9.38 62.5
## -----------------------------------------------------------------------
## 4 10 30 31.25 93.75
## -----------------------------------------------------------------------
## 6 1 31 3.12 96.88
## -----------------------------------------------------------------------
## 8 1 32 3.12 100
## -----------------------------------------------------------------------
## Total 32 - 100.00 -
## -----------------------------------------------------------------------
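The idea of one table per categorical variable can be sketched in base R by selecting the factor columns and tabulating each of them. Here mtcarz is rebuilt under the assumption that it is mtcars with five factor columns, and `one_way` is a hypothetical name.

```r
# Assumption: mtcarz is mtcars with five columns converted to factors.
mtcarz <- mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcarz[factor_cols] <- lapply(mtcarz[factor_cols], factor)

# Pick out the factor columns and build a frequency table for each
is_categorical <- vapply(mtcarz, is.factor, logical(1))
one_way <- lapply(mtcarz[is_categorical], table)

one_way$carb
```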
The ds_tway_tables() function creates multiple two-way tables by creating a cross table for each unique pair of categorical variables in a data frame.
## Cell Contents
## |---------------|
## | Frequency |
## | Percent |
## | Row Pct |
## | Col Pct |
## |---------------|
##
## Total Observations: 32
##
## cyl vs vs
## -------------------------------------------------------------
## | | vs |
## -------------------------------------------------------------
## | cyl | 0 | 1 | Row Total |
## -------------------------------------------------------------
## | 4 | 1 | 10 | 11 |
## | | 0.031 | 0.312 | |
## | | 0.09 | 0.91 | 0.34 |
## | | 0.06 | 0.71 | |
## -------------------------------------------------------------
## | 6 | 3 | 4 | 7 |
## | | 0.094 | 0.125 | |
## | | 0.43 | 0.57 | 0.22 |
## | | 0.17 | 0.29 | |
## -------------------------------------------------------------
## | 8 | 14 | 0 | 14 |
## | | 0.438 | 0 | |
## | | 1 | 0 | 0.44 |
## | | 0.78 | 0 | |
## -------------------------------------------------------------
## | Column Total | 18 | 14 | 32 |
## | | 0.563 | 0.437 | |
## -------------------------------------------------------------
##
##
## cyl vs am
## -------------------------------------------------------------
## | | am |
## -------------------------------------------------------------
## | cyl | 0 | 1 | Row Total |
## -------------------------------------------------------------
## | 4 | 3 | 8 | 11 |
## | | 0.094 | 0.25 | |
## | | 0.27 | 0.73 | 0.34 |
## | | 0.16 | 0.62 | |
## -------------------------------------------------------------
## | 6 | 4 | 3 | 7 |
## | | 0.125 | 0.094 | |
## | | 0.57 | 0.43 | 0.22 |
## | | 0.21 | 0.23 | |
## -------------------------------------------------------------
## | 8 | 12 | 2 | 14 |
## | | 0.375 | 0.062 | |
## | | 0.86 | 0.14 | 0.44 |
## | | 0.63 | 0.15 | |
## -------------------------------------------------------------
## | Column Total | 19 | 13 | 32 |
## | | 0.594 | 0.406 | |
## -------------------------------------------------------------
##
##
## cyl vs gear
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | cyl | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 4 | 1 | 8 | 2 | 11 |
## | | 0.031 | 0.25 | 0.062 | |
## | | 0.09 | 0.73 | 0.18 | 0.34 |
## | | 0.07 | 0.67 | 0.4 | |
## ----------------------------------------------------------------------------
## | 6 | 2 | 4 | 1 | 7 |
## | | 0.062 | 0.125 | 0.031 | |
## | | 0.29 | 0.57 | 0.14 | 0.22 |
## | | 0.13 | 0.33 | 0.2 | |
## ----------------------------------------------------------------------------
## | 8 | 12 | 0 | 2 | 14 |
## | | 0.375 | 0 | 0.062 | |
## | | 0.86 | 0 | 0.14 | 0.44 |
## | | 0.8 | 0 | 0.4 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.468 | 0.375 | 0.155 | |
## ----------------------------------------------------------------------------
##
##
## cyl vs carb
## -------------------------------------------------------------------------------------------------------------------------
## | | carb |
## -------------------------------------------------------------------------------------------------------------------------
## | cyl | 1 | 2 | 3 | 4 | 6 | 8 | Row Total |
## -------------------------------------------------------------------------------------------------------------------------
## | 4 | 5 | 6 | 0 | 0 | 0 | 0 | 11 |
## | | 0.156 | 0.188 | 0 | 0 | 0 | 0 | |
## | | 0.45 | 0.55 | 0 | 0 | 0 | 0 | 0.34 |
## | | 0.71 | 0.6 | 0 | 0 | 0 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 6 | 2 | 0 | 0 | 4 | 1 | 0 | 7 |
## | | 0.062 | 0 | 0 | 0.125 | 0.031 | 0 | |
## | | 0.29 | 0 | 0 | 0.57 | 0.14 | 0 | 0.22 |
## | | 0.29 | 0 | 0 | 0.4 | 1 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 8 | 0 | 4 | 3 | 6 | 0 | 1 | 14 |
## | | 0 | 0.125 | 0.094 | 0.188 | 0 | 0.031 | |
## | | 0 | 0.29 | 0.21 | 0.43 | 0 | 0.07 | 0.44 |
## | | 0 | 0.4 | 1 | 0.6 | 0 | 1 | |
## -------------------------------------------------------------------------------------------------------------------------
## | Column Total | 7 | 10 | 3 | 10 | 1 | 1 | 32 |
## | | 0.218 | 0.313 | 0.094 | 0.313 | 0.031 | 0.031 | |
## -------------------------------------------------------------------------------------------------------------------------
##
##
## vs vs am
## -------------------------------------------------------------
## | | am |
## -------------------------------------------------------------
## | vs | 0 | 1 | Row Total |
## -------------------------------------------------------------
## | 0 | 12 | 6 | 18 |
## | | 0.375 | 0.188 | |
## | | 0.67 | 0.33 | 0.56 |
## | | 0.63 | 0.46 | |
## -------------------------------------------------------------
## | 1 | 7 | 7 | 14 |
## | | 0.219 | 0.219 | |
## | | 0.5 | 0.5 | 0.44 |
## | | 0.37 | 0.54 | |
## -------------------------------------------------------------
## | Column Total | 19 | 13 | 32 |
## | | 0.594 | 0.407 | |
## -------------------------------------------------------------
##
##
## vs vs gear
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | vs | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 0 | 12 | 2 | 4 | 18 |
## | | 0.375 | 0.062 | 0.125 | |
## | | 0.67 | 0.11 | 0.22 | 0.56 |
## | | 0.8 | 0.17 | 0.8 | |
## ----------------------------------------------------------------------------
## | 1 | 3 | 10 | 1 | 14 |
## | | 0.094 | 0.312 | 0.031 | |
## | | 0.21 | 0.71 | 0.07 | 0.44 |
## | | 0.2 | 0.83 | 0.2 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.469 | 0.374 | 0.156 | |
## ----------------------------------------------------------------------------
##
##
## vs vs carb
## -------------------------------------------------------------------------------------------------------------------------
## | | carb |
## -------------------------------------------------------------------------------------------------------------------------
## | vs | 1 | 2 | 3 | 4 | 6 | 8 | Row Total |
## -------------------------------------------------------------------------------------------------------------------------
## | 0 | 0 | 5 | 3 | 8 | 1 | 1 | 18 |
## | | 0 | 0.156 | 0.094 | 0.25 | 0.031 | 0.031 | |
## | | 0 | 0.28 | 0.17 | 0.44 | 0.06 | 0.06 | 0.56 |
## | | 0 | 0.5 | 1 | 0.8 | 1 | 1 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 1 | 7 | 5 | 0 | 2 | 0 | 0 | 14 |
## | | 0.219 | 0.156 | 0 | 0.062 | 0 | 0 | |
## | | 0.5 | 0.36 | 0 | 0.14 | 0 | 0 | 0.44 |
## | | 1 | 0.5 | 0 | 0.2 | 0 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | Column Total | 7 | 10 | 3 | 10 | 1 | 1 | 32 |
## | | 0.219 | 0.312 | 0.094 | 0.312 | 0.031 | 0.031 | |
## -------------------------------------------------------------------------------------------------------------------------
##
##
## am vs gear
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | am | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 0 | 15 | 4 | 0 | 19 |
## | | 0.469 | 0.125 | 0 | |
## | | 0.79 | 0.21 | 0 | 0.59 |
## | | 1 | 0.33 | 0 | |
## ----------------------------------------------------------------------------
## | 1 | 0 | 8 | 5 | 13 |
## | | 0 | 0.25 | 0.156 | |
## | | 0 | 0.62 | 0.38 | 0.41 |
## | | 0 | 0.67 | 1 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.469 | 0.375 | 0.156 | |
## ----------------------------------------------------------------------------
##
##
## am vs carb
## -------------------------------------------------------------------------------------------------------------------------
## | | carb |
## -------------------------------------------------------------------------------------------------------------------------
## | am | 1 | 2 | 3 | 4 | 6 | 8 | Row Total |
## -------------------------------------------------------------------------------------------------------------------------
## | 0 | 3 | 6 | 3 | 7 | 0 | 0 | 19 |
## | | 0.094 | 0.188 | 0.094 | 0.219 | 0 | 0 | |
## | | 0.16 | 0.32 | 0.16 | 0.37 | 0 | 0 | 0.6 |
## | | 0.43 | 0.6 | 1 | 0.7 | 0 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 1 | 4 | 4 | 0 | 3 | 1 | 1 | 13 |
## | | 0.125 | 0.125 | 0 | 0.094 | 0.031 | 0.031 | |
## | | 0.31 | 0.31 | 0 | 0.23 | 0.08 | 0.08 | 0.41 |
## | | 0.57 | 0.4 | 0 | 0.3 | 1 | 1 | |
## -------------------------------------------------------------------------------------------------------------------------
## | Column Total | 7 | 10 | 3 | 10 | 1 | 1 | 32 |
## | | 0.219 | 0.313 | 0.094 | 0.313 | 0.031 | 0.031 | |
## -------------------------------------------------------------------------------------------------------------------------
##
##
## gear vs carb
## -------------------------------------------------------------------------------------------------------------------------
## | | carb |
## -------------------------------------------------------------------------------------------------------------------------
## | gear | 1 | 2 | 3 | 4 | 6 | 8 | Row Total |
## -------------------------------------------------------------------------------------------------------------------------
## | 3 | 3 | 4 | 3 | 5 | 0 | 0 | 15 |
## | | 0.094 | 0.125 | 0.094 | 0.156 | 0 | 0 | |
## | | 0.2 | 0.27 | 0.2 | 0.33 | 0 | 0 | 0.47 |
## | | 0.43 | 0.4 | 1 | 0.5 | 0 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 4 | 4 | 4 | 0 | 4 | 0 | 0 | 12 |
## | | 0.125 | 0.125 | 0 | 0.125 | 0 | 0 | |
## | | 0.33 | 0.33 | 0 | 0.33 | 0 | 0 | 0.38 |
## | | 0.57 | 0.4 | 0 | 0.4 | 0 | 0 | |
## -------------------------------------------------------------------------------------------------------------------------
## | 5 | 0 | 2 | 0 | 1 | 1 | 1 | 5 |
## | | 0 | 0.062 | 0 | 0.031 | 0.031 | 0.031 | |
## | | 0 | 0.4 | 0 | 0.2 | 0.2 | 0.2 | 0.16 |
## | | 0 | 0.2 | 0 | 0.1 | 1 | 1 | |
## -------------------------------------------------------------------------------------------------------------------------
## | Column Total | 7 | 10 | 3 | 10 | 1 | 1 | 32 |
## | | 0.219 | 0.312 | 0.094 | 0.312 | 0.031 | 0.031 | |
## -------------------------------------------------------------------------------------------------------------------------
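With five categorical variables there are ten unique pairs, which is why ten tables are printed above. A base R sketch that enumerates the pairs with combn() and tabulates each one is given below. It reproduces only the frequencies, mtcarz is rebuilt under the same factor-conversion assumption as before, and `two_way` is a hypothetical name.

```r
# Assumption: mtcarz is mtcars with five columns converted to factors.
mtcarz <- mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcarz[factor_cols] <- lapply(mtcarz[factor_cols], factor)

# Every unique pair of categorical variables, in column order
pairs <- combn(factor_cols, 2, simplify = FALSE)

# One frequency table per pair, labelled "<row variable> vs <column variable>"
two_way <- lapply(pairs, function(p) {
  table(mtcarz[[p[1]]], mtcarz[[p[2]]], dnn = p)
})
names(two_way) <- vapply(pairs, paste, character(1), collapse = " vs ")

two_way[["am vs gear"]]
```

Row and column percentages for any of these tables can then be obtained with prop.table(), using margin = 1 or margin = 2.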