ggbetweenstats

Indrajeet Patil

2018-05-22

The function ggstatsplot::ggbetweenstats is designed with a publication-ready box-violin plot in mind, with all statistical details included in the plot itself. We will see examples of how to use this function in this vignette.

To begin with, here are some instances where you would want to use ggbetweenstats-

Note before: The following demo uses the pipe operator (%>%), so in case you are not familiar with this operator, here is a good explanation: http://r4ds.had.co.nz/pipes.html

Between-subjects comparisons with ggbetweenstats

To illustrate how this function can be used, we will use the gapminder dataset. This dataset (available in eponymous package on CRAN) provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007, for each of 142 countries and was collected by the Gapminder Foundation. Let’s have a look at the data-

library(gapminder)
library(dplyr)

dplyr::glimpse(x = gapminder::gapminder)
#> Observations: 1,704
#> Variables: 6
#> $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
#> $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
#> $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Suppose the first thing we want to check is the distribution of life expectancy worldwide (across continents, i.e.) in 2007. That is, we want to see if the average (across countries in a given continent) life expectancy differs significantly across five continents.

The simplest form of the function call is-

library(ggstatsplot)
library(gapminder)

ggstatsplot::ggbetweenstats(
  data = dplyr::filter(.data = gapminder, year == 2007),
  x = continent,
  y = lifeExp,
  messages = FALSE
)

Note that the test automatically figures out whether independent t-test is to be run or an ANOVA based on the number of levels in the grouping variable. Additionally, the function output a ggplot object which means that it can be further modified.

We will see how this basic call can be further modified with additional arguments-

library(ggstatsplot)
library(gapminder)

ggstatsplot::ggbetweenstats(
  data = dplyr::filter(.data = gapminder, year == 2007),      # dataframe
  x = continent,                                              # grouping/independent variable
  y = lifeExp,                                                # dependent variables
  xlab = "Continent",                                         # label for the x-axis
  ylab = "Life expectancy",                                   # label for the y-axis
  plot.type = "boxviolin",                                    # type of plot
  type = "parametric",                                        # type of statistical test  
  effsize.type = "biased",                                    # type of effect size
  outlier.tagging = TRUE,                                     # whether outliers should be flagged
  outlier.coef = 1.5,                                         # coefficient for Tukey's rule
  outlier.label = country,                                    # label to attacht to outlier values
  outlier.label.color = "red",                                # outlier point label color
  mean.plotting = TRUE,                                       # whether the mean is to be displayed
  mean.color = "darkblue",                                    # color for mean
  messages = FALSE,                                           # turn off messages
  title = "Comparison of life expectancy across continents (Year: 2007)",
  caption = "Source: Gapminder Foundation"
) +                                                           # modifying the plot further
  ggplot2::scale_y_continuous(limits = c(35,85), breaks = seq(from = 35, to = 85, by = 5))

As can be appreciated from the effect size of 0.635, there are big differences in the observed life expectancy across continents. Importantly, this plot also helps us appreciate the differences within any given continent. For example, although Asian countries are doing much better than African countries, on average, Afghanistan has a particularly grim average for the Asian continent, possibly reflecting the war and the political turmoil.

Out of curiosity, we can repeat the same analysis with other available tests in ggstatsplot. The type (of test) argument also accepts the following abbreviations: "p" (for parametric), "np" (for nonparametric), "r" (for robust). Additionally, the type of plot to be displayed can also be modified ("box", "violin", or "boxviolin"). Let’s produce all of these variations in the plot below.

For example,

library(ggstatsplot)
library(gapminder)

# parametric ANOVA and box plot
p1 <- ggstatsplot::ggbetweenstats(
  data = dplyr::filter(.data = gapminder, year == 2007),
  x = continent,
  y = lifeExp,
  plot.type = "box",
  type = "p",
  messages = FALSE
)

# Kruskal-Wallis test (nonparametric ANOVA) and violin plot
p2 <- ggstatsplot::ggbetweenstats(
  data = dplyr::filter(.data = gapminder, year == 2007),
  x = continent,
  y = lifeExp,
  plot.type = "violin",
  type = "np",
  messages = FALSE
)

# robust ANOVA and boxviolin plot
p3 <- ggstatsplot::ggbetweenstats(
  data = dplyr::filter(.data = gapminder, year == 2007),
  x = continent,
  y = lifeExp,
  plot.type = "boxviolin",
  type = "r",
  messages = FALSE
)

# combining the individual plots into a single plot
ggstatsplot::combine_plots(
  p1, p2, p3, 
  nrow = 3, 
  ncol = 1, 
  labels = c("(a)", "(b)", "(c)"),
  title.text = "Comparison of life expectancy across continents (Year: 2007)",
  caption.text = "Note: Comparing results from parametric, non-parametric, and robust tests",
  title.size = 14,
  caption.size = 12
)

Grouped analysis with grouped_ggbetweenstats

What if we want to do the same analysis separately for each year for which the data is available, i.e. checking how the differences in life expectancy have changed since this data collection started in 1952 until 2007? In that case, we will have to either write a for loop or use purrr, both of which are time consuming and can be a bit of a struggle.

ggstatsplot provides a special helper function for such instances: grouped_ggbetweenstats. This is merely a wrapper function around ggstatsplot::combine_plots. It applies ggbetweenstats across all levels of a specified grouping variable and then combines list of individual plots into a single plot. Note that the grouping variable can be anything: conditions in a given study, groups in a study sample, different studies, etc.

Let’s focus on the following years to see these changes for every 10 years: 1957, 1967, 1977, 1987, 1997, 2007.

ggstatsplot::grouped_ggbetweenstats(
  # arguments relevant for ggstatsplot::ggbetweenstats
  data = dplyr::filter(
    .data = gapminder::gapminder,
    year == 1957 |
    year == 1967 |
    year == 1977 |
    year == 1987 |
    year == 1997 |
    year == 2007
  ),
  x = continent,
  y = lifeExp,
  outlier.tagging = TRUE,
  outlier.label = country,
  grouping.var = year,
  title.prefix = "Year",
  messages = FALSE,
  # arguments relevant for ggstatsplot::combine_plots
  title.text = "Changes in life expectancy across continents (1957-2007)",
  nrow = 6,
  ncol = 1,
  labels = c("(a)","(b)","(c)", "(d)", "(e)", "(f)")
)

As seen from the plot, although the life expectancy has been improving steadily across all continents as we go from 1957 to 2007, this improvement has not been happening at the same rate for all continents. Additionally, irrespective of which year we look at, we still find significant differences in life expectancy across continents which have been surprisingly consistent across five decades (based on the observed effect sizes).

Grouped analysis with ggbetweenstats + purrr

Although this grouping function provides a quick way to explore the data, it leaves much to be desired. For example, the same type of plot and test is applied for all years, but maybe we want to change this for different years, or maybe we want to gave different effect sizes for different years. This type of customization for different levels of a grouping variable is not possible with grouped_ggbetweenstats, but we will see how this can be easily achieved using the purrr package.

Note before: Unlike the function call so far, while using purrr::pmap, we will need to quote the arguments.

# let's split the dataframe and create a list by years of interest
year_list <- gapminder::gapminder %>%
  dplyr::filter(
    .data = .,
    year == 1957 |
    year == 1967 |
    year == 1977 |
    year == 1987 |
    year == 1997 |
    year == 2007
  ) %>%
  base::split(x = ., f = .$year, drop = TRUE)

# this created a list with 4 elements, one for each mpaa rating
str(year_list)
#> List of 6
#>  $ 1957:Classes 'tbl_df', 'tbl' and 'data.frame':    142 obs. of  6 variables:
#>   ..$ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
#>   ..$ year     : int [1:142] 1957 1957 1957 1957 1957 1957 1957 1957 1957 1957 ...
#>   ..$ lifeExp  : num [1:142] 30.3 59.3 45.7 32 64.4 ...
#>   ..$ pop      : int [1:142] 9240934 1476505 10270856 4561361 19610538 9712569 6965860 138655 51365468 8989111 ...
#>   ..$ gdpPercap: num [1:142] 821 1942 3014 3828 6857 ...
#>  $ 1967:Classes 'tbl_df', 'tbl' and 'data.frame':    142 obs. of  6 variables:
#>   ..$ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
#>   ..$ year     : int [1:142] 1967 1967 1967 1967 1967 1967 1967 1967 1967 1967 ...
#>   ..$ lifeExp  : num [1:142] 34 66.2 51.4 36 65.6 ...
#>   ..$ pop      : int [1:142] 11537966 1984060 12760499 5247469 22934225 11872264 7376998 202182 62821884 9556500 ...
#>   ..$ gdpPercap: num [1:142] 836 2760 3247 5523 8053 ...
#>  $ 1977:Classes 'tbl_df', 'tbl' and 'data.frame':    142 obs. of  6 variables:
#>   ..$ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
#>   ..$ year     : int [1:142] 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
#>   ..$ lifeExp  : num [1:142] 38.4 68.9 58 39.5 68.5 ...
#>   ..$ pop      : int [1:142] 14880372 2509048 17152804 6162675 26983828 14074100 7568430 297410 80428306 9821800 ...
#>   ..$ gdpPercap: num [1:142] 786 3533 4910 3009 10079 ...
#>  $ 1987:Classes 'tbl_df', 'tbl' and 'data.frame':    142 obs. of  6 variables:
#>   ..$ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
#>   ..$ year     : int [1:142] 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
#>   ..$ lifeExp  : num [1:142] 40.8 72 65.8 39.9 70.8 ...
#>   ..$ pop      : int [1:142] 13867957 3075321 23254956 7874230 31620918 16257249 7578903 454612 103764241 9870200 ...
#>   ..$ gdpPercap: num [1:142] 852 3739 5681 2430 9140 ...
#>  $ 1997:Classes 'tbl_df', 'tbl' and 'data.frame':    142 obs. of  6 variables:
#>   ..$ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
#>   ..$ year     : int [1:142] 1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
#>   ..$ lifeExp  : num [1:142] 41.8 73 69.2 41 73.3 ...
#>   ..$ pop      : int [1:142] 22227415 3428038 29072015 9875024 36203463 18565243 8069876 598561 123315288 10199787 ...
#>   ..$ gdpPercap: num [1:142] 635 3193 4797 2277 10967 ...
#>  $ 2007:Classes 'tbl_df', 'tbl' and 'data.frame':    142 obs. of  6 variables:
#>   ..$ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
#>   ..$ year     : int [1:142] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
#>   ..$ lifeExp  : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
#>   ..$ pop      : int [1:142] 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
#>   ..$ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...

# running function on every element of this list note that if you want the same
# value for a given argument across all elements of the list, you need to
# specify it just once
plot_list <- purrr::pmap(
  .l = list(
    data = year_list,
    x = "continent",
    y = "lifeExp",
    outlier.label = "country",
    outlier.label.color = list(
      "#56B4E9",
      "#009E73",
      "#F0E442",
      "#0072B2",
      "#D55E00",
      "#CC79A7"
    ),
    xlab = "Continent",
    ylab = "Life expectancy",
    title = list(
      "Year: 1957",
      "Year: 1967",
      "Year: 1977",
      "Year: 1987",
      "Year: 1997",
      "Year: 2007"
    ),
    type = list("r", "p", "np", "p", "np", "r"),
    effsize.type = list(
      "biased",
      "unbiased",
      "biased",
      "unbiased",
      "biased",
      "unbiased"
    ),
    plot.type = list("box", "boxviolin", "box", "boxviolin", "box", "violin"),
    messages = FALSE
  ),
  .f = ggstatsplot::ggbetweenstats
)
  
# combining all individual plots from the list into a single plot using combine_plots function
ggstatsplot::combine_plots(
  plotlist = plot_list,
  title.text = "Changes in life expectancy across continents (1957-2007)",
  title.color = "red",
  nrow = 6,
  ncol = 1,
  labels = c("(a)","(b)","(c)","(d)", "(e)", "(f)")
)

Within-subjects designs

Variant of this function ggwithinstats is currently under work. You can still use this function just to prepare the plot for exploratory data analysis, but the statistical details displayed in the subtitle will be incorrect. You can remove them by adding + ggplot2::labs(subtitle = NULL).

Suggestions

If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues