The function ggstatsplot::ggpiestats
can be used for quick data exploration and/or to prepare publication-ready pie charts to summarize the statistical relationship between two categorical variables. We will see examples of how to use this function in this vignette.
To begin with, here are some instances where you would want to use ggpiestats
-
Note before: The following demo uses the pipe operator (%>%
), so in case you are not familiar with this operator, here is a good explanation: http://r4ds.had.co.nz/pipes.html
ggpiestats
To demonstrate how ggpiestats
can be used to we will be using the Titanic
dataset that is included in the datasets
library. Titanic Passenger Survival Data Set provides information “on the fate of passengers on the fatal maiden voyage of the ocean liner Titanic, summarized according to economic status (class), sex, age, and survival.”
Let’s have a look at the structure of this table and also convert it into a tibble while we are at it.
library(datasets)
library(dplyr)
# looking at the table
dplyr::glimpse(x = Titanic)
#> 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
#> - attr(*, "dimnames")=List of 4
#> ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
#> ..$ Sex : chr [1:2] "Male" "Female"
#> ..$ Age : chr [1:2] "Child" "Adult"
#> ..$ Survived: chr [1:2] "No" "Yes"
Note that the last column in this dataframe contains count information, which means we will have to modify it to reflect this count structure. This has already been carried out and the final dataset is included in the ggstatsplot
package in Titanic_full
. This is not necessary as ggpiestats
can handle table structures as well (see examples below).
Let’s have a look at this dataset.
library(ggstatsplot)
# looking at the final dataset
dplyr::glimpse(ggstatsplot::Titanic_full)
#> Observations: 2,201
#> Variables: 5
#> $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
#> $ Class <fct> 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd...
#> $ Sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male,...
#> $ Age <fct> Child, Child, Child, Child, Child, Child, Child, Chil...
#> $ Survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N...
First, let’s see if the proportion of people who survived was different between sexes using ggpiestats
.
# since effect size confidence intervals are computed using bootstrapping, let's
# set seed for reproducibility
set.seed(123)
ggstatsplot::ggpiestats(data = ggstatsplot::Titanic_full,
condition = Sex,
main = Survived)
A number of arguments can be modified to change the appearance of this plot:
library(ggplot2)
# for reproducibility
set.seed(123)
ggstatsplot::ggpiestats(
data = ggstatsplot::Titanic_full, # dataframe (matrix or table will not be accepted)
main = Survived, # rows in the contingency table
condition = Sex, # columns in the contingecy table
title = "Passengar survival by gender", # title for the entire plot
caption = "Source: Titanic survival dataset", # caption for the entire plot
legend.title = "Survived?", # legend title
facet.wrap.name = "Gender", # changing the facet wrap title
facet.proptest = TRUE, # proportion test for each facet
stat.title = "survival x gender: ", # title for statistical test
palette = "Set1", # changing the color palette
ggtheme = ggplot2::theme_classic() # changing plot theme
) + # further modification with ggplot2 commands
ggplot2::theme(plot.subtitle = ggplot2::element_text(
color = "black",
size = 10,
face = "bold",
hjust = 0.5
))
As seen from this plot, the Pearson’s chi-square test of independence shows that the distribution of survival was different across males and females. Additionally, among both males and females, the proportion of survival was not equally likely (at 50%, i.e.), as shown by significant results (***
) from one-sample proportion tests for each facet.
In case the condition
argument is not specified, instead of chi-square test of independence, a proportion test will be carried out. For example, let’s see if there were equal proportions of different age groups.
As this plot shows there were overwhelmingly more number of adults than children on the boat and the proportion test attests to this.
grouped_ggpiestats
What if we want to do the same analysis separately for the four different Class on the Titanic (1st, 2nd, 3rd, Crew), i.e. checking how the survival-by-gender interaction changes by the passenger class in which the people were traveling? In that case, we will have to either write a for
loop or use purrr
, both of which are time consuming and can be a bit of a struggle.
ggstatsplot
provides a special helper function for such instances: grouped_ggpiestats
. This is merely a wrapper function around ggstatsplot::combine_plots
. It applies ggpiestats
across all levels of a specified grouping variable and then combines list of individual plots into a single plot. Note that the grouping variable can be anything: conditions in a given study, groups in a study sample, different studies, etc.
ggstatsplot::grouped_ggpiestats(
# arguments relevant for ggstatsplot::gghistostats
data = ggstatsplot::Titanic_full,
grouping.var = Class,
title.prefix = "Passenger class",
stat.title = "survival x gender: ",
main = Survived,
condition = Sex,
# arguments relevant for ggstatsplot::combine_plots
title.text = "Survival in Titanic disaster by gender for all passenger classes",
caption.text = "Asterisks denote results from proportion tests; ***: p < 0.001, ns: non-significant",
nrow = 4,
ncol = 1,
labels = c("(a)","(b)","(c)", "(d)")
)
As seen from this quick exploratory analysis, across all passenger classes, the proportion of survived to non-survived individuals differed across genders: Men were more likely to perish than survive, whereas women were more likely to survive than perish. The only exception was the 3rd Class passengers where women were as likely to survive as to perish.
This will work even if the condition argument is not specified:
ggstatsplot::grouped_ggpiestats(
data = ggstatsplot::Titanic_full,
main = Age,
grouping.var = Class,
title.prefix = "Passenger Class"
)
#> Warning: Proportion test will not be run because it requires Age to have at least 2 levels with non-zero frequencies.
ggpiestats
+ purrr
Although this grouping function provides a quick way to explore the data, it leaves much to be desired. For example, we may want to add different captions, titles, themes, or palettes for each level of the grouping variable, etc. For cases like these, it would be better to use (e.g.).
Note before: Unlike the function call so far, while using purrr::pmap
, we will need to quote the arguments.
# let's split the dataframe and create a list by passenger class
class_list <- ggstatsplot::Titanic_full %>%
base::split(x = ., f = .$Class, drop = TRUE)
# this created a list with 4 elements, one for each class
# you can check the structure of the file for yourself
# str(class_list)
# checking the length and names of each element
length(class_list)
#> [1] 4
names(class_list)
#> [1] "1st" "2nd" "3rd" "Crew"
# running function on every element of this list note that if you want the same
# value for a given argument across all elements of the list, you need to
# specify it just once
plot_list <- purrr::pmap(
.l = list(
data = class_list,
main = "Survived",
condition = "Sex",
facet.wrap.name = "Gender",
title = list(
"Passenger class: 1st",
"Passenger class: 2nd",
"Passenger class: 3rd",
"Passenger class: Crew"
),
caption = list(
"Total: 319, Died: 120, Survived: 199, % Survived: 62%",
"Total: 272, Died: 155, Survived: 117, % Survived: 43%",
"Total: 709, Died: 537, Survived: 172, % Survived: 25%",
"Not available"
),
palette = list("Accent", "Paired", "Pastel1", "Set2"),
ggtheme = list(
ggplot2::theme_grey(),
ggplot2::theme_classic(),
ggplot2::theme_light(),
ggplot2::theme_minimal()
),
sample.size.label = list(TRUE, FALSE, TRUE, FALSE),
messages = FALSE
),
.f = ggstatsplot::ggpiestats
)
# combining all individual plots from the list into a single plot using combine_plots function
ggstatsplot::combine_plots(
plotlist = plot_list,
title.text = "Survival in Titanic disaster by gender for all passenger classes",
caption.text = "Asterisks denote results from proportion tests; ***: p < 0.001, ns: non-significant",
nrow = 4,
ncol = 1,
labels = c("(a)","(b)","(c)", "(d)")
)
As can be appreciated from this example, although grouped_ggpiestats
provides a quick way to explore data, purrr::pmap
lets us utilize the full functionality of this function and ggplot2
.
counts
dataggpiestats
can also work with dataframe containing counts (aka tabled data), i.e., when each row doesn’t correspond to a unique observation. For example, consider the following fishing
dataframe containing data from two boats (A
and B
) about the number of different types fish they caught in the months of February
and March
. In this dataframe, each row doesn’t equal a unique observation. In such cases, we can use counts
argument. Let’s say we want to investigate if the frequency of different types of fish caught differs across the two months:
# for reproducibility
set.seed(123)
# creating a dataframe
# (this is completely fictional; I don't know first thing about fishing!)
(
fishing <- data.frame(
Boat = c(rep("B", 4), rep("A", 4), rep("A", 4), rep("B", 4)),
Month = c(rep("February", 2), rep("March", 2), rep("February", 2), rep("March", 2)),
Fish = c(
"Bass",
"Catfish",
"Cod",
"Haddock",
"Cod",
"Haddock",
"Bass",
"Catfish",
"Bass",
"Catfish",
"Cod",
"Haddock",
"Cod",
"Haddock",
"Bass",
"Catfish"
),
SumOfCaught = c(25, 20, 35, 40, 40, 25, 30, 42, 40, 30, 33, 26, 100, 30, 20, 20)
) %>% # converting to a tibble dataframe
tibble::as_data_frame(x = .)
)
#> # A tibble: 16 x 4
#> Boat Month Fish SumOfCaught
#> <fct> <fct> <fct> <dbl>
#> 1 B February Bass 25
#> 2 B February Catfish 20
#> 3 B March Cod 35
#> 4 B March Haddock 40
#> 5 A February Cod 40
#> 6 A February Haddock 25
#> 7 A March Bass 30
#> 8 A March Catfish 42
#> 9 A February Bass 40
#> 10 A February Catfish 30
#> 11 A March Cod 33
#> 12 A March Haddock 26
#> 13 B February Cod 100
#> 14 B February Haddock 30
#> 15 B March Bass 20
#> 16 B March Catfish 20
# running `ggpiestats` with counts information
ggstatsplot::ggpiestats(
data = fishing,
main = Fish,
condition = Month,
counts = SumOfCaught
)
We just verified that the frequency of different types of fish caught differs across the two months, but what if we further want to know if this difference is present for both boats (since they are fishing in different parts of the sea)? For this, we can again utilize grouped_ggpiestats
:
# running the grouped variant of the function
ggstatsplot::grouped_ggpiestats(
data = fishing,
main = Fish,
condition = Month,
counts = SumOfCaught,
grouping.var = Boat,
title.prefix = "Boat",
nrow = 2
)
As seen from these charts, this difference is found only for the location in which the boat B
has been fishing. Additionally, faceted proportion tests also reveal that all fish are not equally likely to be caught at this location.
In case of within-subjects designs, you can set paired = TRUE
, which will display results from McNemar test in the subtitle.
# seed for reproducibility
set.seed(123)
# data
clinical_trial <-
tibble::tribble(
~Control, ~Case, ~pairs,
"No", "Yes", 25,
"Yes", "No", 4,
"Yes", "Yes", 13,
"No", "No", 92
)
# plot
ggstatsplot::ggpiestats(data = clinical_trial,
condition = Control,
main = Case,
counts = pairs,
paired = TRUE,
stat.title = "McNemar test: ",
title = "Results from case-control study",
palette = "Accent")
If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues