ggpiestats

2018-05-22

The function ggstatsplot::ggpiestats can be used to prepare publication-ready pie charts to summarize the statistical relationship between two categorical variables. We will see examples of how to use this function in this vignette.

To begin with, here are some instances where you would want to use ggpiestats-

• to see if the frequency distribution of two categorial variables are independent of each other using the contingency table analysis
• to check if the proportion of observations at each level of a categorial variable is equal

Note before: The following demo uses the pipe operator (%>%), so in case you are not familiar with this operator, here is a good explanation: http://r4ds.had.co.nz/pipes.html

Statistical independence of categorical variables with ggpiestats

To demonstrate how ggpiestats can be used to we will be using the Titanic dataset that is included in the datasets library. Titanic Passenger Survival Data Set provides information “on the fate of passengers on the fatal maiden voyage of the ocean liner Titanic, summarized according to economic status (class), sex, age, and survival.”

Let’s have a look at the structure of this table and also convert it into a tibble while we are at it.

library(datasets)
library(dplyr)

# looking at the table
dplyr::glimpse(x = Titanic)
#>  table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
#>  - attr(*, "dimnames")=List of 4
#>   ..$Class : chr [1:4] "1st" "2nd" "3rd" "Crew" #> ..$ Sex     : chr [1:2] "Male" "Female"
#>   ..$Age : chr [1:2] "Child" "Adult" #> ..$ Survived: chr [1:2] "No" "Yes"

# converting to tibble
tibble::as_data_frame(x = Titanic)
#> # A tibble: 32 x 5
#>    Class Sex    Age   Survived     n
#>    <chr> <chr>  <chr> <chr>    <dbl>
#>  1 1st   Male   Child No           0
#>  2 2nd   Male   Child No           0
#>  3 3rd   Male   Child No          35
#>  4 Crew  Male   Child No           0
#>  5 1st   Female Child No           0
#>  6 2nd   Female Child No           0
#>  7 3rd   Female Child No          17
#>  8 Crew  Female Child No           0
#>  9 1st   Male   Adult No         118
#> 10 2nd   Male   Adult No         154
#> # ... with 22 more rows

Note that the last column in this dataframe contains count information, which means we will have to modify it to reflect this count structure.


# a custom function to repeat dataframe rep number of times, which is going to
# be count data for us
rep_df <- function(df, rep) {
df[rep(1:nrow(df), rep), ]
}

# converting dataframe to full length based on count information
Titanic_full <- tibble::as_data_frame(datasets::Titanic) %>%
tibble::rowid_to_column(df = ., var = "id") %>%
dplyr::mutate_at(
.tbl = .,
.vars = dplyr::vars("id"),
.funs = ~ as.factor(.)
) %>%
base::split(x = ., f = .$id) %>% purrr::map_dfr(.x = ., .f = ~ rep_df(df = ., rep = .$n)) %>%
dplyr::mutate_at(
.tbl = .,
.vars = dplyr::vars("id"),
.funs = ~ as.numeric(as.character(.))
) %>%
dplyr::mutate_if(
.tbl = .,
.predicate = is.character,
.funs = ~ base::as.factor(.)
) %>%
dplyr::mutate_if(
.tbl = .,
.predicate = is.factor,
.funs = ~ base::droplevels(.)
) %>%
dplyr::arrange(.data = ., id)

# reordering the Class variables
Titanic_full$Class <- base::factor(x = Titanic_full$Class,
levels = c("1st", "2nd", "3rd", "Crew", ordered = TRUE))

# looking at the final dataset
dplyr::glimpse(Titanic_full)
#> Observations: 2,201
#> Variables: 6
#> $id <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,... #>$ Class    <fct> 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd...
#> $Sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male,... #>$ Age      <fct> Child, Child, Child, Child, Child, Child, Child, Chil...
#> $Survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N... #>$ n        <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 3...

Notice that our final dataset contains information about over 2000 passangers, while the original dataset with counts information had only 32 rows. Now we can start asking interesting question from this dataset.

First, let’s see if the proportion of people who survived was different between sexes using ggpiestats.

ggstatsplot::ggpiestats(data = Titanic_full,
condition = Sex,
main = Survived) 

A number of arguments can be modified to change the appearance of this plot:

library(ggstatsplot)

ggstatsplot::ggpiestats(
data = Titanic_full,                          # dataframe
main = Survived,                              # rows in the contingency table
condition = Sex,                              # columns in the contingecy table
title = "Passengar survival by gender",       # title for the entire plot
caption = "Source: Titanic survival dataset", # caption for the entire plot
legend.title = "Survived?",                   # legend title
facet.wrap.name = "Gender",                   # changing the facet wrap title
facet.proptest = TRUE,                        # proportion test for each facet
stat.title = "survival x gender"              # title for statistical test
) +                                             # further modification outside of ggstatsplot
ggplot2::scale_fill_brewer(palette = "Dark2")

As seen from this plot, the Pearson’s chi-square test of independence shows that the distribution of survival was different across males and females. Additionally, among both males and females, the proportion of survival was not equally likely (at 50%, i.e.), as shown by significant results (***) from one-sample proportion tests for each facet.

In case the condition argument is not specified, instead of chi-square test of independence, a proportion test will be carried out. For example, let’s see if there were equal proportions of different age groups.

library(ggstatsplot)

ggstatsplot::ggpiestats(
data = Titanic_full,
main = Age
) +
ggplot2::scale_fill_brewer(palette = "Set2")

As this plot shows there were overwhelmingly more number of adults than children on the boat and the proportion test attests to this.

Grouped analysis with grouped_ggpiestats

What if we want to do the same analysis separately for the four different Class on the Titanic (1st, 2nd, 3rd, Crew), i.e. checking how the survival-by-gender interaction changes by the passenger class in which the people were traveling? In that case, we will have to either write a for loop or use purrr, both of which are time consuming and can be a bit of a struggle.

ggstatsplot provides a special helper function for such instances: grouped_ggpiestats. This is merely a wrapper function around ggstatsplot::combine_plots. It applies ggpiestats across all levels of a specified grouping variable and then combines list of individual plots into a single plot. Note that the grouping variable can be anything: conditions in a given study, groups in a study sample, different studies, etc.

library(ggstatsplot)

ggstatsplot::grouped_ggpiestats(
# arguments relevant for ggstatsplot::gghistostats
data = Titanic_full,
grouping.var = Class,
title.prefix = "Passenger class",
stat.title = "survival x gender",
main = Survived,
condition = Sex,
# arguments relevant for ggstatsplot::combine_plots
title.text = "Survival in Titanic disaster by gender for all passenger classes",
caption.text = "Asterisks denote results from proportion tests; ***: p < 0.001, ns: non-significant",
nrow = 4,
ncol = 1,
labels = c("(a)","(b)","(c)", "(d)")
)

As seen from this quick exploratory analysis, across all passenger classes, the proportion of survived to non-survived individuals differed across genders: Men were more likely to perish than survive, whereas women were more likely to survive than perish. The only exception was the 3rd Class passangers where women were as likely to survive as to perish.

This will work even if the condition argument is not specified:

library(ggstatsplot)

ggstatsplot::grouped_ggpiestats(
data = Titanic_full,
main = Age,
grouping.var = Class
) 

Grouped analysis with ggpiestats + purrr

Although this grouping function provides a quick way to explore the data, it leaves much to be desired. For example, the color palette can’t be further modified, the legend color combination differs across different levels of the grouping variable based on the frequencies, etc. For cases like these, it would be better to use a function like purrr::pmap.

Note before: Unlike the function call so far, while using purrr::pmap, we will need to quote the arguments.

# let's split the dataframe and create a list by passenger class
class_list <- Titanic_full %>%
base::split(x = ., f = .$Class, drop = TRUE) # this created a list with 4 elements, one for each class str(class_list) #> List of 4 #>$ 1st :Classes 'tbl_df', 'tbl' and 'data.frame':    325 obs. of  6 variables:
#>   ..$id : num [1:325] 9 9 9 9 9 9 9 9 9 9 ... #> ..$ Class   : Factor w/ 5 levels "1st","2nd","3rd",..: 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ... #> ..$ Age     : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ... #> ..$ n       : num [1:325] 118 118 118 118 118 118 118 118 118 118 ...
#>  $2nd :Classes 'tbl_df', 'tbl' and 'data.frame': 285 obs. of 6 variables: #> ..$ id      : num [1:285] 10 10 10 10 10 10 10 10 10 10 ...
#>   ..$Class : Factor w/ 5 levels "1st","2nd","3rd",..: 2 2 2 2 2 2 2 2 2 2 ... #> ..$ Sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
#>   ..$Age : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 1 1 1 ... #> ..$ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$n : num [1:285] 154 154 154 154 154 154 154 154 154 154 ... #>$ 3rd :Classes 'tbl_df', 'tbl' and 'data.frame':    706 obs. of  6 variables:
#>   ..$id : num [1:706] 3 3 3 3 3 3 3 3 3 3 ... #> ..$ Class   : Factor w/ 5 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 3 3 3 3 ...
#>   ..$Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ... #> ..$ Age     : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 2 2 2 2 2 ...
#>   ..$Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ... #> ..$ n       : num [1:706] 35 35 35 35 35 35 35 35 35 35 ...
#>  $Crew:Classes 'tbl_df', 'tbl' and 'data.frame': 885 obs. of 6 variables: #> ..$ id      : num [1:885] 12 12 12 12 12 12 12 12 12 12 ...
#>   ..$Class : Factor w/ 5 levels "1st","2nd","3rd",..: 4 4 4 4 4 4 4 4 4 4 ... #> ..$ Sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
#>   ..$Age : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 1 1 1 ... #> ..$ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$n : num [1:885] 670 670 670 670 670 670 670 670 670 670 ... # running function on every element of this list note that if you want the same # value for a given argument across all elements of the list, you need to # specify it just once plot_list <- purrr::pmap( .l = list( data = class_list, main = "Survived", condition = "Sex", facet.wrap.name = "Gender", title = list( "Passenger class: 1st", "Passenger class: 2nd", "Passenger class: 3rd", "Passenger class: Crew" ), caption = list( "Total: 319, Died: 120, Survived: 199, % Survived: 62%", "Total: 272, Died: 155, Survived: 117, % Survived: 43%", "Total: 709, Died: 537, Survived: 172, % Survived: 25%", "Not available" ), messages = FALSE ), .f = ggstatsplot::ggpiestats ) # combining all individual plots from the list into a single plot using combine_plots function ggstatsplot::combine_plots( plot_list$1st + ggplot2::scale_fill_brewer(palette = "Dark2"),
plot_list$2nd + ggplot2::scale_fill_brewer(palette = "Dark2"), plot_list$3rd + ggplot2::scale_fill_manual(values = c("#D95F02", "#1B9E77")), # to be consistent with other legends
plot_list\$Crew + ggplot2::scale_fill_brewer(palette = "Dark2"),
title.text = "Survival in Titanic disaster by gender for all passenger classes",
caption.text = "Asterisks denote results from proportion tests; ***: p < 0.001, ns: non-significant",
nrow = 4,
ncol = 1,
labels = c("(a)","(b)","(c)", "(d)")
)

As can be appreciated from this example, although grouped_ggpiestats provides a quick way to explore data, purrr::pmap lets us utilize the full functionality of this function and ggplot2.

Within-subjects designs

Variant of this function for within-subjects designs, which will display results from McNemar test, is currently under work. You can still use this function just to prepare the plot for exploratory data analysis, but the statistical details displayed in the subtitle will be incorrect. You can remove them by adding + ggplot2::labs(subtitle = NULL).

Suggestions

If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues