Alluvial Plots in ggplot2

Jason Cory Brunson


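The ggalluvial package is a ggplot2 extension for producing alluvial plots in a tidyverse framework. The design and functionality were originally inspired by the alluvial package and have benefitted from the feedback of many users. This vignette defines the essential components of alluvial plots (axis, alluvium, stratum, lode, flow), describes the alluvial data formats recognized by ggalluvial, and illustrates the stats and geoms that plot them.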

Unlike most alluvial and related diagrams, the plots produced by ggalluvial are uniquely determined by the data set and statistical transformation. The distinction is detailed in this blog post.

Many other resources exist for visualizing categorical data in R, including several more basic plot types that are likely to more accurately convey proportions to viewers when the data are not so structured as to warrant an alluvial plot. In particular, check out Michael Friendly’s vcd and vcdExtra packages (PDF) for a variety of statistically-motivated categorical data visualization techniques, Hadley Wickham’s productplots package and Haley Jeppson and Heike Hofmann’s descendant ggmosaic package for product or mosaic plots, and Nicholas Hamilton’s ggtern package for ternary coordinates. Other related packages are mentioned below.

Alluvial plots

Here’s a quintessential alluvial plot:
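The figure is not reproduced in this text. A plot of this kind can be sketched as follows. This is a minimal example rather than necessarily the exact code behind the original image: the choice of the built-in Titanic dataset, the three axes, the fill variable, and the object names `titanic_wide` and `p_titanic` are assumptions made for illustration.

```r
library(ggplot2)
library(ggalluvial)

# Assumption: use the built-in Titanic table, converted to one row per cohort
# with the cohort size in the Freq column.
titanic_wide <- as.data.frame(Titanic)

p_titanic <- ggplot(titanic_wide,
                    aes(axis1 = Class, axis2 = Sex, axis3 = Age, y = Freq)) +
  # label the three axis positions on the (implicitly categorical) x axis
  scale_x_discrete(limits = c("Class", "Sex", "Age"), expand = c(.2, .05)) +
  xlab("Demographic") +
  # one ribbon per cohort, colored by survival
  geom_alluvium(aes(fill = Survived)) +
  # one block per category at each axis, labeled with the category name
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal()

p_titanic
```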

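The next section details how the elements of this image encode information about the underlying dataset. For now, we use the image as a point of reference to define the elements of a typical alluvial plot. An axis is a dimension (variable) along which the data are vertically grouped at a fixed horizontal position. The groups at each axis are depicted as opaque blocks called strata. Horizontal splines called alluvia span the width of the plot, each tracking one cohort across the axes. The segments of the alluvia between pairs of adjacent axes are flows, and the alluvia intersect the strata at lodes.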

As the examples in the next section will demonstrate, which of these elements are incorporated into an alluvial plot depends on both how the underlying data are structured and what the creator wants the plot to communicate.

Alluvial data

ggalluvial recognizes two formats of “alluvial data”, treated in detail in the following subsections, which correspond to the “wide” and “long” formats of categorical repeated measures data. A third, tabular (or array), form is popular for storing data with multiple categorical dimensions, such as the Titanic and UCBAdmissions datasets.[1] For consistency with tidy data principles and ggplot2 conventions, ggalluvial does not accept tabular input; base::as.data.frame() converts such an array to an acceptable data frame.

Alluvia (wide) format

The wide format reflects the visual arrangement of an alluvial plot, but “untwisted”: Each row corresponds to a cohort of observations that take a specific value at each variable, and each variable has its own column. An additional column contains the quantity of each row, e.g. the number of observational units in the cohort, which may be used to control the heights of the strata.[2] In short, the wide format consists of one row per alluvium. This is the format into which the base function as.data.frame() transforms a frequency table, for instance the 3-dimensional UCBAdmissions dataset:

##       Admit Gender Dept Freq
## 1  Admitted   Male    A  512
## 2  Rejected   Male    A  313
## 3  Admitted Female    A   89
## 4  Rejected Female    A   19
## 5  Admitted   Male    B  353
## 6  Rejected   Male    B  207
## 7  Admitted Female    B   17
## 8  Rejected Female    B    8
## 9  Admitted   Male    C  120
## 10 Rejected   Male    C  205
## 11 Admitted Female    C  202
## 12 Rejected Female    C  391
## [1] TRUE
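The output above can be reproduced with a short sketch like the one below. The object name `ucb_wide` is an assumption for illustration. The final `TRUE` presumably comes from `is_alluvia_form()`, the ggalluvial helper that checks whether a data frame is in alluvia (wide) format.

```r
library(ggalluvial)

# Convert the 3-dimensional frequency table to one row per cohort.
ucb_wide <- as.data.frame(UCBAdmissions)

# Show the first twelve cohorts.
head(ucb_wide, n = 12)

# Check that the first three columns can serve as axes of an alluvial plot.
is_alluvia_form(ucb_wide, axes = 1:3, silent = TRUE)
```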

This format is inherited from the first version of ggalluvial, which modeled it after usage in alluvial: The user declares any number of axis variables, which stat_alluvium() and stat_stratum() recognize and process in a consistent way:
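As a hedged illustration of this usage, the sketch below declares two axis variables for the UCBAdmissions data and colors the alluvia by admission outcome. The particular axes, widths, and color palette are assumptions chosen for the example, and `p_ucb` is a hypothetical name.

```r
library(ggplot2)
library(ggalluvial)

p_ucb <- ggplot(as.data.frame(UCBAdmissions),
                aes(y = Freq, axis1 = Gender, axis2 = Dept)) +
  # alluvia are processed by stat_alluvium(), strata by stat_stratum()
  geom_alluvium(aes(fill = Admit), width = 1/12) +
  geom_stratum(width = 1/12, fill = "black", color = "grey") +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  # name the axis positions, since x is implicit in the axis* aesthetics
  scale_x_discrete(limits = c("Gender", "Dept"), expand = c(.05, .05)) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  ggtitle("UC Berkeley admissions and rejections, by sex and department")

p_ucb
```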

An important feature of these plots is the meaningfulness of the vertical axis: No gaps are inserted between the strata, so the total height of the plot reflects the cumulative quantity of the observations. The plots produced by ggalluvial conform (somewhat; keep reading) to the “grammar of graphics” principles of ggplot2, and this prevents users from producing “free-floating” visualizations like the Sankey diagrams showcased here.[3] ggalluvial parameters and existing ggplot2 functionality can also produce parallel sets plots, illustrated here using the Titanic dataset:[4]
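A parallel sets plot of this kind might be built as in the sketch below. The specific choices (straight, zero-width flow ends via `width = 0` and `knot.pos = 0`, narrow strata, flipped coordinates, and the order of the axes) are assumptions made for illustration, as is the name `p_parallel`.

```r
library(ggplot2)
library(ggalluvial)

p_parallel <- ggplot(as.data.frame(Titanic),
                     aes(y = Freq,
                         axis1 = Survived, axis2 = Sex, axis3 = Class)) +
  # straight-edged ribbons that meet the axes at a point
  geom_alluvium(aes(fill = Class),
                width = 0, knot.pos = 0, reverse = FALSE) +
  guides(fill = "none") +
  # thin strata drawn in the same (non-reversed) order as the alluvia
  geom_stratum(width = 1/8, reverse = FALSE) +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)),
            reverse = FALSE) +
  scale_x_continuous(breaks = 1:3, labels = c("Survived", "Sex", "Class")) +
  # lay the axes out top to bottom rather than left to right
  coord_flip() +
  ggtitle("Titanic survival by class and sex")

p_parallel
```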

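This format and functionality are useful for many applications and will be retained in future versions. They also involve some conspicuous deviations from ggplot2 norms. The axis[0-9]* position aesthetics are non-standard: they are not an explicit set of parameters but a family matched by a regular expression pattern, and at least one, but no specific one, is required. stat_alluvium() ignores any argument to the group aesthetic, which it uses internally to link the rows that correspond to the same alluvium. And the horizontal axis must be corrected manually (using scale_x_discrete() or scale_x_continuous()) to reflect the implicit categorical variable identifying the axis.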

Furthermore, format aesthetics like fill are necessarily fixed for each alluvium; they cannot, for example, change from axis to axis according to the value taken at each. This means that, although they can reproduce the branching-tree structure of parallel sets, this format and functionality cannot produce alluvial plots with the color schemes featured here (“Alluvial diagram”) and here (“Controlling colors”), which are “reset” at each axis.

Lodes (long) format

The long format recognized by ggalluvial contains one row per lode, and can be understood as the result of “gathering” (in the tidyr sense) or “pivoting” (in the Microsoft Excel sense) the axis columns of a dataset in the alluvia format into a key-value pair of columns encoding the axis as the key and the stratum as the value. This format requires an additional indexing column that links the rows corresponding to a common cohort, i.e. the lodes of a single alluvium:

##    Freq Cohort     x  stratum
## 1   512      1 Admit Admitted
## 2   313      2 Admit Rejected
## 3    89      3 Admit Admitted
## 4    19      4 Admit Rejected
## 5   353      5 Admit Admitted
## 6   207      6 Admit Rejected
## 7    17      7 Admit Admitted
## 8     8      8 Admit Rejected
## 9   120      9 Admit Admitted
## 10  205     10 Admit Rejected
## 11  202     11 Admit Admitted
## 12  391     12 Admit Rejected
## [1] TRUE
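The output above is consistent with a conversion along the following lines. The sketch assumes the default key and value column names of `to_lodes_form()` (`x` and `stratum`), and `ucb_lodes` is a name chosen for illustration.

```r
library(ggalluvial)

# Gather the three axis columns into key (x) and value (stratum) columns,
# and record which cohort each lode belongs to in a Cohort column.
ucb_lodes <- to_lodes_form(as.data.frame(UCBAdmissions),
                           axes = 1:3,
                           id = "Cohort")

# Show the first twelve lodes.
head(ucb_lodes, n = 12)

# Check that the result is a valid lodes (long) format.
is_lodes_form(ucb_lodes, key = x, value = stratum, id = Cohort, silent = TRUE)
```

Each of the 24 cohorts contributes one lode per axis, so the long data frame has three times as many rows as the wide one.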

The functions that convert data between wide (alluvia) and long (lodes) format include several parameters that help preserve ancillary information. See help("alluvial-data") for examples.

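The same stat and geom can receive data in this format using a different set of positional aesthetics, also specific to ggalluvial: x, the “key” variable indicating the axis to which the row corresponds; stratum, the “value” taken by the axis variable indicated by x; and alluvium, the index that links the rows of a single alluvium.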
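As a minimal sketch of these aesthetics in use, the example below plots the UCBAdmissions data after converting it to lodes format. The dataset choice and the names `ucb_lodes` and `p_lodes` are assumptions for illustration.

```r
library(ggplot2)
library(ggalluvial)

# Long format: one row per lode, with the default x and stratum columns
# and a Cohort column identifying the alluvium.
ucb_lodes <- to_lodes_form(as.data.frame(UCBAdmissions),
                           axes = 1:3,
                           id = "Cohort")

p_lodes <- ggplot(ucb_lodes,
                  aes(x = x, stratum = stratum, alluvium = Cohort,
                      y = Freq)) +
  # the same layers as in the wide format, now driven by x/stratum/alluvium
  geom_alluvium() +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)))

p_lodes
```

Because x is an ordinary variable here, the horizontal axis is labeled without any manual correction.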

Heights can vary from axis to axis, allowing users to produce bump charts like those showcased here.[5] In these cases, the strata contain no more information than the alluvia and are often not plotted. For convenience, both stat_alluvium() and stat_flow() will accept arguments for x and alluvium even if none is given for stratum.[6] As an example, we can group countries in the Refugees dataset by region, in order to compare refugee volumes at different scales:
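One way to sketch such a plot is shown below. It assumes that the Refugees dataset is loaded from the alluvial package with columns `country`, `year`, and `refugees`. The country-to-region lookup `country_regions` is an illustrative assumption, not part of the dataset, and the faceting and palette choices are likewise assumptions.

```r
library(ggplot2)
library(ggalluvial)

# Assumption: Refugees ships with the alluvial package.
data(Refugees, package = "alluvial")

# Hypothetical lookup table assigning each country to a region.
country_regions <- c(
  Afghanistan = "Middle East",
  Burundi = "Central Africa",
  `Congo DRC` = "Central Africa",
  Iraq = "Middle East",
  Myanmar = "Southeast Asia",
  Palestine = "Middle East",
  Somalia = "Horn of Africa",
  Sudan = "Central Africa",
  Syria = "Middle East",
  Vietnam = "Southeast Asia"
)
# Look regions up by country name (not by factor code).
Refugees$region <- factor(country_regions[as.character(Refugees$country)])

p_refugees <- ggplot(Refugees,
                     aes(x = year, y = refugees, alluvium = country)) +
  # no stratum aesthetic: each country is its own alluvium, stacked by volume
  geom_alluvium(aes(fill = country, colour = country),
                alpha = .75, decreasing = FALSE) +
  scale_x_continuous(breaks = seq(2003, 2013, 2)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -30, hjust = 0)) +
  scale_fill_brewer(type = "qual", palette = "Set3") +
  scale_color_brewer(type = "qual", palette = "Set3") +
  # one panel per region, on a shared vertical scale
  facet_wrap(~ region, scales = "fixed") +
  ggtitle("refugee volume by country and region of origin")

p_refugees
```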

The format allows us to assign aesthetics that change from axis to axis along the same alluvium, which is useful for repeated measures datasets. This requires generating a separate graphical object for each flow, as implemented in geom_flow(). The plot below uses a set of (changes to) students’ academic curricula over the course of several semesters. Since geom_flow() calls stat_flow() by default (see the next example), we override it with stat_alluvium() in order to track each student across all semesters:
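A sketch of such a plot follows, assuming the `majors` dataset bundled with ggalluvial (columns `student`, `semester`, and `curriculum`). The lode ordering option, colors, and the name `p_majors` are illustrative choices.

```r
library(ggplot2)
library(ggalluvial)

data(majors)
# Treat the curriculum as a factor so that strata are ordered by level.
majors$curriculum <- as.factor(majors$curriculum)

p_majors <- ggplot(majors,
                   aes(x = semester, stratum = curriculum, alluvium = student,
                       fill = curriculum, label = curriculum)) +
  scale_fill_brewer(type = "qual", palette = "Set2") +
  # geom_flow() draws one object per flow, so fill can change at each axis;
  # stat = "alluvium" keeps each student tracked across all semesters
  geom_flow(stat = "alluvium", lode.guidance = "frontback",
            color = "darkgray") +
  geom_stratum() +
  theme(legend.position = "bottom") +
  ggtitle("student curricula across several semesters")

p_majors
```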

The stratum heights y are unspecified, so each row is given unit height. This example demonstrates one way ggalluvial handles missing data. The alternative is to set the parameter na.rm to TRUE.[7] Missing data handling (specifically, the order of the strata) also depends on whether the stratum variable is character or factor/numeric.

Finally, lode format gives us the option to aggregate the flows between adjacent axes, which may be appropriate when the transitions between adjacent axes are of primary importance. We can demonstrate this option on data from the influenza vaccination surveys conducted by the RAND American Life Panel:
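The aggregated version can be sketched as below, assuming the `vaccinations` dataset bundled with ggalluvial (columns `freq`, `subject`, `survey`, and `response`). Reversing the response levels and the styling choices are assumptions made for illustration, as are the names `vacc` and `p_vacc`.

```r
library(ggplot2)
library(ggalluvial)

data(vaccinations)
# Work on a copy, reversing the order of the response levels for display.
vacc <- transform(vaccinations,
                  response = factor(response, rev(levels(response))))

p_vacc <- ggplot(vacc,
                 aes(x = survey, stratum = response, alluvium = subject,
                     y = freq,
                     fill = response, label = response)) +
  scale_x_discrete(expand = c(.1, .1)) +
  # geom_flow() uses stat_flow() by default, which aggregates all lodes
  # moving between the same pair of strata at adjacent axes
  geom_flow() +
  geom_stratum(alpha = .5) +
  geom_text(stat = "stratum", size = 3) +
  theme(legend.position = "none") +
  ggtitle("vaccination survey responses at three points in time")

p_vacc
```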

This plot ignores any continuity of the flows from one pair of adjacent axes to the next. This “memoryless” statistical transformation yields a less cluttered plot, in which at most one flow proceeds from each stratum at one axis to each stratum at the next, but at the cost of no longer being able to track each cohort across the entire plot.


## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.0 (2020-04-24)
##  os       macOS High Sierra 10.13.6   
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  C                           
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2020-08-30                  
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version    date       lib source                           
##  assertthat     0.2.1      2019-03-21 [3] CRAN (R 4.0.0)                   
##  cli            2.0.2      2020-02-28 [3] CRAN (R 4.0.0)                   
##  colorspace     1.4-1      2019-03-18 [3] CRAN (R 4.0.0)                   
##  crayon         1.3.4      2017-09-16 [3] CRAN (R 4.0.0)                   
##  digest         0.6.25     2020-02-23 [3] CRAN (R 4.0.0)                   
##  dplyr          1.0.2      2020-08-18 [3] CRAN (R 4.0.2)                   
##  ellipsis       0.3.1      2020-05-15 [3] CRAN (R 4.0.0)                   
##  evaluate       0.14       2019-05-28 [3] CRAN (R 4.0.0)                   
##  fansi          0.4.1      2020-01-08 [3] CRAN (R 4.0.0)                   
##  farver         2.0.3      2020-01-16 [3] CRAN (R 4.0.0)                   
##  generics       0.0.2      2018-11-29 [3] CRAN (R 4.0.0)                   
##  ggalluvial   * 0.12.2     2020-08-30 [1] local                            
##  ggplot2      * 3.3.2      2020-06-19 [3] CRAN (R 4.0.0)                   
##  glue           1.4.2      2020-08-27 [3] CRAN (R 4.0.0)                   
##  gtable         0.3.0      2019-03-25 [3] CRAN (R 4.0.0)                   
##  htmltools      0.5.0      2020-06-16 [3] CRAN (R 4.0.0)                   
##  knitr          1.29       2020-06-23 [3] CRAN (R 4.0.0)                   
##  labeling       0.3        2014-08-23 [3] CRAN (R 4.0.0)                   
##  lifecycle      0.2.0      2020-03-06 [3] CRAN (R 4.0.0)                   
##  magrittr       1.5        2014-11-22 [3] CRAN (R 4.0.0)                   
##  munsell        0.5.0      2018-06-12 [3] CRAN (R 4.0.0)                   
##  pillar         1.4.6      2020-07-10 [3] CRAN (R 4.0.0)                   
##  pkgconfig      2.0.3      2019-09-22 [3] CRAN (R 4.0.0)                   
##  purrr          0.3.4      2020-04-17 [3] CRAN (R 4.0.0)                   
##  R6             2.4.1      2019-11-12 [3] CRAN (R 4.0.0)                   
##  RColorBrewer   1.1-2      2014-12-07 [3] CRAN (R 4.0.0)                   
##  rlang          0.4.7      2020-07-09 [3] CRAN (R 4.0.2)                   
##  rmarkdown      2.3        2020-06-18 [3] CRAN (R 4.0.0)                   
##  scales         1.1.1      2020-05-11 [3] CRAN (R 4.0.0)                   
##  sessioninfo    1.1.1      2018-11-05 [3] CRAN (R 4.0.0)                   
##  stringi        1.4.6      2020-02-17 [3] CRAN (R 4.0.0)                   
##  stringr        1.4.0      2019-02-10 [3] CRAN (R 4.0.0)                   
##  tibble 2020-07-28 [3] Github (tidyverse/tibble@b4eec19)
##  tidyr          1.1.2      2020-08-27 [3] CRAN (R 4.0.0)                   
##  tidyselect     1.1.0      2020-05-11 [3] CRAN (R 4.0.0)                   
##  vctrs          0.3.4      2020-08-29 [3] CRAN (R 4.0.0)                   
##  withr          2.2.0      2020-04-20 [3] CRAN (R 4.0.0)                   
##  xfun           0.16       2020-07-24 [3] CRAN (R 4.0.2)                   
##  yaml           2.2.1      2020-02-01 [3] CRAN (R 4.0.0)                   
## [1] /private/var/folders/pg/fjg8r4fj5v33zqmwptf9mfg80000gn/T/RtmpQW9jfe/Rinst8b965390e4d
## [2] /private/var/folders/pg/fjg8r4fj5v33zqmwptf9mfg80000gn/T/Rtmp9RZVvQ/temp_libpath8a17612dde7
## [3] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

  1. See Friendly’s tutorial, linked above, for a discussion.

  2. Previously, quantities were passed to the weight aesthetic rather than to y. This prevented scale_y_continuous() from correctly transforming scales, and anyway it was inconsistent with the behavior of geom_bar(). As of version 0.12.0, weight is an optional parameter used only by computed variables intended for labeling, not by polygonal graphical elements.

  3. The ggforce package includes parallel set geom and stat layers to produce similar diagrams that can be allowed to free-float.

  4. A greater variety of parallel sets plots are implemented in the ggparallel and ggpcp packages.

  5. If bumping is unnecessary, consider using geom_area() instead.

  6. stat_stratum() will similarly accept arguments for x and stratum without alluvium. If both strata and either alluvia or flows are to be plotted, though, all three parameters need arguments.

  7. Be sure to set na.rm consistently in each layer, in this case both the flows and the strata.