# Graphical posterior predictive checks using the bayesplot package

## Introduction

This vignette focuses on graphical posterior predictive checks (PPC). Plots of parameter estimates from MCMC draws are covered in the separate vignette Plotting MCMC draws, and MCMC diagnostics are covered in the Visual MCMC diagnostics vignette.

### Graphical posterior predictive checks (PPCs)

The bayesplot package provides various plotting functions for graphical posterior predictive checking, that is, creating graphical displays comparing observed data to simulated data from the posterior predictive distribution.

The idea behind posterior predictive checking is simple: if a model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed. To generate the data used for posterior predictive checks (PPCs) we simulate from the posterior predictive distribution. This is the distribution of the outcome variable implied by a model after using the observed data $$y$$ (a vector of $$N$$ outcome values) to update our beliefs about unknown model parameters $$\theta$$. The posterior predictive distribution for observation $$\widetilde{y}$$ can be written as

$$p(\widetilde{y} \,|\, y) = \int p(\widetilde{y} \,|\, \theta) \, p(\theta \,|\, y) \, d\theta.$$

Typically we will also condition on $$X$$ (a matrix of predictor variables).

For each draw (simulation) $$s = 1, \ldots, S$$ of the parameters from the posterior distribution, $$\theta^{(s)} \sim p(\theta \,|\, y)$$, we draw an entire vector of $$N$$ outcomes $$\widetilde{y}^{(s)}$$ from the posterior predictive distribution by simulating from the data model conditional on parameters $$\theta^{(s)}$$. The result is an $$S \times N$$ matrix of draws $$\widetilde{y}$$.
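As a toy sketch of this simulation scheme (a hypothetical normal model with known scale, not one of the examples below), suppose we already have $$S$$ posterior draws of a scalar mean theta; each row of the resulting matrix is one replicated dataset:

```r
set.seed(123)
S <- 100  # number of posterior draws
N <- 50   # number of observations

# Stand-in for draws theta^(s) ~ p(theta | y); in practice these
# come from MCMC, not rnorm()
theta <- rnorm(S, mean = 5, sd = 0.1)

# S x N matrix: row s is a full replicated dataset drawn from the
# data model y_i ~ Normal(theta^(s), 1)
yrep <- t(sapply(theta, function(th) rnorm(N, mean = th, sd = 1)))
dim(yrep)  # S x N
```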

When simulating from the posterior predictive distribution we can use either the same values of the predictors $$X$$ that we used when fitting the model or new observations of those predictors. When we use the same values of $$X$$ we denote the resulting simulations by $$y^{rep}$$, as they can be thought of as replications of the outcome $$y$$ rather than predictions for future observations ($$\widetilde{y}$$ using predictors $$\widetilde{X}$$). This corresponds to the notation from Gelman et al. (2013) and is the notation used throughout the package documentation.

Using the replicated datasets drawn from the posterior predictive distribution, the functions in the bayesplot package create various graphical displays comparing the observed data $$y$$ to the replications. The names of the bayesplot plotting functions for posterior predictive checking all have the prefix ppc_.
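To see which PPC plots are available, bayesplot provides the `available_ppc()` helper, which lists all of the `ppc_` functions and can optionally filter them by a pattern:

```r
library("bayesplot")
available_ppc()                      # all PPC plotting functions
available_ppc(pattern = "_grouped")  # just the ones for grouped data
```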

### Setup

In addition to bayesplot we'll load two other packages:

* ggplot2, in case we want to customize the ggplot objects created by bayesplot
* rstanarm, for fitting the example models used throughout the vignette

```r
library("bayesplot")
library("ggplot2")
library("rstanarm")
```

### Example models

To demonstrate some of the various PPCs that can be created with the bayesplot package we'll use an example comparing Poisson and negative binomial regression models from one of the rstanarm package vignettes (Gabry and Goodrich, 2017).

We want to make inferences about the efficacy of a certain pest management system at reducing the number of roaches in urban apartments. […] The regression predictors for the model are the pre-treatment number of roaches roach1, the treatment indicator treatment, and a variable senior indicating whether the apartment is in a building restricted to elderly residents. Because the number of days for which the roach traps were used is not the same for all apartments in the sample, we include it as an exposure […].

First we fit a Poisson regression model with outcome variable y representing the roach count in each apartment at the end of the experiment.

```r
head(roaches)  # see help("rstanarm-datasets")
```

```
    y roach1 treatment senior exposure2
1 153 308.00         1      0  0.800000
2 127 331.25         1      0  0.600000
3   7   1.67         1      0  1.000000
4   7   3.00         1      0  1.000000
5   0   2.00         1      0  1.142857
6   0   0.00         1      0  1.000000
```

```r
roaches$roach100 <- roaches$roach1 / 100  # pre-treatment number of roaches (in 100s)
fit_poisson <- stan_glm(
  y ~ roach100 + treatment + senior,
  offset = log(exposure2),
  family = poisson(link = "log"),
  data = roaches,
  seed = 1111
)
print(fit_poisson)
```

```
stan_glm
 family:       poisson [log]
 formula:      y ~ roach100 + treatment + senior
 observations: 262
 predictors:   4
------
            Median MAD_SD
(Intercept)  3.1    0.0
roach100     0.7    0.0
treatment   -0.5    0.0
senior      -0.4    0.0

Sample avg. posterior predictive distribution of y:
         Median MAD_SD
mean_PPD 25.6    0.5

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```

We’ll also fit the negative binomial model that we’ll compare to the Poisson:

```r
fit_nb <- update(fit_poisson, family = "neg_binomial_2")
print(fit_nb)
```

```
stan_glm
 family:       neg_binomial_2 [log]
 formula:      y ~ roach100 + treatment + senior
 observations: 262
 predictors:   4
------
            Median MAD_SD
(Intercept)  2.8    0.2
roach100     1.3    0.3
treatment   -0.8    0.3
senior      -0.3    0.3

Auxiliary parameter(s):
                      Median MAD_SD
reciprocal_dispersion 0.3    0.0

Sample avg. posterior predictive distribution of y:
         Median MAD_SD
mean_PPD 50.8   30.7

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```

In order to use the PPC functions from the bayesplot package we need a vector y of outcome values and a matrix yrep of draws from the posterior predictive distribution.
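For the rstanarm fits above, both objects are easy to obtain: the outcome is a column of the data, and posterior_predict() returns a draws-by-observations matrix (the optional draws argument caps the number of posterior draws used). A sketch:

```r
y <- roaches$y
# 500 x 262 matrix: each row is one replicated dataset y^rep
yrep_poisson <- posterior_predict(fit_poisson, draws = 500)
yrep_nb <- posterior_predict(fit_nb, draws = 500)

# For example, compare the observed data to 50 replications
ppc_dens_overlay(y, yrep_poisson[1:50, ])
```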

## Providing an interface to bayesplot PPCs from another package

The bayesplot package provides the S3 generic function pp_check. Authors of R packages for Bayesian inference are encouraged to define methods for the fitted model objects created by their packages. This will hopefully be convenient for both users and developers and contribute to the use of the same naming conventions across many of the R packages for Bayesian data analysis.

To provide an interface to bayesplot from your package, you can very easily define a pp_check method (or multiple pp_check methods) for the fitted model objects created by your package. All a pp_check method needs to do is provide the y vector and yrep matrix arguments to the various plotting functions included in bayesplot.

### Defining a pp_check method

Here is an example of how to define a simple pp_check method in a package that creates fitted model objects of class "foo". We will define a method pp_check.foo that extracts the data y and the draws from the posterior predictive distribution yrep from an object of class "foo" and then calls one of the plotting functions from bayesplot.

Suppose that objects of class "foo" are lists with named components, two of which are y and yrep. Here’s a simple method pp_check.foo that offers the user the option of two different plots:

```r
# @param object An object of class "foo".
# @param type The type of plot.
# @param ... Optional arguments passed on to the bayesplot plotting function.
pp_check.foo <- function(object, type = c("multiple", "overlaid"), ...) {
  type <- match.arg(type)
  y <- object[["y"]]
  yrep <- object[["yrep"]]
  stopifnot(nrow(yrep) >= 50)
  samp <- sample(nrow(yrep), size = ifelse(type == "overlaid", 50, 5))
  yrep <- yrep[samp, ]

  if (type == "overlaid") {
    ppc_dens_overlay(y, yrep, ...)
  } else {
    ppc_hist(y, yrep, ...)
  }
}
```

To try out pp_check.foo we can just make a list with y and yrep components and give it class foo:

```r
x <- list(y = rnorm(200), yrep = matrix(rnorm(1e5), nrow = 500, ncol = 200))
class(x) <- "foo"

color_scheme_set("purple")
pp_check(x, type = "multiple", binwidth = 0.3)

color_scheme_set("darkgray")
pp_check(x, type = "overlaid")
```

### Examples of pp_check methods in other packages

Several packages currently use this approach to provide an interface to bayesplot’s graphical posterior predictive checks. See, for example, the pp_check methods in the rstanarm and brms packages.

## References

Buerkner, P. (2017). brms: Bayesian Regression Models using Stan. R package version 1.7.0. https://CRAN.R-project.org/package=brms

Gabry, J., and Goodrich, B. (2017). rstanarm: Bayesian Applied Regression Modeling via Stan. R package version 2.15.3. http://mc-stan.org/rstanarm, https://CRAN.R-project.org/package=rstanarm

Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., Gelman, A. (2018). Visualization in Bayesian workflow. Journal of the Royal Statistical Society Series A, accepted for publication. arXiv preprint: https://arxiv.org/abs/1709.01449

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis. Chapman & Hall/CRC Press, London, third edition.

Stan Development Team. (2017). Stan Modeling Language Users Guide and Reference Manual. http://mc-stan.org/users/documentation/