Efficient Computation of the Poisson-Binomial Distribution

Introduction

The Poisson binomial distribution is becoming increasingly important, especially in the areas of statistics, finance, insurance mathematics and quality management. Still, only a few R packages exist for its calculation, namely poibin and poisbinom. Both are based on Hong (2013). The first one implements the exact DFT-CF algorithm along with the exact recursive method, which is very memory-demanding, and Normal and Poisson approximations. The latter package only provides a more efficient DFT-CF implementation. Unfortunately, it sometimes yields negative probabilities, especially for large distributions. This numerical issue has not been addressed to date. Biscarri, Zhao & Brunner (2018) developed two more efficient procedures, but at the time of this writing, no package exists that implements them, because the authors published a reference implementation for R, but refrained from releasing it as a package. In addition to the disadvantages regarding computational speed, especially for very large distributions, poibin and poisbinom do not provide headers to their internal C/C++ functions, so that they cannot be imported directly by C or C++ code of other packages that use for example Rcpp. In our own project, we often have to deal with Poisson binomial distributions that include Bernoulli trials with \(p_i \in \{0, 1\}\). Computation can be further optimized by handling these trials before the actual computations. None of the aforementioned packages do that. That is why we decided to develop PoissonBinomial. We needed a package that

While implementing the procedures of Biscarri, Zhao & Brunner (2018), it was decided to also include all methods that are described in Hong (2013), together with three additional binomial approximations. In total, the PoissonBinomial package includes 10 different algorithms of computing the Poisson binomial distribution, including optimized versions of the Normal, Refined Normal and Poisson Approaches. The exact recursive procedure has been improved so it is hugely less memory-demanding: the poibin implementation needs the memory equivalent of roughly \((n + 1)^2\) values of type double, while ours only needs \(3 \cdot (n + 1)\).

Exact Procedures

In this package, the following exact algorithms for computing the Poisson Binomial distribution with Bernoulli probabilities \(p_1, ..., p_n\) are implemented:

Examples and performance comparisons of these procedures are provided in a separate vignette.

Approximations

In addition, the following approximation methods are provided:

Again, examples and performance comparisons are for these approaches are presented in a separate vignette as well.

Handling special cases, zeros and ones

Unfortunately, some approximations do not work at all for Bernoulli trials with \(p_i = 1\). This is why handling these values before the actual computation of the distribution is not only a performance tweak, but sometimes even a necessity. It is achieved by some simple preliminary considerations:

  1. Are all \(p_i\) equal?
    In this case, we have an ordinary binomial distribution. The specified method of computation is then ignored. In particular, the following applies:
    1. All \(p_i = 0\): The only observable value is \(0\), i.e. \(P(X = 0) = 1\) and \(P(X \neq 0) = 0\).
    2. All \(p_i = 1\): The only observable value is \(n\), i.e. \(P(X = n) = 1\) and \(P(X \neq n) = 0\).
  2. Are all of the \(p_i \in \{0, 1\} (i = 1, ..., n)\)?
    If one \(p_i\) is 1, it is impossible to measure 0 successes. Following the same logic, if two \(p_i\) are 1, we cannot observe 0 and 1 successes and so on. In general, a number of \(n_1\) values with \(p_i = 1\) makes it impossible to measure \(0, ..., n_1 - 1\) successes. Likewise, if there are \(n_0\) Bernoulli trials with \(p_i = 0\), we cannot observe \(n - n_0 + 1, ..., n\) successes. If all \(p_i \in \{0, 1\}\), it holds \(n = n_0 + n_1\). As a result, the only observable value is \(n_1\), i.e. \(P(X = n_1) = 1\) and \(P(X \neq n_1) = 0\).
  3. Are there \(p_i \notin \{0, 1\}\)?
    Using the same logic as before, we can only observe an “inner” distribution in the range of \(n_1, n_1 + 1, ..., n - n_0\), i.e. \(P(X \in \{n_1, ..., n - n_0\}) > 0\) and \(P(X < n_1) = P(X > n - n_0) = 0\). As a result, \(X\) can be expressed as \(X = n_1 + Y\) with \(Y \sim PBin(\{p_i|0 < p_i < 1\})\) and \(|\{p_i|0 < p_i < 1\}| = n - n_0 - n_1\). Subsequently, the Poisson binomial distribution must only be computed for \(Y\). Especially, if there is only one \(p_i \notin \{0, 1\}\), \(Y\) follows a Bernoulli distribution with parameter \(p_i\), i.e. \(P(X = n_1) = P(Y = 0) = 1 - p_i\) and \(P(X = n_1 + 1) = P(Y = 1) = p_i\).

These cases are illustrated in the following example:

library(PoissonBinomial)

# Case 1
dpbinom(NULL, rep(0.3, 7))
#> [1] 0.0823543 0.2470629 0.3176523 0.2268945 0.0972405 0.0250047 0.0035721
#> [8] 0.0002187
dbinom(0:7, 7, 0.3)
#> [1] 0.0823543 0.2470629 0.3176523 0.2268945 0.0972405 0.0250047 0.0035721
#> [8] 0.0002187
# equal results
dpbinom(NULL, c(0, 0, 0, 0, 0, 0, 0))
#> [1] 1 0 0 0 0 0 0 0
dpbinom(NULL, c(1, 1, 1, 1, 1, 1, 1))
#> [1] 0 0 0 0 0 0 0 1

# Case 2
dpbinom(NULL, c(0, 0, 0, 0, 1, 1, 1))
#> [1] 0 0 0 1 0 0 0 0

# Case 3
dpbinom(NULL, c(0, 0, 0.1, 0.2, 0.4, 0.8, 1))
#> [1] 0.0000 0.0864 0.4344 0.3784 0.0944 0.0064 0.0000 0.0000
dpbinom(NULL, c(0, 0, 0.4, 1))
#> [1] 0.0 0.6 0.4 0.0 0.0

Usage with Rcpp

How to import and use the internal C++ functions in Rcpp based packages is described in a separate vignette.