The sparseHessianFD package provides an interface to ACM TOMS Algorithm 636 for estimating a Hessian that is sparse, in that a large proportion of the cross-partial derivatives are zero. The user provides the objective function, its gradient, and the sparsity pattern of the Hessian (the row and column indices of its non-zero elements). The non-zero elements are then estimated through finite differencing of the gradient in a way that exploits the sparsity pattern of the Hessian.
There are, of course, other methods to estimate sparse Hessians, such as automatic differentiation (AD). However, operator-overloading implementations of AD require the objective function to be written using specialized libraries. These libraries are available for C++, Fortran, Matlab, Python, and possibly other programming languages and environments. At present, there are no AD implementations that are “native” to R. Other than deriving the Hessian analytically, finite differencing remains the best option when the objective and gradient functions are written purely in R.
Before going into the details of how to use the package, let’s consider the following example of an objective function with a sparse Hessian. Suppose we have a dataset of \(N\) households, each with \(T\) opportunities to purchase a particular product. Let \(y_i\) be the number of times household \(i\) purchases the product, out of the \(T\) purchase opportunities. Furthermore, let \(p_i\) be the probability of purchase; \(p_i\) is the same for all \(T\) opportunities, so we can treat \(y_i\) as a binomial random variable. The purchase probability \(p_i\) is heterogeneous, and depends on both a vector of \(k\) continuous covariates \(x_i\) and a heterogeneous coefficient vector \(\beta_i\), such that \[ p_i=\frac{\exp(x_i'\beta_i)}{1+\exp(x_i'\beta_i)},~i=1,\dots,N \]
The coefficients can be thought of as sensitivities to the covariates, and they are distributed across the population of households following a multivariate normal distribution with mean \(\mu\) and covariance \(\Sigma\). We assume that we know \(\Sigma\), but we do not know \(\mu\). Instead, we place a multivariate normal prior on \(\mu\), with mean \(0\) and covariance \(\Omega_0\). Thus, each \(\beta_i\), as well as \(\mu\), is a \(k\)-dimensional vector, and the total number of unknown variables in the model is \((N+1)k\).
The log posterior density, ignoring any normalization constants, is \[ \log \pi(\beta_{1:N},\mu|Y, X, \Sigma,\Omega_0)=\sum_{i=1}^N\left[y_i\log p_i+(T-y_i)\log(1-p_i) -\frac{1}{2}\left(\beta_i-\mu\right)'\Sigma^{-1}\left(\beta_i-\mu\right)\right] -\frac{1}{2}\mu'\Omega_0^{-1}\mu \]
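As a concrete illustration, a direct (dense) R implementation of this log posterior might look like the sketch below. This is for exposition only and is not the binary.f function supplied by the package; it assumes beta is stored as a \(k \times N\) matrix and mu as a \(k\)-vector.
log.post <- function(beta, mu, Y, X, T, inv.Sigma, inv.Omega) {
  eta <- colSums(X * beta)            ## x_i' beta_i for each household i
  p <- plogis(eta)                    ## purchase probabilities
  loglik <- sum(Y * log(p) + (T - Y) * log(1 - p))
  d <- beta - mu                      ## mu recycles down each column
  prior.beta <- -0.5 * sum(colSums(d * (inv.Sigma %*% d)))
  prior.mu <- -0.5 * as.numeric(crossprod(mu, inv.Omega %*% mu))
  loglik + prior.beta + prior.mu
}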
Since the \(\beta_i\) are drawn iid from a multivariate normal, \(\dfrac{\partial^2\log\pi}{\partial\beta_i\,\partial\beta_j}=0\) for all \(i\neq j\). We also know that all of the \(\beta_i\) are correlated with \(\mu\). The structure of the Hessian depends on how the variables are ordered within the vector. One such ordering is to group all of the coefficients for each unit together.
\[ \beta_{11},...,\beta_{1k},\beta_{21},...,\beta_{2k},...~,~...~,~\beta_{N1}~,~...~,~\beta_{Nk},\mu_1,...,\mu_k \]
In this case, the Hessian has a “block-arrow” structure. For example, if \(N=6\) and \(k=2\), then there are 14 total variables, and the Hessian will have the following pattern.
## 14 x 14 sparse Matrix of class "lgCMatrix"
##
## [1,] | | . . . . . . . . . . | |
## [2,] | | . . . . . . . . . . | |
## [3,] . . | | . . . . . . . . | |
## [4,] . . | | . . . . . . . . | |
## [5,] . . . . | | . . . . . . | |
## [6,] . . . . | | . . . . . . | |
## [7,] . . . . . . | | . . . . | |
## [8,] . . . . . . | | . . . . | |
## [9,] . . . . . . . . | | . . | |
## [10,] . . . . . . . . | | . . | |
## [11,] . . . . . . . . . . | | | |
## [12,] . . . . . . . . . . | | | |
## [13,] | | | | | | | | | | | | | |
## [14,] | | | | | | | | | | | | | |
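This pattern can be reproduced by hand with the Matrix package. The following sketch (not package code) builds it for the \(N=6\), \(k=2\) case.
library(Matrix)
N <- 6; k <- 2
H <- matrix(0, (N + 1) * k, (N + 1) * k)
for (i in 1:N) {
  idx <- ((i - 1) * k + 1):(i * k)
  H[idx, idx] <- 1                   ## k x k diagonal block for beta_i
}
H[N * k + 1:k, ] <- 1                ## cross-partials between mu and all variables
H[, N * k + 1:k] <- 1
Matrix(H != 0, sparse = TRUE)        ## prints the sparsity pattern shown above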
There are 196 elements in this symmetric matrix, but only 76 are non-zero, and only 45 values are unique. Although the reduction in RAM from using a sparse matrix structure for the Hessian may be modest, consider what would happen if \(N=1000\) instead. In that case, there are 2002 variables in the problem, and more than \(4\) million elements in the Hessian. However, only \(12004\) of those elements are non-zero, and if we store only the lower triangle of the Hessian, we need to keep just 7003 values.
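These counts follow directly from the structure: \(N\) dense \(k\times k\) diagonal blocks, the cross-partials between each \(\beta_i\) and \(\mu\), and the dense \(k\times k\) block for \(\mu\). A quick check of the arithmetic:
N1 <- 1000; k1 <- 2                                      ## separate names, so the small example above is untouched
(N1 + 1) * k1                                            ## 2002 variables
((N1 + 1) * k1)^2                                        ## 4,008,004 elements in the full Hessian
N1 * k1^2 + 2 * N1 * k1^2 + k1^2                         ## 12,004 non-zero elements
N1 * k1 * (k1 + 1) / 2 + N1 * k1^2 + k1 * (k1 + 1) / 2   ## 7,003 values in the lower triangle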
Another possibility is to group coefficients for each covariate together.
\[ \beta_{11},...,\beta_{1N},\beta_{21},...,\beta_{2N},...,...,\beta_{k1},...,\beta_{kN},\mu_1,...,\mu_k \]
Now the Hessian has an “off-diagonal” sparsity pattern.
## 14 x 14 sparse Matrix of class "lgCMatrix"
##
## [1,] | . . . . . | . . . . . | |
## [2,] . | . . . . . | . . . . | |
## [3,] . . | . . . . . | . . . | |
## [4,] . . . | . . . . . | . . | |
## [5,] . . . . | . . . . . | . | |
## [6,] . . . . . | . . . . . | | |
## [7,] | . . . . . | . . . . . | |
## [8,] . | . . . . . | . . . . | |
## [9,] . . | . . . . . | . . . | |
## [10,] . . . | . . . . . | . . | |
## [11,] . . . . | . . . . . | . | |
## [12,] . . . . . | . . . . . | | |
## [13,] | | | | | | | | | | | | | |
## [14,] | | | | | | | | | | | | | |
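The two orderings describe the same Hessian with the variables permuted; the off-diagonal pattern is just a symmetric permutation of the block-arrow one. Continuing the hand-built sketch from above (with its H, N and k):
perm <- c(as.vector(t(matrix(1:(N * k), nrow = k))), N * k + 1:k)   ## covariate-major order
Matrix(H[perm, perm] != 0, sparse = TRUE)                           ## reproduces the off-diagonal pattern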
In both cases, the number of non-zeros is the same. The block-arrow case may lead to slightly faster initialization, but repeated computation of the Hessian will take the same time as with the “off-diagonal” case.
The functions for computing the objective function, gradient and Hessian for this example are in the R/binary.R file. The package also includes a sample dataset with simulated data from the binary choice example.
To start, we load the data, set some dimension parameters, set prior values for \(\Sigma^{-1}\) and \(\Omega^{-1}\), and simulate a vector of variables at which to evaluate the function.
set.seed(123)
data(binary)
str(binary)
## List of 3
## $ Y: int [1:30] 3 11 11 9 5 9 19 19 10 14 ...
## $ X: num [1:2, 1:30] 1.5587 0.0705 0.1293 1.7151 0.4609 ...
## $ T: num 20
N <- length(binary$Y)
k <- NROW(binary$X)
nvars <- as.integer(N*k + k)
P <- rnorm(nvars) ## random starting values
priors <- list(inv.Sigma = rWishart(1,k+5,diag(k))[,,1],
               inv.Omega = diag(k))
This dataset represents the simulated choices for \(N=30\) customers over \(T=20\) purchase opportunities, where the probability of purchase is influenced by \(k=2\) covariates.
The objective function for the binary choice example is binary.f, and the gradient function is binary.grad. The first argument to both is the variable vector, and the argument lists must be the same for both. For this example, we need to provide the data list (\(X\), \(Y\) and \(T\)) and the prior parameter list (\(\Sigma^{-1}\) and \(\Omega^{-1}\)). The functions also have an order.row argument to change the ordering of the variables (and thus, the sparsity pattern). If order.row=TRUE, then the Hessian will have an off-diagonal pattern; if order.row=FALSE, then the Hessian will have a block-arrow pattern.

For testing and demonstration purposes, we also have a binary.hess function that returns the Hessian as a sparse dgCMatrix object (see the Matrix package).
true.f <- binary.f(P, binary, priors, order.row = FALSE)
true.grad <- binary.grad(P, binary, priors, order.row = FALSE)
true.hess <- binary.hess(P, binary, priors, order.row = FALSE)
The sparsity pattern of the Hessian is specified by two integer vectors: one each for the row and column indices of the non-zero elements of the lower triangular part of the Hessian. If you happen to have a matrix with the same sparsity pattern as the Hessian you are trying to compute, you can use the following convenience function to extract the appropriate index vectors.
pattern <- Matrix.to.Coord(tril(true.hess))
str(pattern)
## List of 2
## $ rows: int [1:213] 1 2 61 62 2 61 62 3 4 61 ...
## $ cols: int [1:213] 1 1 1 1 2 2 2 3 3 3 ...
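If no template matrix is available, the index vectors can also be built directly from the known structure of the problem. The helper below is purely hypothetical (it is not part of the package); it constructs the lower-triangle indices for the block-arrow ordering and, for this example, should reproduce the 213 index pairs shown above.
make.pattern <- function(N, k) {
  rows <- integer(0); cols <- integer(0)
  for (i in 1:N) {
    off <- (i - 1) * k
    for (j in 1:k) {
      rows <- c(rows, off + j:k, N * k + 1:k)        ## within-block rows, then the mu rows
      cols <- c(cols, rep(off + j, (k - j + 1) + k))
    }
  }
  for (j in 1:k) {                                   ## lower triangle of the mu block
    rows <- c(rows, N * k + j:k)
    cols <- c(cols, rep(N * k + j, k - j + 1))
  }
  list(rows = rows, cols = cols)
}
all.equal(lapply(make.pattern(N, k), as.integer), pattern)   ## expected to agree with pattern above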
Next, we create a new instance of a sparseHessianFD object with an “initial variable” \(P\), and the row and column indices of the non-zero elements in the lower triangle of the Hessian. We also pass in any other arguments for binary.f and binary.grad, and accept the default values for the remaining arguments to sparseHessianFD.new.
obj <- sparseHessianFD.new(P, binary.f, binary.grad,
                           rows=pattern$rows, cols=pattern$cols,
                           data=binary, priors=priors,
                           order.row=FALSE)
Now we can evaluate the function value, gradient and Hessian through obj.
f <- obj$fn(P)
gr <- obj$gr(P)
hs <- obj$hessian(P)
Note that the member functions in the sparseHessianFD class take only one argument: the variable vector. All of the other arguments are stored in obj.
Do we get the same results that we would get after calling binary.f, binary.grad and binary.hess directly? Let’s see.
all.equal(f, true.f)
## [1] TRUE
all.equal(gr, true.grad)
## [1] TRUE
all.equal(hs, true.hess)
## [1] TRUE
If there is any difference, keep in mind that hs is a numerical estimate that is not always exact. I certainly wouldn’t worry about mean relative differences smaller than, say, \(10^{-6}\).
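If such small discrepancies do appear, an explicit tolerance can be passed to the comparison, for example:
all.equal(hs, true.hess, tolerance = 1e-6)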
Instead of using this package, we could treat the Hessian as dense, and use the hessian function from the numDeriv package. The advantage of the numDeriv package is that it does not require the gradient. However, you can see that it takes considerably longer to run.
library(numDeriv)
hess.time <- system.time(H1 <- obj$hessian(P))
fd.time <- system.time(H2 <- hessian(obj$fn, P))
H2 <- drop0(H2, 1e-07) ## treat values smaller than 1e-7 as zero
print(hess.time)
## user system elapsed
## 0.002 0.000 0.002
print(fd.time)
## user system elapsed
## 3.388 0.022 3.411
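The relative advantage can be read directly off these timings:
fd.time["elapsed"] / hess.time["elapsed"]   ## elapsed-time ratio, dense vs. sparse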
The sparseHessianFD package can be substantially faster than estimating a dense Hessian by brute-force finite differencing. The cost of this speed-up is that the user does need to provide the gradient and the sparsity structure. As with everything in life, there are trade-offs.
In short, to use the package, follow these steps: