The sparseHessianFD package provides an interface to ACM TOMS Algorithm 636 for estimating a Hessian that is sparse, in that a large proportion of the cross-partial derivatives are zero. The user provides the following:

1. an R function that returns the value of the objective function whene valuated at a variable vector $$P$$;
2. an R function that returns the gradient of that objective function at $$P$$; and
3. the row and column indices for the non-zero elements in the lower triangle of the Hessian.

The non-zero elements are estimated through finite differencing of the gradients in a way that exploits the sparsity pattern of the Hessian.

There are, of course, other methods to estimate sparse Hessians, such as automatic differentiation (AD). However, operator-overloading implementations of AD require the objective function to be written using specialized libraries. These libraries are available for C++, Fortran, Matlab, Python, and possibly other programming languages and environments. At the present, there are no AD implementations that are “native” to R. Other than deriving the Hessian analytically, finite differencing remains the best option when the objective and gradient functions are written purely in R.

Example function

Before going into the details of how to use the package, let’s consider the following example of an objective function with a sparse Hessian. Suppose we have a dataset of $$N$$ households, each with $$T$$ opportunities to purchase a particular product. Let $$y_i$$ be the number of times household $$i$$ purchases the product, out of the $$T$$ purchase opportunities. Furthermore, let $$p_i$$ be the probability of purchase; $$p_i$$ is the same for all $$T$$ opportunities, so we can treat $$y_i$$ as a binomial random variable. The purchase probability $$p_i$$ is heterogeneous, and depends on both $$k$$ continuous covariates $$x_i$$, and a heterogeneous coefficient vector $$\beta_i$$, such that $p_i=\frac{\exp(x_i'\beta_i)}{1+\exp(x_i'\beta_i)},~i=1 ... N$

The coefficients can be thought of as sensitivities to the covariates, and they are distributed across the population of households following a multivariate normal distribution with mean $$\mu$$ and covariance $$\Sigma$$. We assume that we know $$\Sigma$$, but we do not know $$\mu$$. Instead, we place a multivariate normal prior on $$\mu$$, with mean $$0$$ and covariance $$\Omega_0$$. Thus, each $$\beta_i$$, and $$\mu$$ are $$k-$$dimensional vectors, and the total number of unknown variables in the model is $$(N+1)k$$.

The log posterior density, ignoring any normalization constants, is $\log \pi(\beta_{1:N},\mu|Y, X, \Sigma_0,\Omega_0)=\sum_{i=1}^Np_i^{y_i}(1-p_i)^{T-y_i} -\frac{1}{2}\left(\beta_i-\mu\right)'\Sigma^{-1}\left(\beta_i-\mu\right) -\frac{1}{2}\mu'\Omega_0^{-1}\mu$

Since the $$\beta_i$$ are drawn iid from a multivariate normal, $$\dfrac{\partial^2\log\pi }{\partial\beta_i\beta_j}=0$$ for all $$i\neq j$$. We also know that all of the $$\beta_i$$ are correlated with $$\mu$$. The structure of the Hessian depends on how the variables are ordered within the vector. One such ordering is to group all of the coefficients for each unit together.

$\beta_{11},...,\beta_{1k},\beta_{21},...,\beta_{2k},...~,~...~,~\beta_{N1}~,~...~,~\beta_{Nk},\mu_1,...,\mu_k$

In this case, the Hessian has a “block-arrow” structure. For example, if $$N=6$$ and $$k=2$$, then there are 14 total variables, and the Hessian will have the following pattern.

## 14 x 14 sparse Matrix of class "lgCMatrix"
##
##  [1,] | | . . . . . . . . . . | |
##  [2,] | | . . . . . . . . . . | |
##  [3,] . . | | . . . . . . . . | |
##  [4,] . . | | . . . . . . . . | |
##  [5,] . . . . | | . . . . . . | |
##  [6,] . . . . | | . . . . . . | |
##  [7,] . . . . . . | | . . . . | |
##  [8,] . . . . . . | | . . . . | |
##  [9,] . . . . . . . . | | . . | |
## [10,] . . . . . . . . | | . . | |
## [11,] . . . . . . . . . . | | | |
## [12,] . . . . . . . . . . | | | |
## [13,] | | | | | | | | | | | | | |
## [14,] | | | | | | | | | | | | | |

There are 196 elements in this symmetric matrix, but only 76 are non-zero, and only 45 values are unique. Although the reduction in RAM from using a sparse matrix structure for the Hessian may be modest, consider what would happen if $$N=1000$$ instead. In that case, there are 2002 variables in the problem, and more than $$4$$ million elements in the Hessian. However, only $$12004$$ of those elements are non-zero. If we work with only the lower triangle of the Hessian we only need to work with only 7003 values.

Another possibility is to group coefficients for each covariate together.

$\beta_{11},...,\beta_{1N},\beta_{21},...,\beta_{2N},...,...,\beta_{k1},...,\beta_{kN},\mu_1,...,\mu_k$

Now the Hessian has an “off-diagonal” sparsity pattern.

## 14 x 14 sparse Matrix of class "lgCMatrix"
##
##  [1,] | . . . . . | . . . . . | |
##  [2,] . | . . . . . | . . . . | |
##  [3,] . . | . . . . . | . . . | |
##  [4,] . . . | . . . . . | . . | |
##  [5,] . . . . | . . . . . | . | |
##  [6,] . . . . . | . . . . . | | |
##  [7,] | . . . . . | . . . . . | |
##  [8,] . | . . . . . | . . . . | |
##  [9,] . . | . . . . . | . . . | |
## [10,] . . . | . . . . . | . . | |
## [11,] . . . . | . . . . . | . | |
## [12,] . . . . . | . . . . . | | |
## [13,] | | | | | | | | | | | | | |
## [14,] | | | | | | | | | | | | | |

In both cases, the number of non-zeros is the same. The block-diagonal case may lead to slightly faster initialization, but repeated computation of the Hessian will take the same time as with the “off-diagonal” case.

Using the package

The functions for computing the objective function, gradient and Hessian for this example are in the R/binary.R file. The package also includes a sample dataset with simulated data from the binary choice example.

To start, we load the data, set some dimension parameters, set prior values for $$\Sigma^{-1}$$ and $$\Omega^{-1}$$, and simulate a vector of variables at which to evaluate the function.

set.seed(123)
data(binary)
str(binary)
## List of 3
##  $Y: int [1:30] 3 11 11 9 5 9 19 19 10 14 ... ##$ X: num [1:2, 1:30] 1.5587 0.0705 0.1293 1.7151 0.4609 ...
##  $T: num 20 N <- length(binary$Y)
k <- NROW(binary$X) nvars <- as.integer(N*k + k) P <- rnorm(nvars) ## random starting values priors <- list(inv.Sigma = rWishart(1,k+5,diag(k))[,,1], inv.Omega = diag(k)) This dataset represents the simulated choices for $$N= 30$$ customers over $$T= TRUE$$ purchase opportunties, where the probability of purchase is influenced by $$k= 2$$ covariates. The objective function for the binary choice example is binary.f and the gradient function is binary.grad. The first argument to both is the variable vector, and the argument lists must be the same for both. For this example, we need to provide the data list ($$X$$, $$Y$$ and $$T$$) and the prior parameter list ($$\Sigma^{-1}$$ and $$\Omega^{-1}$$). The functions also have an order.row argument to change the ordering of the variables (and thus, the sparsity pattern). If order.row=TRUE, then the Hessian will have an off-diagonal pattern. If order.row=FALSE, then the Hessian will have a block-arrow pattern. For testing and demonstration purposes, we also have a binary.hess function that returns the Hessian as a sparse dgCMatrix object (see the Matrix package). true.f <- binary.f(P, binary, priors, order.row = FALSE) true.grad <- binary.grad(P, binary, priors, order.row = FALSE) true.hess <- binary.hess(P, binary, priors, order.row = FALSE) The sparsity pattern of the Hessian is specified by two integer vectors: one each for the row and column indices of the non-zero elements of the lower triangular part of the Hessian. If you happen have have an example of a matrix with the same sparsity pattern of the Hessian you are trying to compute, you can use the following convenience function to extract the appropriate index vectors. pattern <- Matrix.to.Coord(tril(true.hess)) str(pattern) ## List of 2 ##$ rows: int [1:213] 1 2 61 62 2 61 62 3 4 61 ...
##  $cols: int [1:213] 1 1 1 1 2 2 2 3 3 3 ... Next, we create a new instance of a sparseHessianFD object with an “initial variable” $$P$$, and the row and column indices of the non-zero elements in the lower triangle of the Hessian. We also pass in any other arguments for binary.f and binary.grad. We accept the default values for other arguments to sparseHessianFD.new. obj <- sparseHessianFD.new(P, binary.f, binary.grad, rows=pattern$rows, cols=pattern$cols, data=binary, priors=priors, order.row=FALSE) Now we can evaluate the function value, gradient and Hessian through obj. f <- obj$fn(P)
gr <- obj$gr(P) hs <- obj$hessian(P)

Note that the member functions in the sparseHessianFD class take only one argument: the variable vector. All of the other arguments are stored in obj.

Do we get the same results that we would get after calling binary.f, binary.grad and binary.hess directly? Let’s see.

all.equal(f, true.f)
## [1] TRUE
## [1] TRUE
all.equal(hs, true.hess)
## [1] TRUE

If there is any difference, keep in mind that hs is a numerical estimate that is not always exact. I certainly wouldn’t worry about mean relative differences smaller than, say, $$10^{-6}$$.

Speed comparison

Instead of using this package, we could treat the Hessian as dense, and use the hessian function numDeriv package. The advantage of the numDeriv package is that it does not require the gradient. However, you can see that it takes some time to run.

library(numDeriv)
hess.time <- system.time(H1 <- obj$hessian(P)) fd.time <- system.time(H2 <- hessian(obj$fn, P))
H2 <- drop0(H2, 1e-07)  ## treat values < 1e-8 as zero
print(hess.time)
##    user  system elapsed
##   0.002   0.000   0.002
print(fd.time)
##    user  system elapsed
##   3.388   0.022   3.411

The sparseHessianFD package can be substantially faster than estimating a dense Hessian by brute force finite differencing. The cost of this speed up is that the user does need to provide the gradient and the sparsity structure. As with everything in life, there are trade-offs.

Quick summary

In short, to use the package, follow the following steps:

1. Write R functions to return the value and gradient of the objective function.
2. Determine the row and column indices of the non-zero elements of the lower triangle of the Hessian.
3. Pick a variable vector (i.e., a starting value) at which you can initialize the sparseHessian object. It doesn’t really matter what this vector is, as long as the function value and gradient elements are all finite.
4. Create a new sparseHessianFD object using the sparseHessianFD.new function. For this example, call that object F.
5. Compute Hessian at x by calling F\$hessian(x).
6. ???
7. Profit.