# Leave-one-out cross-validation and error correction

#### 2017-09-10

Cross-validation is often used in machine learning to judge how well a model is fit. Instead of using the entire data set to fit the model, it will use one part of the data set to fit a model and then test the model on the remaining data. This gives an idea of how well the model will generalize to indpendent data.

## Leave-one-out predictions using Gaussian processes

Leave-one-out prediction uses an entire model fit to all the data except a single point, and then makes a prediction at that point which can be compared to the actual value. It seems like this may be very expensive to do, but it is actually an inexpensive computation for a Gaussian process model, as long as the same parameters are used from the full model. This will bias the predictions to better results than if parameters were re-estimated.

Normally each prediction point requires solving a matrix equation. To predict the output, $$y$$, at point $$mathbf{x}$$, given input data in matrix $$X_2$$ and output $$\mathbf{y_2}$$, we use the equation $\hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \mu\mathbf{1_n}))$ For leave-one-out predictions, the matrix $$X_2$$ will have all the design points except for the one we are predicting at, and thus will be different for each one. However, we will have the correlation matrix $$R$$ for the full data set from estimating the parameters, and there is a shortcut to find the inverse of a matrix leaving out a single row and column.

There is significant speed-up by using a multiplication instead of a matrix solve. The code chunk below shows that solving with a square matrix with 200 rows is over 30 times slower than a matrix multiplication.

n <- 200
m1 <- matrix(runif(n*n),ncol=n)
b1 <- runif(n)
if (requireNamespace("microbenchmark", quietly = TRUE)) {
microbenchmark::microbenchmark(solve(m1, b1), m1 %*% b1)
}
## Unit: microseconds
##           expr      min       lq       mean   median       uq      max
##  solve(m1, b1) 1972.806 2002.265 2366.32924 2027.923 2532.718 5088.623
##      m1 %*% b1   57.017   58.918   69.18893   59.679   73.363  242.134
##  neval
##    100
##    100

### Getting the inverse of a submatrix

Suppose we have a matrix $$K$$ and know its inverse $$K^{-1}$$. Suppose that $$K$$ has block structure $K = \begin{bmatrix} A~B \\ C~D \end{bmatrix}$ Now we want to find out how to find $$A^{-1}$$ using $$K^{-1}$$ instead of doing the full inverse. We can write $$K^{-1}$$ in block structure $K^{-1} = \begin{bmatrix} E~F \\ G~H \end{bmatrix}$

Now we use the fact that $$K K^{-1} = I$$ $\begin{bmatrix} I~0 \\ 0~I \end{bmatrix} = \begin{bmatrix} A~B \\ C~D \end{bmatrix}\begin{bmatrix} E~F \\ G~H \end{bmatrix}$

This gives the equations $AE + BG = I$ $AF + BH = 0$ $CE + DG = 0$ $CF + DH = I$

Solving the first equation gives that $A = (I - BG)E^{-1}$ or $A^{-1} = E (I - BG) ^{-1}$

### Leave-one-out covariance matrix inverse for Gaussian processes

For Gaussian processes we can consider the block matrix for the covariance (or correlation) matrix where a single row and its corresponding column is being removed. Let the first $$n-1$$ rows and columns be the covariance of the points in design matrix $$X$$, while the last row and column are the covariance for the vector $$\mathbf{x}$$ with $$X$$ and $$\mathbf{x}$$. Then we can have

$K = \begin{bmatrix} C(X,X)~ C(X,\mathbf{x}) \\ C(\mathbf{x},X)~C(\mathbf{x},\mathbf{x}) \end{bmatrix}$

Using the notation from the previous subsection we have $$A = C(X,X)$$ and $$B=C(X,\mathbf{x})$$, and $$E$$ and $$G$$ will be submatrices of the full $$K^{-1}$$. $$B$$ is a column vector, so I’ll write it as a vector $$\mathbf{b}$$, and $$G$$ is a row vector, so I’ll write it as a vector $$\mathbf{g}^T$$. So we have $C(X,X)^{-1} = E(I-C(X,x)G)^{-1}$ So if we want to calculate $A^{-1} = E (I - \mathbf{b}\mathbf{g}^T) ^{-1}$ we still have to invert $$I-BG$$, which is a large matrix. However this can be done efficiently since it is a rank one matrix using the Sherman-Morrison formula. $(I - \mathbf{b}\mathbf{g}^T)^{-1} = I^{-1} - \frac{I^{-1}\mathbf{b}\mathbf{g}^TI^{-1}}{1+\mathbf{g}^TI^{-1}\mathbf{b}} = I - \frac{\mathbf{b}\mathbf{g}^T}{1+\mathbf{g}^T\mathbf{b}}$ Thus we have the shortcut for $$A^{-1}$$ that is only multiplication $A^{-1} = E (I - \frac{\mathbf{b}\mathbf{g}^T}{1+\mathbf{g}^T\mathbf{b}})$

To speed this up it should be calculated as

$A^{-1} = E - \frac{(E\mathbf{b})\mathbf{g}^T}{1+\mathbf{g}^T\mathbf{b}}$ Below demonstrates that we get a speedup of almost twenty by using this shortcut.

set.seed(0)
corr <- function(x,y) {exp(sum(-30*(x-y)^2))}
n <- 200
d <- 2
X <- matrix(runif(n*d),ncol=2)
R <- outer(1:n,1:n, Vectorize(function(i,j) {corr(X[i,], X[j,])}))
Rinv <- solve(R)
A <- R[-n,-n]
Ainv <- solve(A)
E <- Rinv[-n, -n]
b <- R[n,-n]
g <- Rinv[n,-n]
Ainv_shortcut <- E + E %*% b %*% g / (1-sum(g*b))
summary(c(Ainv - Ainv_shortcut))
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## -4.768e-03 -2.300e-08  0.000e+00  0.000e+00  2.300e-08  3.838e-03
if (requireNamespace("microbenchmark", quietly = TRUE)) {
microbenchmark::microbenchmark(solve(A), E + E %*% b %*% g / (1-sum(g*b)))
}
## Unit: microseconds
##                                expr      min       lq      mean   median
##                            solve(A) 6163.212 6297.584 6941.4747 6444.308
##  E + E %*% b %*% g/(1 - sum(g * b))  283.947  316.067  857.7904  346.287
##         uq      max neval
##  7480.3165 10443.33   100
##   389.9995 42558.63   100

In terms of the covariance matrices already calculated, this is the following, where $$M_{-i}$$ is the matrix $$M$$ with the ith row and column removed, and $$M_{i,-i}$$ is the ith row of the matrix $$M$$ with the value from the ith column removed.

$R(X_{-i})^{-1} = R(X)_{-i} - \frac{(R(X)_{-i}R(X)_{-i,i}) (R(X)^{-1})_{i,-i}^T }{1 + (R(X)^{-1})_{i,-i}^T R(X)_{-i,i}}$

### Leave-one-out prediction

Recall that the predicted mean at a new point is $\hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \mu\mathbf{1_n}))$

$\hat{y} = \hat{\mu} + R(\mathbf{x_i},~X_{-1}) R(X_{-i})^{-1}( \mathbf{y_{-i}} - \mu\mathbf{1_n}))$