Controlled variable Selection with Model-X Knockoffs

2018-09-15

This vignette illustrates the basic usage of the knockoff package with Model-X knockoffs. In this scenario we assume that the distribution of the predictors is known (or that it can be well approximated), but we make no assumptions on the conditional distribution of the response. For simplicity, we will use synthetic data constructed from a linear model such that the response only depends on a small fraction of the variables.

set.seed(1234)
# Problem parameters
n = 1000          # number of observations
p = 1000          # number of variables
k = 60            # number of variables with nonzero coefficients
amplitude = 4.5   # signal amplitude (for noise level = 1)

# Generate the variables from a multivariate normal distribution
mu = rep(0,p)
rho = 0.25
Sigma = toeplitz(rho^(0:(p-1)))
X = matrix(rnorm(n*p),n) %*% chol(Sigma)

# Generate the response from a linear model
nonzero = sample(p, k)
beta = amplitude * (1:p %in% nonzero) / sqrt(n)
y.sample = function(X) X %*% beta + rnorm(n)
y = y.sample(X)

First examples

To begin, we call knockoff.filter with all the default settings.

library(knockoff)
result = knockoff.filter(X, y)

We can display the results with

print(result)
## Call:
## knockoff.filter(X = X, y = y)
## 
## Selected variables:
##  [1]   3   9  40  44  46  61  67  78  85 108 148 153 172 173 177 210 223
## [18] 238 248 281 295 301 302 317 319 326 334 343 360 364 378 384 389 421
## [35] 426 428 451 494 506 510 528 534 557 559 595 596 617 668 676 682 708
## [52] 770 775 787 836 844 875 893 906 913 931 937 953 959

The default value for the target false discovery rate is 0.1. In this experiment the false discovery proportion is

fdp = function(selected) sum(beta[selected] == 0) / max(1, length(selected))
fdp(result$selected)
## [1] 0.171875

By default, the knockoff filter creates model-X second-order Gaussian knockoffs. This construction estimates from the data the mean \(\mu\) and the covariance \(\Sigma\) of the rows of \(X\), instead of using the true parameters (\(\mu, \Sigma\)) from which the variables were sampled.

The knockoff package also includes other knockoff construction methods, all of which have names prefixed withknockoff.create. In the next snippet, we generate knockoffs using the true model parameters.

gaussian_knockoffs = function(X) create.gaussian(X, mu, Sigma)
result = knockoff.filter(X, y, knockoffs=gaussian_knockoffs)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs)
## 
## Selected variables:
##  [1]   3   9  40  46  61  85 108 148 153 172 173 177 210 223 238 248 281
## [18] 295 301 302 319 326 334 343 360 364 378 384 389 421 426 428 451 494
## [35] 506 510 528 538 557 559 595 668 676 682 702 708 770 775 787 836 844
## [52] 893 906 913 931 937 953 959

Now the false discovery proportion is

fdp(result$selected)
## [1] 0.1034483

By default, the knockoff filter uses a test statistic based on the lasso. Specifically, it uses the statistic stat.glmnet_coefdiff, which computes \[ W_j = |Z_j| - |\tilde{Z}_j| \] where \(Z_j\) and \(\tilde{Z}_j\) are the lasso coefficient estimates for the jth variable and its knockoff, respectively. The value of the regularization parameter \(\lambda\) is selected by cross-validation and computed with glmnet.

Several other built-in statistics are available, all of which have names prefixed with stat. For example, we can use statistics based on random forests. In addition to choosing different statistics, we can also vary the target FDR level (e.g. we now increase it to 0.2).

result = knockoff.filter(X, y, knockoffs = gaussian_knockoffs, statistic = stat.random_forest, fdr=0.2)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs, 
##     statistic = stat.random_forest, fdr = 0.2)
## 
## Selected variables:
##  [1]   9  85 108 148 172 173 210 223 248 301 343 378 421 426 428 559 595
## [18] 668 708 785 931 953 959
fdp(result$selected)
## [1] 0.04347826

User-defined test statistics

In addition to using the predefined test statistics, it is also possible to use your own custom test statistics. To illustrate this functionality, we implement one of the simplest test statistics from the original knockoff filter paper, namely \[ W_j = \left|X_j^\top \cdot y\right| - \left|\tilde{X}_j^\top \cdot y\right|. \]

my_knockoff_stat = function(X, X_k, y) {
  abs(t(X) %*% y) - abs(t(X_k) %*% y)
}
result = knockoff.filter(X, y, knockoffs = gaussian_knockoffs, statistic = my_knockoff_stat)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs, 
##     statistic = my_knockoff_stat)
## 
## Selected variables:
## integer(0)
fdp(result$selected)
## [1] 0

As another example, we show how to customize the grid of \(\lambda\)’s used to compute the lasso path in the default test statistic.

my_lasso_stat = function(...) stat.glmnet_coefdiff(..., nlambda=100)
result = knockoff.filter(X, y, knockoffs = gaussian_knockoffs, statistic = my_lasso_stat)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs, 
##     statistic = my_lasso_stat)
## 
## Selected variables:
##  [1]   3   9  40  44  46  61  78  85 108 148 153 172 173 177 210 223 238
## [18] 248 281 295 301 302 319 326 334 343 360 364 378 384 389 421 426 428
## [35] 451 494 506 528 538 557 559 595 596 668 682 702 708 770 775 836 844
## [52] 893 906 913 931 937 953 959
fdp(result$selected)
## [1] 0.1206897

The nlambda parameter is passed by stat.glmnet_coefdiff to the glmnet, which is used to compute the lasso path. For more information about this and other parameters, see the documentation for stat.glmnet_coefdiff or glmnet.glmnet.

User-defined knockoff generation functions

In addition to using the predefined procedures for construction knockoff variables, it is also possible to create your own knockoffs. To illustrate this functionality, we implement a simple wrapper for the construction of second-order Model-X knockoffs.

create_knockoffs = function(X) {
  create.second_order(X, shrink=T)
}
result = knockoff.filter(X, y, knockoffs=create_knockoffs)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = create_knockoffs)
## 
## Selected variables:
##  [1]   9  40  44  61  78  85 108 148 153 172 173 177 210 223 238 248 295
## [18] 301 319 326 334 343 360 364 378 384 389 421 426 428 451 494 506 557
## [35] 559 595 651 668 702 708 770 775 844 893 906 913 931 937 953 959
fdp(result$selected)
## [1] 0.08

Approximate vs Full SDP knockoffs

The knockoff package supports two main styles of knockoff variables, semidefinite programming (SDP) knockoffs (the default) and equi-correlated knockoffs. Though more computationally expensive, the SDP knockoffs are statistically superior by having higher power. To create SDP knockoffs, this package relies on the R library [Rdsdp][Rdsdp] to efficiently solve the semidefinite program. In high-dimensional settings, this program becomes computationally intractable. A solution is then offered by approximate SDP (ASDP) knockoffs, which address this issue by solving a simpler relaxed problem based on a block-diagonal approximation of the covariance matrix. By default, the knockoff filter uses SDP knockoffs if \(p<500\) and ASDP knockoffs otherwise.

In this example we generate second-order Gaussian knockoffs using the estimated model parameters and the full SDP construction. Then, we run the knockoff filter as usual.

gaussian_knockoffs = function(X) create.second_order(X, method='sdp', shrink=T)
result = knockoff.filter(X, y, knockoffs = gaussian_knockoffs)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs)
## 
## Selected variables:
##  [1]   9  40  46  61  78  85 108 148 172 173 177 210 223 238 248 281 295
## [18] 301 302 317 319 326 334 343 360 364 378 384 389 421 426 428 451 494
## [35] 506 528 534 557 559 595 596 668 682 702 708 770 775 844 893 906 913
## [52] 931 937 953 959
fdp(result$selected)
## [1] 0.1090909

Equi-correlated knockoffs

Equicorrelated knockoffs offer a computationally cheaper alternative to SDP knockoffs, at the cost of lower statistical power. In this example we generate second-order Gaussian knockoffs using the estimated model parameters and the equicorrelated construction. Then we run the knockoff filter.

gaussian_knockoffs = function(X) create.second_order(X, method='equi', shrink=T)
result = knockoff.filter(X, y, knockoffs = gaussian_knockoffs)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs)
## 
## Selected variables:
##  [1]   3   9  40  44  46  61  78  85 108 146 148 153 172 173 177 210 223
## [18] 238 248 281 295 301 319 326 334 343 360 364 378 384 389 421 426 428
## [35] 451 494 506 528 557 559 595 596 617 651 668 682 702 708 770 775 787
## [52] 844 875 893 906 913 931 937 953 959
fdp(result$selected)
## [1] 0.1333333

See also

If you want to look inside the knockoff filter, see the advanced vignette. If you want to see how to use knockoffs for Fixed-X variables, see the Fixed-X vignette.