# Robust epistasis detection with epiGWAS

#### 2019-05-21

In this vignette, we cover the epistasis detection methods implemented in this package. The methods can be partitioned into two main categories: those based on modified outcome, and those based on outcome weighted learning. Both methods recover pure epistatic interactions with a predetermined variant, referred to as the target. The target can be drawn from the literature, experiments or top hits in previous GWAS. Narrowing the scope around a single variant is made possible by propensity scores (Rosenbaum and Rubin 1983) which, for genomic data, model the linkage disequilibrium (LD) dependency between the target and neighboring variants. We include them differently in outcome weighted learning and modified outcome in order to identify the SNPs interacting with the target.

The methods are briefly reviewed in this vignette before showcasing their performance on a dataset of simulated genotypes. For more details, we refer the prospective user to (Slim et al. 2018).

## Phenotype-genotype decomposition

We first consider a triplet of random variables $$\left(X, A, Y\right)$$:

• $$Y$$ denotes a binary (for case-control studies) or continuous phenotype (for quantitative GWAS).
• $$X= \left(X_1,\cdots, X_p\right) \in \lbrace 0, 1, 2\rbrace^{p}$$ represents a genotype with $$p$$ single-nucleotide polymorphisms (SNPs). $$X_j$$ encodes the number of minor alleles of SNP j (allelic dosage).
• $$A$$ is a $$(p+1)$$-th SNP that is encoded as $$\lbrace -1, +1\rbrace$$. Let $$\underline{A} = (A+1)/2$$ be a second binarized version of $$A$$ with values in $$\lbrace 0,+1\rbrace$$. Depending on the binarization rule for the SNP values $$\lbrace 0, 1, 2\rbrace$$, we can model both dominant and recessive mechanisms.

The symmetric encoding of $$A$$ allows the following genotype-phenotype decomposition: $Y = \mu(X) + \delta(X)\cdot A + \epsilon,$ where $$\epsilon$$ is a zero mean random variable and \left\{ \begin{aligned} \mu (X) &= \frac{1}{2}\left[\mathbb{E}(Y\lvert A=+1,X)+\mathbb{E}(Y\lvert A=-1,X)\right] \,,\\ \delta (X) &= \frac{1}{2}\left[\mathbb{E}(Y\lvert A=+1,X)-\mathbb{E}(Y\lvert A=-1,X)\right] \,. \end{aligned} \right.

The above decomposition separates the main effects term $$\mu(X)$$ from $$\delta(X)\cdot A$$, which models the pure epistatic effects of the SNPs in $$X$$ with the target SNP $$A$$. Under a sparsity assumption for $$\delta(X)$$, detecting epistasis amounts to recovering the support of $$\delta(X)$$. However, for a given sample, we only observe one of the two possibilities (either $$A = +1$$ or $$A = -1$$), making it impossible to directly estimate the term $$\delta(X)$$. To overcome this problem, we make use of the propensity score $$\pi(A\lvert X)$$. Mathematically speaking, it corresponds to the conditional probability of $$A$$ given $$X$$. In our case, where $$A$$ and $$X$$ are SNPs, $$\pi(A\lvert X)$$ models the LD between $$A$$ and $$X$$. The first category of methods, which we call modified outcome, incorporates $$\pi(A\lvert X)$$ in the outcome. The second category, outcome weighted learning, includes them in the sample weights along with the phenotype $$Y$$. Both categories are penalized regression approaches to which we apply a stability selection procedure (Meinshausen and Bühlmann 2010) for support estimation.

## Modified outcome

In modified outcome, we substitute $$A$$ with $$\underline{A}$$ to rewrite $$\delta(X)$$ in the following way: $\delta(X) = \mathbb{E} \left[Y\left(\frac{\underline{A}}{\pi(\underline{A}=1\lvert X)} - \frac{1 - \underline{A}}{\pi(\underline{A}=0\lvert X)}\right)\Bigg\lvert X\right]$

Let $$\underline{Y}$$ denote the modified outcome: $\underline{Y} = Y\left(\frac{\underline{A}}{\pi(\underline{A}=1\lvert X)} - \frac{1 - \underline{A}}{\pi(\underline{A}=0\lvert X)}\right)$

The risk difference term $$\delta(X)$$ is then simplified to: $\delta(X) = \frac{1}{2}\mathbb{E}\left[\underline{Y}\lvert X\right]$

We can then recover the support of $$\delta(X)$$ by applying a model selection procedure to the penalized regression problem where the sample covariates are $$X$$ and the outcome is $$\underline{Y}$$. However, in case of misspecification of the propensity score, modified outcome may suffer from numerical instability. We therefore propose three extensions to help mitigate this effect. The first extension, shifted modified outcome, consists in the addition of a regularization term $$\xi$$ to the inverses of the propensity scores i.e. $$1/(\pi(A\lvert X) + \xi)$$. The second proposition, normalized modified outcome, respectively normalizes $$\underline{A}/\pi(\underline{A} = 1\lvert X)$$ and $$(1-\underline{A})/\pi(\underline{A} = 0\lvert X)$$ by their sums, $$\sum_{i=1}^{n} \dfrac{\underline{A}^{(i)}}{\pi(\underline{A}^{(i)} = 1\lvert X^{(i)})}$$ and $$\sum_{i=1}^{n} \dfrac{1-\underline{A}^{(i)}}{\pi(\underline{A}^{(i)} = 0\lvert X^{(i)})}$$. The last but certainly not least proposition is robust modified outcome (see Slim et al. 2018; also Lunceford and Davidian 2004). In extensive simulations (Slim et al. 2018), it outperformed not only the other approaches within the modified outcome family, but also BOOST (Wan et al. 2010), a state-of-the-art method for epistasis detection.

## Outcome weighted learning

In outcome weighted learning, instead of the estimation of the difference of $$\mathbb{E}(Y\lvert A = +1, X)$$ and $$\mathbb{E}(Y\lvert A = -1, X)$$, we predict their $$\log$$-ratio:

$d(X) = \ln \frac{\mathbb{E}(Y\lvert A = +1, X)}{\mathbb{E}(Y\lvert A = +1, X)}$

The verification of $$\text{sign}\; \delta(X) = \text{sign}\; d(X)$$ is straightforward. This makes outcome weighted learning a relaxation of modified outcome. However, the regression models are completely unrelated. Outcome weighted learning is a weighted binary classification problem where the sample weights are $$Y/\pi(A\lvert X)$$, the outcome is the target $$A$$ and the covariates remain $$X$$. Without regularization, the use of the inverses of propensity scores can also result in numerical instability.

## Case study

Now that we have exposed the theoretical grounds of our methods, we explain how to use them in practice for epistasis detection. For that purpose, we illustrate their usage on a synthetic dataset included with this package. Using HAPGEN2 (Su, Marchini, and Donnelly 2011), we simulated $$450$$ SNPs on the $$22^{nd}$$ chromosome between the nucleotide positions $$16061016$$ and $$19976834$$ in the GRCh37 coordinates. The prior QC steps to control for rare variants ($$\text{MAF} < 0.01$$) and Hardy–Weinberg equilibrium ($$p < 10^{-6}$$) have already been performed. The simulated genotypes are saved as an integer matrix. We also included their minor allele frequencies (MAFs).

The first step in the pipeline is to load the genotypes matrix and the SNP MAFs vector.

require("epiGWAS")
data(genotypes)
data(maf)

### Linkage disequilibrium around the target

To alleviate issues of linkage disequilibrium around the target $$A$$ and avoid overfitting in the estimation of the propensity scores $$\pi(A\lvert X)$$, we remove all SNPs within a certain window of $$A$$. On each side of the target, the width of the window is of three clusters. The clusters are the result of an unsupervised clustering procedure such as hierarchical clustering. Compared to fixed-size windows, such dynamic windows allow to better account for genetic architecture.

set.seed(347)
sigma <- cor(genotypes)
sigma_distance <- as.dist(1 - abs(sigma))
hc <- hclust(sigma_distance, method = "single")
corr_max <- 0.5
clusters <- cutree(hc, h = 1 - corr_max)

### Genotype construction

After the clustering procedure, we can sample the causal SNPs. Beside the target, the other causal variants are sampled outside of the LD window. In total, we sample $$80$$ SNPs that interact with the target, $$20$$ SNPs with marginal effects and $$20$$ additional SNP pairs with epistatic effects. Moreover, among the $$80$$ synergistic SNPs, $$10$$ also have separate marginal effects and another $$5$$ have additional epistatic effects (with another SNP than the target). The sample_SNP function samples at most one causal SNP per cluster to avoid duplication of effects. Despite the high number of SNPs selected to be causal, the problem is still imbalanced with $$80$$ out of $$414$$ SNPs being causal.

# Parameterization of the disease model
window_target <- 3 # Width of the LD window on each side of the target
nX <- 80 # Number of SNPs interacting with the target
nY <- 20 # Number of SNPs with marginal effects
nZ12 <- 20 # Number of SNP pairs with epistatic effects
overlap_marg <- 10 # Number of SNPs interacting with the target in addition to having marginal effects
overlap_inter <- 5 # Number of SNPs interacting with the target in addition to having epistatic effects
set.seed(347)
causal <- sample_SNP(
nX, nY, nZ12,
clusters, maf, thresh_MAF = 0.01,
window_size = window_target, overlap_marg = overlap_marg, overlap_inter = overlap_inter
)
clusters <- merge_cluster(clusters, center = clusters[causal$target], k = window_target) Finally, we only retain the target from all SNPs in its surrounding LD window: genotypes <- genotypes[, (clusters != clusters[causal$target]) | (colnames(genotypes) == causal$target)] ### Phenotype simulation The phenotypes are simulated according to a logistic model in which the effect sizes are sampled from a normal distribution of mean $$0$$ and standard deviation $$1$$. set.seed(347) model <- gen_model(nX, nY, nZ12, mean = rep(0, 4), sd = rep(1, 4)) # Sampling of size effects phenotype <- sim_phenotype(genotypes, causal, model) # Phenotype simulation ### Epistasis detection As the fastPHASE software cannot be included with this package, we directly provide the propensity scores vector. The results can be reproduced by running fast_HMM with the dimensionality of the latent space n_state = 10 and the number of iterations for the EM algorithm n_iter = 20. Before applying our epistasis detection methods, we separate the target $$A$$ from the rest of the genotype, denoted here by $$X$$: data("propensity") A <- genotypes[, causal$target] > 0
X <- genotypes[, colnames(genotypes) != causal\$target]

We now run all methods with their default settings, which generally offers a good trade-off between speed and inference performance

stability_scores <- epiGWAS(A, X, phenotype, propensity,
methods = c("OWL", "modified_outcome", "shifted_outcome",
"normalized_outcome", "robust_outcome"),
parallel = FALSE)

The last step is to evaluate the epistasis detection performance in terms of the areas under the receiver operating characteristic (ROC) and precision/recall (PR) curves.

#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang
Areas under the ROC and PR curves
Method ROC PRC
Modified outcome 0.7010269 0.3399078
Normalized modified outcome 0.6680241 0.3237371
OWL 0.5243626 0.1959425
Robust modified outcome 0.6432897 0.2944554
Shifted modified outcome 0.6751771 0.3079225

The results are perfectly concordant with our findings in (Slim et al. 2018). Among the five methods, robust modified outcome is obviously the best performer in terms of areas under both the ROC and the PR curves.

## References

Lunceford, Jared K., and Marie Davidian. 2004. “Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study.” Statistics in Medicine 23 (19): 2937–60. https://doi.org/10.1002/sim.1903.

Meinshausen, Nicolai, and Peter Bühlmann. 2010. “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (4): 417–73. https://doi.org/10.1111/j.1467-9868.2010.00740.x.

Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.

Slim, Lotfi, Clément Chatelain, Chloé-Agathe Azencott, and Jean-Philippe Vert. 2018. “Novel Methods for Epistasis Detection in Genome-Wide Association Studies.” bioRxiv, January. http://biorxiv.org/content/early/2018/10/14/442749.

Su, Zhan, Jonathan Marchini, and Peter Donnelly. 2011. “HAPGEN2: Simulation of Multiple Disease SNPs.” Bioinformatics 27 (16): 2304–5. https://doi.org/10.1093/bioinformatics/btr341.

Wan, Xiang, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L. S. Tang, and Weichuan Yu. 2010. “BOOST: A Fast Approach to Detecting Gene-Gene Interactions in Genome-Wide Case-Control Studies.” The American Journal of Human Genetics 87 (3): 325–40. https://doi.org/10.1016/j.ajhg.2010.07.021.