Abstract

Association measures can be local or global. Local association measures quantify the association between specific values of random variables. The `zebu` R package allows estimation of local association measures and implements local association subgroup analysis, thus filling an unmet need. It is of interest to a wide range of scientific disciplines, such as the health and computer sciences, and can be used by anyone with a basic knowledge of the R language. It is provided under a GPL-3 license; source code is available at http://github.com/olivmrtn/zebu.

Keywords: measure of association, statistical independence, local association, local association subgroup analysis, pointwise mutual information, Ducher’s Z, R package

- Introduction
- Background on Association and Independence
- Local Association Measures
- Local Association Subgroup Analysis
- User’s Guide - An Example with Simulated Data: Drug Resistance
- Future Research and Development
- Competing Interests
- Authors’ Contributions
- Acknowledgements
- References

Science is concerned with the explanation of phenomena. This involves establishing causal relationships between variables. However, the underlying causes are not directly observable. To reveal them, one must conduct carefully planned experiments to observe the consequences of causal processes: complicated patterns of independence and association between variables. Although association (*i.e.* correlation) is not a sufficient condition to establish causation, it can be used as a guide for further investigation. Indeed, causation always implies some pattern of association (Shipley 2000). For this reason, the nature and strength of associations have to be thoroughly described and measured. Accordingly, numerous measures of association, such as Pearson’s r and the chi-square, have been described in the literature (Pecina 2008).

Association measures can be local or global (Van de Cruys 2011). Local association measures quantify the association for specific values of random variables. In the case of a contingency table, they yield one value for each cell. An example is chi-square residuals that are computed when constructing a chi-square test. On the other hand, global association measures yield a single value used to summarize the association for all values taken by random variables. An example is the chi-square statistic, the sum of squared chi-square residuals (Sheskin 2007).

Most often, we are only concerned with the global association and overlook local association. For example, analysis of chi-square residuals is uncommon practice compared to the chi-square independence test. Nonetheless, a significant global association can hide a non-significant local association, and a non-significant global association can hide a significant local association (Anselin 1995). Accordingly, analysis of association should not limit itself to the global perspective. Indeed, the association between two variables can depend on their values. For example, in threshold mechanisms, variables are only associated with each other above a certain critical value. In this case, local association measures allow pinpointing the values for which variables are associated. Moreover, the existence of an association between two variables may depend on the value of a third variable. For example, the effect of a drug will depend on the patient’s sensitivity to the drug. The local association between drug intake and recovery will not be the same for patients that are sensitive to the drug as for those that are resistant to it. They form two different local association subgroups. Comparison of these subgroups in terms of other variables may help explain their differences. We will refer to this procedure as local association subgroup analysis.

Local association measures are uncommonly used in the scientific literature, in terms of the number of articles. Nonetheless, the diversity of fields interested in these measures demonstrates their significance. Indeed, applications are found in computational linguistics (Van de Cruys 2011), image processing (Isola et al. 2014), health sciences such as cardiology (Sapoznikov, Dranitzki Elhalel, and Rubinger 2013) and geography (Anselin 1995). Notwithstanding, presently available software only allows computation of global association measures. This is why we have developed the `zebu` R package described in this paper. It is provided under a GPL-3 license; source code is available at http://github.com/olivmrtn/zebu.

The rest of the paper is organized as follows. We first give the reader the necessary intuition and mathematical background. This leads to the description of Ducher’s Z (M. Ducher et al. 1994) and pointwise mutual information (Van de Cruys 2011). We introduce multivariate forms of these measures and suggest a normalization scheme for pointwise mutual information. We then present local association subgroup analysis. Subsequently, we illustrate the usage of local association measures and local subgroup analysis using the `zebu` R package. This is undertaken using an example with simulated data about drug resistance. The paper ends with remarks on future development and research.

Throughout the paper, we will suppose that all random variables are discrete and write them in capital letters, such as \(A\) and \(B\). Possible values of these random variables (*i.e.* events) will be written in lowercase letters, such as \(a\) and \(b\).

One way to think about statistical association is as events co-occurring. For example, if event \(a\) always occurs with event \(b\), then these events are said to be associated. An intuitive measure of association could be the joint probability \(p(a, b)\): the long-term frequency of the events showing up together. However, this measure fails if \(a\) or \(b\) is a rare event. Indeed, a joint probability can never exceed the probability of its rarest event: \(p(a, b) \leq \min(p(a), p(b))\). As a consequence, it is necessary to compare the *observed* probability \(p(a, b)\) to the *expected* probability under the hypothesis that the variables are independent. This expected probability is the product of the marginal probabilities of the events: \(p(a) p(b)\). Independence is then defined by the mathematical relation \(p(a, b) = p(a) p(b)\): local association measures are equal to zero.

Independence implies that knowing one or more variables does not give us any information about the others. This is exactly what we are not interested in. It is, however, possible to define two interesting cases where the former equality does not hold: co-occurrence and mutual exclusivity. Co-occurrence, or positive association, is defined as events showing up more often than expected, \(p(a, b) > p(a) p(b)\): local association measures are positive. Mutual exclusivity, or negative association, is defined as events showing up less often than expected, \(p(a, b) < p(a) p(b)\): local association measures are negative.

It should, however, be noted that statistical independence is not the only manner to construct an association measure. Other possibilities are based on the proportion of explained variance, such as Pearson’s r. These measures are parametric and suppose linear, or at least monotone, relationships between variables. Although intuitive and convenient, this assumption is not always justified. Measures based on statistical independence provide a non-parametric alternative that can detect non-linear relationships.

For each combination of events \(a\) and \(b\), their local association can be estimated. This is accomplished by comparing the observed probability of \(a\) and \(b\) to the expected probability. If these probabilities are equal, then events \(a\) and \(b\) are independent. If not, these events are associated; the sign of the measure indicates the direction of the relationship, and the absolute value indicates its strength. There are different ways to compare observed and expected probabilities, for example, by subtraction or by division. Below, we define the difference, noted \(dif\), and the pointwise mutual information, noted \(pmi\) (Van de Cruys 2011). To simplify notation, and to show the similarities between the two measures, we define \(h(a) = - \log p(a)\) as the self-information of \(a\).

\[ \begin{aligned} dif(a, b) & = p(a, b) - p(a) p(b) \\ pmi(a, b) & = \log \frac{p(a, b)} {p(a) p(b)} = - (h(a, b) - h(a) - h(b)) \end{aligned} \]
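To make the two measures concrete, here is a minimal numerical sketch. We use Python purely to spell out the arithmetic (the `zebu` package itself is written in R), and the joint probability table is invented for illustration.

```python
import math

# Invented joint probability table for two binary variables A and B
p_ab = {("a0", "b0"): 0.4, ("a0", "b1"): 0.1,
        ("a1", "b0"): 0.1, ("a1", "b1"): 0.4}

# Marginal probabilities obtained by summing over the other variable
p_a, p_b = {}, {}
for (a, b), p in p_ab.items():
    p_a[a] = p_a.get(a, 0.0) + p
    p_b[b] = p_b.get(b, 0.0) + p

def dif(a, b):
    """Observed minus expected joint probability."""
    return p_ab[(a, b)] - p_a[a] * p_b[b]

def pmi(a, b):
    """Pointwise mutual information: log of observed over expected."""
    return math.log(p_ab[(a, b)] / (p_a[a] * p_b[b]))

print(dif("a0", "b0"))  # 0.4 - 0.5 * 0.5 = 0.15: positive association
print(pmi("a0", "b1"))  # log(0.1 / 0.25) < 0: negative association
```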

The bounds of these measures depend on the marginal probabilities \(p(a)\) and \(p(b)\). In particular, they depend on the minimal marginal probability \(\min(p(a), p(b))\) because \(p(a, b) \leq \min(p(a), p(b))\). This makes it difficult to compare values for different combinations of events. In that respect, it is desirable to normalize these measures so that they only take values between -1 and 1 inclusive. This can be achieved by dividing the non-normalized values by their minimal or maximal values. Let us first identify the minimal and maximal values of \(dif\) and \(pmi\).

The bounds of the observed probability \(p(a, b)\) are \([0, \min(p(a), p(b))]\). This means that \(dif\) and \(pmi\) are minimized when \(p(a, b) = 0\).

\[ \begin{aligned} \min dif(a, b) & = - p(a) p(b) \\ \min pmi(a, b) & = \lim_{p(a, b) \to 0} pmi(a, b) = -\infty \end{aligned} \]

Similarly, \(dif\) and \(pmi\) are maximized when \(p(a, b) = \min(p(a), p(b))\).

\[ \begin{aligned} \max dif(a, b) & = \min(p(a), p(b)) - p(a) p(b) \\ \max pmi(a, b) & = \log \frac{\min(p(a), p(b))}{p(a) p(b)} = - (\min(h(a), h(b)) - h(a) - h(b)) \end{aligned} \]

By dividing by maximal and minimal values, we can normalize \(dif\). We will refer to the normalized \(dif\) by the capital \(Z\) because it corresponds to Ducher’s \(Z\) (M. Ducher et al. 1994).

\[ Z(a, b) = \begin{cases} \frac{ dif(a, b) }{ \max dif(a, b) } = \frac{ p(a, b) - p(a) p(b) }{ \min(p(a), p(b)) - p(a) p(b) } & dif(a, b) > 0 \\ \\ \frac{ dif(a, b) }{ - \min dif(a, b) } = \frac{ p(a, b) - p(a) p(b) }{ p(a) p(b) } & dif(a, b) < 0 \\ \\ 0 & dif(a, b) = 0 \end{cases} \]

A normalization scheme for \(pmi\) has already been suggested by Bouma (2009). Nonetheless, this scheme does not hold for more than two variables. Accordingly, we suggest using the normalization scheme used for Ducher’s Z so that it holds in the multivariate case. Normalization of the negative case of \(pmi\) is more subtle because \(pmi(a, b)\) tends to \(-\infty\) when \(p(a, b)\) tends to 0. Nonetheless, dividing \(pmi(a, b)\) by \(h(a, b)\) solves this problem by making \(npmi(a, b)\) tend to -1 when \(p(a, b)\) tends to 0.

\[ npmi(a, b) = \begin{cases} \frac{pmi(a, b)}{\max pmi(a, b)} = \frac{ h(a, b) - h(a) - h(b) }{ \min(h(a), h(b)) - h(a) - h(b) } & pmi(a, b) > 0 \\ \\ \frac{ pmi(a, b) }{ h(a, b) } = \frac{ h(a) + h(b) - h(a, b) }{ h(a, b) } & pmi(a, b) < 0 \\ \\ 0 & pmi(a, b) = 0 \end{cases} \]
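The two normalization schemes can be checked numerically. The sketch below (in Python, with invented probabilities; the package computes these measures in R) verifies that both normalized measures reach 1 under perfect co-occurrence and -1 under perfect mutual exclusivity.

```python
import math

def duchers_z(p_ab, p_a, p_b):
    """Ducher's Z: dif divided by its maximum (positive case)
    or by minus its minimum (negative case)."""
    d = p_ab - p_a * p_b
    if d > 0:
        return d / (min(p_a, p_b) - p_a * p_b)
    if d < 0:
        return d / (p_a * p_b)
    return 0.0

def npmi(p_ab, p_a, p_b):
    """Normalized pointwise mutual information."""
    if p_ab == 0:
        return -1.0  # limit of pmi / h(a, b) as p(a, b) -> 0
    h_ab, h_a, h_b = -math.log(p_ab), -math.log(p_a), -math.log(p_b)
    val = h_a + h_b - h_ab  # this is pmi(a, b)
    if val > 0:
        return (h_ab - h_a - h_b) / (min(h_a, h_b) - h_a - h_b)
    if val < 0:
        return val / h_ab
    return 0.0

# Perfect co-occurrence: both measures reach their upper bound of 1
print(duchers_z(0.5, 0.5, 0.5), npmi(0.5, 0.5, 0.5))  # 1.0 1.0
# Perfect mutual exclusivity: both reach their lower bound of -1
print(duchers_z(0.0, 0.5, 0.5), npmi(0.0, 0.5, 0.5))  # -1.0 -1.0
```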

In the `zebu` package, it is possible to estimate Ducher’s \(Z\), \(pmi\) and \(npmi\) using the `lassie` function that returns a `lassie` S3 object.

Global association measures yield a single value used to summarize the association for all values taken by the random variables. For example, mutual information is computed as the sum, over all events, of their observed probability times their pointwise mutual information. All global association measures in `zebu` are defined likewise.

\[ \begin{aligned} gZ(A, B) &= \sum_{a, b} p(a, b) Z(a, b) \\ MI(A, B) &= \sum_{a, b} p(a, b) pmi(a, b) \\ NMI(A, B) &= \sum_{a, b} p(a, b) npmi(a, b) \\ \end{aligned} \]
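As a sketch of this weighting (in Python, with an invented probability table), mutual information is assembled exactly as the formula states: a probability-weighted sum of the local \(pmi\) values.

```python
import math

# Invented joint table whose marginals are p(a) = p(b) = 0.5
p_ab = {("a0", "b0"): 0.4, ("a0", "b1"): 0.1,
        ("a1", "b0"): 0.1, ("a1", "b1"): 0.4}
p_a = {"a0": 0.5, "a1": 0.5}
p_b = {"b0": 0.5, "b1": 0.5}

def pmi(a, b):
    """Local measure: log of observed over expected probability."""
    return math.log(p_ab[(a, b)] / (p_a[a] * p_b[b]))

# Global measure: probability-weighted sum of the local pmi values
MI = sum(p * pmi(a, b) for (a, b), p in p_ab.items())
print(round(MI, 4))  # positive: the table deviates from independence
```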

It is important to distinguish the strength of association from its statistical significance. Indeed, a strong association can be non-significant (*e.g.* Beer–Lambert attenuation coefficient and concentration of material with a small sample size) and a weak association can be significant (*e.g.* epidemiological risk factor with a large sample size). Significance can be assessed using p-values estimated from the theoretical null distribution of the local association measure or by resampling techniques (Sheskin 2007).

In the `zebu` package, p-values are estimated by a permutation test. This can be done using the `permtest` function that returns a `lassie` and `permtest` S3 object.
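The principle of such a permutation test can be sketched as follows. This is a toy Python illustration, not the package’s implementation: the statistic, data and seed are invented, and one variable is shuffled to simulate the null hypothesis of independence.

```python
import random

def perm_pvalue(a, b, stat, n_perm=1000, seed=0):
    """Permutation p-value: shuffle one variable to break the
    association, then count permuted statistics at least as
    extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(stat(a, b))
    b = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(b)
        if abs(stat(a, b)) >= observed:
            hits += 1
    return hits / n_perm

def dif11(a, b):
    """Observed minus expected frequency of the event pair (1, 1)."""
    n = len(a)
    p_ab = sum(x == 1 and y == 1 for x, y in zip(a, b)) / n
    return p_ab - (sum(a) / n) * (sum(b) / n)

a = [1, 1, 1, 1, 0, 0, 0, 0]
b = [1, 1, 1, 1, 0, 0, 0, 0]  # perfectly co-occurring with a
print(perm_pvalue(a, b, dif11))  # small: association unlikely by chance
```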

To derive multivariate forms of these local association measures, we assume that events are mutually independent. This means that for \(n\) random variables \(X_1, \ldots, X_n\), independence is defined by \(p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i)\). The same reasoning as in the bivariate case then yields the following formulas.

\[ Z(x_1, \ldots, x_n) = \begin{cases} \frac{ p(x_1, \ldots, x_n) - \prod_{i=1}^{n} p(x_i) }{ \min(p(x_1), \ldots, p(x_n)) - \prod_{i=1}^{n} p(x_i) } & dif(x_1, \ldots, x_n) > 0 \\ \\ \frac{ p(x_1, \ldots, x_n) - \prod_{i=1}^{n} p(x_i) }{ \prod_{i=1}^{n} p(x_i) } & dif(x_1, \ldots, x_n) < 0 \\ \\ 0 & dif(x_1, \ldots, x_n) = 0 \end{cases} \]

\[ npmi(x_1, \ldots, x_n) = \begin{cases} \frac{ h(x_1, \ldots, x_n) - \sum_{i=1}^{n} h(x_i) }{ \min(h(x_1), \ldots, h(x_n)) - \sum_{i=1}^{n} h(x_i) } & pmi(x_1, \ldots, x_n) > 0 \\ \\ \frac{ \sum_{i=1}^{n} h(x_i) - h(x_1, \ldots, x_n) }{ h(x_1, \ldots, x_n) } & pmi(x_1, \ldots, x_n) < 0 \\ \\ 0 & pmi(x_1, \ldots, x_n) = 0 \end{cases} \]

These multivariate association measures may help identify complex association relationships that cannot be identified only with bivariate association measures. For example, in the XOR gate, the output of the gate is not associated with any of the two inputs individually (Jakulin and Bratko 2003). The association is only revealed when the two inputs and the output are taken together.
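The XOR example can be verified numerically with a multivariate Ducher’s \(Z\) (a Python sketch under the mutual-independence null used above; the probabilities are those of a fair XOR gate).

```python
def z_multi(p_joint, marginals):
    """Multivariate Ducher's Z from a joint probability and the
    marginal probabilities of the individual events."""
    prod = 1.0
    for p in marginals:
        prod *= p
    d = p_joint - prod
    if d > 0:
        return d / (min(marginals) - prod)
    if d < 0:
        return d / prod
    return 0.0

# A and B are fair independent bits and C = A XOR B, so every
# pair of variables is independent: p(a, c) = 0.25 = p(a) * p(c)
print(z_multi(0.25, [0.5, 0.5]))       # 0.0: no bivariate association
# But the consistent triple (a=0, b=0, c=0) is twice as likely
# as expected under mutual independence: p = 0.25 vs 0.125
print(z_multi(0.25, [0.5, 0.5, 0.5]))  # 0.125 / 0.375 = 1/3
```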

In order to describe this methodology, an illustrative example concerning salt consumption and blood pressure will be discussed. This example is largely inspired by M. Ducher et al. (2003). Blood pressure is thought to be linearly related to salt consumption. However, the evidence supporting this association is largely contradictory (Freedman and Petitti 2001). This suggests that a global relationship may not be applicable to all individuals, but rather only to a subgroup of salt-sensitive individuals. These are opposed to salt-resistant individuals, for whom no relationship can be established (Kaplan 2010). Global association measures may not be sensitive enough because salt-resistant individuals “dilute” the association that exists for salt-sensitive individuals.

Local association measures allow quantifying association for specific values of salt consumption and blood pressure. Accordingly, individuals can be classified into three corresponding subgroups: independent, positive and negative local association. The positive subgroup corresponds to the subset of values that are well explained by the global association of the variables (e.g. low blood pressure and low salt consumption, or high blood pressure and high salt consumption). This subgroup will thus be composed of individuals statistically sensitive to salt. The negative subgroup corresponds to the subset of values poorly explained by the global relationship (e.g. low blood pressure and high salt consumption). This subgroup will thus be composed of individuals statistically resistant to salt. Finally, the independent subgroup corresponds to values for which the variables are independent. Once these local subgroups are formed, the global and local association between these subgroups and the values of other variables can be used to determine what distinguishes salt-sensitive from salt-resistant individuals. For example, one may find that young individuals are more resistant to salt (*i.e.* negative or independent subgroup associated with young age) than older individuals (*i.e.* positive subgroup associated with old age) (Weinberger 1996).

The goal of local association subgroup analysis is to identify values \(c\) of a random variable \(C\) on which the association between random variables \(A\) and \(B\) depends. For this, we compute the local association \(L\) for all values of variables \(A\) and \(B\) using `lassie`. It is then possible to define three subgroups according to the value taken by \(L(a, b)\). The definition of these subgroups can also take into account p-values (as estimated by `permtest`) to distinguish significantly associated values from independent values. In other words, this corresponds to merging variables \(A\) and \(B\) into a new variable \(S\) as follows:

\[ \begin{aligned} Positive&: \{(a, b) \; |\ L(a, b) > 0 \} \\ Independent&: \{(a, b) \; |\ L(a, b) = 0 \} \\ Negative&: \{(a, b) \; |\ L(a, b) < 0 \} \\ \end{aligned} \]
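In code, forming \(S\) amounts to mapping each pair of events to the sign of its local association. The following Python sketch uses an invented threshold and invented local association values; in the package, this step is performed by the `subgroups` function.

```python
def subgroup(L, threshold=0.05):
    """Map a local association value L(a, b) to a subgroup label."""
    if L > threshold:
        return "Positive"
    if L < -threshold:
        return "Negative"
    return "Independent"

# Invented local association values for pairs of events
local = {("drug", "recovered"): 0.78,
         ("drug", "not recovered"): -0.03,
         ("placebo", "recovered"): -0.78}

S = {pair: subgroup(L) for pair, L in local.items()}
print(S[("drug", "not recovered")])  # Independent: |L| below threshold
```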

The local association between the subgroups \(S\) and another variable \(C\) can then be estimated. This allows us to identify values \(c\) of \(C\) that determine the association between \(A\) and \(B\). In the `zebu` package, this procedure can be undertaken using the `subgroups` function that returns a `lassie` S3 object. Accordingly, the significance of association can be assessed using the `permtest` function.

The relevance of local association measures and the usage of the `zebu` package will be illustrated with a simulated dataset of a clinical trial in which patient recovery depends on drug intake and on resistance to the drug. Briefly, the dataset is composed of 100 sick patients that are randomly allocated to the placebo or the drug group. These patients are characterized by a resistance to the drug modeled by a binary variable; only half of the patients are sensitive. The health status of patients is monitored through a biomarker that takes continuous values between 0 and 1. Patients with levels above 0.7 are considered as having recovered. Pretreatment levels are modeled by a normal distribution centered around 0.3. The drug has a mean positive effect of 0.6 on biomarker levels for drug-sensitive patients and no effect on resistant patients. The placebo has a mean positive effect of 0.3. For more details about the data simulation, see the `trial.R` file in the `data-raw/` folder of the R package.

The first step is to load the `zebu` R package and the `trial` dataset as follows.

```
set.seed(63) # Set seed for reproducibility
library(zebu) # Load zebu
data(trial) # Load trial dataset
head(trial) # Show head of trial dataset
```

```
drug resistance prebiom postbiom
1 placebo sensitive 0.4273682 0.7497984
2 drug resistant 0.2395317 0.5099096
3 drug sensitive 0.2551785 0.8439521
4 drug sensitive 0.3165800 0.9934810
5 placebo sensitive 0.2989971 0.5741008
6 placebo resistant 0.3563302 0.6332050
```

The local (and global) association between drug intake and patient recovery can be estimated using the `lassie` function. This function takes at least one argument: a `data.frame`, here the `trial` dataset.

Columns are selected using the `select` argument (column names or numbers). Variables are assumed to be categorical; continuous variables have to be specified using the `continuous` argument and the number of discretization bins with the `breaks` argument (as in the `cut` function). The local association measure that we use here is Ducher’s Z, as specified by setting the `measure` argument equal to `"z"`.
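For intuition, the half-open bins produced by the `breaks` argument can be sketched as follows (a Python illustration of `cut`-style intervals; the package performs discretization in R).

```python
def discretize(x, breaks):
    """Return the index of the half-open bin (breaks[i], breaks[i+1]]
    containing x, mimicking cut-style intervals."""
    for i in range(len(breaks) - 1):
        if breaks[i] < x <= breaks[i + 1]:
            return i
    raise ValueError("value outside the break range")

breaks = [0, 0.7, 1]             # recovery cutoff at 0.7, as in the trial data
print(discretize(0.43, breaks))  # 0: biomarker in [0, 0.7], not recovered
print(discretize(0.84, breaks))  # 1: biomarker in (0.7, 1], recovered
```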

```
las <- lassie(trial,
              select = c("drug", "postbiom"),
              continuous = "postbiom",
              breaks = c(0, 0.7, 1),
              measure = "z")
```

The `permtest` function assesses the significance of local (and global) association using a permutation test. The number of iterations is specified by `nb` and the adjustment method of p-values for multiple comparisons by `p_adjust` (as in the `p.adjust` function). Parallelization of permutation iterations is made possible thanks to the `foreach` and `doSNOW` packages and is deactivated by default. The number of CPUs can be set using the `zebu.ncpus` option. For example, to set the number of processes to 2: `options(zebu.ncpus = 2)`. A progress bar is also available to make computations seem shorter than they actually are.

```
las <- permtest(las,
                nb = 1000,
                p_adjust = "BH",
                parallel = FALSE,
                progress_bar = FALSE)
```

The `lassie` and `permtest` functions both return a `lassie` S3 object; `permtest` additionally returns a `permtest` S3 object. `lassie` objects can be visualized using the `plot` and `print` methods. Moreover, results can be saved in CSV format using `write.lassie`.

`print(las)`

```
Measure: Ducher's Z
Global: 0.471400966183575 (p-value=0)
postbiom
drug [0,0.7] (0.7,1]
drug -0.3043478 0.7777778
placebo 0.7777778 -0.7777778
```

`plot(las)`

The global association between drug intake and patient recovery is strong and statistically significant (\(p < \frac{1}{1000}\)). This would normally be interpreted as a positive effect of the drug on patient recovery. However, our simulation supposes that only 50% of patients are sensitive to the drug. The former conclusion would thus be wrong in 50% of cases. Inspection of local association is of help here.

There is no local association between the events ‘drug’ and ‘not recovering’. In plain English, this means that certain patients are insensitive (resistant) to the drug. Comparison of these patients with patients that exhibit positive (or negative) association may help identify differences between these two subgroups and explain why they are resistant to the drug. This can be done using local association subgroup analysis. Finally, note here that a non-significant local association can hide a significant global association.

Local association subgroup analysis can be called using the `subgroups` function. Here we wish to compare the local association between drug intake and patient recovery according to the values of a third variable, patient drug resistance. `subgroups` takes at least two arguments: a `lassie` object `las` (association between drug intake and patient recovery) and a `data.frame` `x`.

The same optional arguments as in the `lassie` function, `select`, `continuous` and `breaks`, can be specified. These refer to the `x` dataset. Here, we only select the variable named `resistance`. This could, for example, refer to the gene of the drug target or of some drug efflux protein.

The optional arguments `thresholds`, `significance` and `alpha` specify how local association subgroups should be constructed. `thresholds` specifies the local association value thresholds for subgroups. `significance` specifies if p-values should be taken into account and `alpha` the corresponding p-value threshold (alpha error).

```
sub <- subgroups(las = las,
                 x = trial,
                 select = "resistance",
                 thresholds = c(-0.05, 0.05),
                 significance = TRUE,
                 alpha = 0.01)
```

Significance of the local (and global) association between subgroups and drug resistance can be assessed using `permtest`:

`sub <- permtest(sub)`

The `subgroups` function also returns a `lassie` S3 object with the same methods of interest: `print`, `plot` and `write.lassie`.

`print(sub)`

```
Measure: Ducher's Z
Global: 0.495465339351549 (p-value=0)
drug_postbiom
resistance Negative Positive
resistant 0.8423956 -0.2762988
sensitive -0.8423956 0.8423956
```

`plot(sub)`

The global association between local association subgroups and drug resistance is strong and statistically significant. This indicates that the resistance variable has an influence on the association between drug intake and patient recovery. The local association indicates that drug-sensitive patients are over-represented in the positive local association subgroup. This shows that these patients exhibit a positive correlation between drug intake and recovery. Moreover, drug-resistant patients are over-represented in the independent local association subgroup. This shows that there is no correlation between drug intake and recovery for these patients. To state this in a trivial manner, only drug-sensitive patients are sensitive to the drug.

The number of variables that can be handled by the `zebu` package is not limited. Hereunder, for illustration, we estimate the trivariate association between drug intake, recovery, and resistance. The `permtest` function gives control over how to permute the dataset through the `group` argument. `group` is a list of `character` vectors corresponding to `colnames`. Permutations are performed per group, meaning that the association structure is not broken within groups but only between them. In our case, we are studying the relation of `postbiom` and `resistance` with `drug`, and only want to break the association structure with the `drug` response, but not between `postbiom` and `resistance`.

In this case, we obtain a multidimensional local association `array`. Because of this, results cannot be plotted as a tile plot; the `plot` method is not available. The `print` method allows visualizing results by melting the `array` into a `data.frame` sorted by decreasing local association.

```
las2 <- lassie(trial,
               select = c("drug", "postbiom", "resistance"),
               continuous = "postbiom",
               breaks = c(0, 0.7, 1))
las2 <- permtest(las2,
                 group = list("drug", c("postbiom", "resistance")),
                 progress_bar = FALSE)
print(las2)
```

```
Measure: Ducher's Z
Global: 0.275964464430349 (p-value=0.111111111111111)
drug postbiom resistance local obs exp local_p
7 drug (0.7,1] sensitive 0.7958663 0.21 0.05405 0.0000000
1 drug [0,0.7] resistant 0.2062060 0.24 0.18285 0.3333333
6 placebo [0,0.7] sensitive 0.1775434 0.24 0.19035 0.3333333
2 placebo [0,0.7] resistant 0.1755193 0.27 0.21465 0.3333333
8 placebo (0.7,1] sensitive -0.6847912 0.02 0.06345 0.3333333
3 drug (0.7,1] resistant -0.8359311 0.01 0.06095 0.2222222
4 placebo (0.7,1] resistant -0.8602376 0.01 0.07155 0.2222222
5 drug [0,0.7] sensitive -1.0000000 0.00 0.16215 0.2222222
```

The global association is very weak and not statistically significant because of the absence of a relationship between resistance and the other variables. Nonetheless, certain events are locally associated. For example, being in the test group, having recovered and being sensitive to the drug are positively associated events. This corresponds to the patients that have reacted to the drug. Note here that a non-significant global association can hide a significant local association.

Local association measures emerged from empirical research. Although they have proven their usefulness in diverse applications, theoretical studies of their mathematical properties are sparse. For example, only Monte Carlo simulations of the behavior of Ducher’s Z are available (M. Ducher et al. 1994). A more theoretical approach to these measures could be of interest, for example, by determining their theoretical null distributions. In addition, we have assumed mutual independence of events for the multivariate association measures. This assumption may be too stringent for certain variables, and usage of other independence models, such as conditional independence, may prove to be worthwhile.

Improvements to the `zebu` R package are also possible. For example, in `zebu`, discretization is a necessary step for studying continuous variables. We have restricted ourselves to very simple discretization methods: equal-width and user-defined. Other discretization algorithms exist (R. Dash, Paramguru, and Dash 2011) and may be better adapted to the computation of association measures. Moreover, kernel methods could also be used to better handle continuous variables. Secondly, estimation of probabilities is done by the frequentist maximum-likelihood procedure, which requires sufficiently large datasets. Unfortunately, in certain fields such as the health sciences, datasets are sparse. Bayesian estimation methods have been shown to be more robust to small sample sizes by not relying on asymptotic assumptions and by allowing integration of prior knowledge (Wilkinson 2007). Such an implementation may also prove to be of interest. Finally, the `permtest` function in `zebu` is based on an iterative procedure that is slow in R. To speed this up, writing the function in C and calling it from R could be a reliable solution.

The authors declare that they have no competing interests.