This vignette illustrates the basical use of the `PLNLDA`

function and the methods accompaning the R6 Classes `PLNLDA`

and `PLNLDAfit`

.

The packages required for the analysis are **PLNmodels** plus some others for data manipulation and representation:

We illustrate our point with the trichoptera data set, a full description of which can be found in the corresponding vignette. Data preparation is also detailed in the specific vignette.

The `trichoptera`

data frame stores a matrix of counts (`trichoptera$Abundance`

), a matrix of offsets (`trichoptera$Offset`

) and some vectors of covariates (`trichoptera$Wind`

, `trichoptera$Temperature`

, etc.) In the following, we’re particularly interested in the `trichoptera$Group`

**discrete** covariate which corresponds to disjoint time spans during which the catching took place. The correspondance between group label and time spans is:

Label | Number.of.Consecutive.Nights | Date |
---|---|---|

1 | 12 | June 59 |

2 | 5 | June 59 |

3 | 5 | June 59 |

4 | 4 | June 59 |

5 | 4 | July 59 |

6 | 1 | June 59 |

7 | 3 | June 60 |

8 | 4 | June 60 |

9 | 5 | June 60 |

10 | 4 | June 60 |

11 | 1 | June 60 |

12 | 1 | July 60 |

In the vein of Fisher (1936) and Rao (1948), we introduce a multi-class LDA model for multivariate count data which is a variant of the Poisson Lognormal model of Aitchison and Ho (1989) (see the PLN vignette as a reminder). Indeed, it can viewed as a PLN model with a discrete group structure in the latent gaussian space.

This PLN-LDA model can be written in a hierachical framework where a sample of \(p\)-dimensional observation vectors \(\mathbf{Y}_i\) is related to some \(p\)-dimensional vectors of latent variables \(\mathbf{Z}_i\) and a discrete structure with \(K\) groups in the following way: \[\begin{equation} \begin{array}{rcl} \text{group structure } & \mathbf{\mu}_i = \mu_{g_i} & g_i \in \{1, \dots, K\} \\ \text{latent space } & \mathbf{Z}_i \quad \text{indep.} & \mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_i, \boldsymbol{\Sigma}) & \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right) \end{array} \end{equation}\] where \(g_i\) denotes the group sample \(i\) belongs to.

The different parameters \({\boldsymbol\mu}_k \in\mathbb{R}^p\) corresponds to the group-specific main effects and the variance matrix \(\boldsymbol{\Sigma}\) is shared among groups. An equivalent way of writting this model is the following: \[\begin{equation} \begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_i,\boldsymbol\Sigma) & \boldsymbol{\mu}_i = \mathbf{g}_i^\top \mathbf{M} \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right), \end{array} \end{equation}\] where, with a slight abuse of notation, \(\mathbf{g}_i\) is a group-indicator vector of length \(K\) (\(g_{ik} = 1 \Leftrightarrow g_i = k\)) and \(\mathbf{M} = [\boldsymbol{\mu}_1^\top, \dots, \boldsymbol{\mu}_K^\top]^\top\) is a \(K \times p\) matrix collecting the group-specific main effects.

Just like PLN, PLN-LDA generalizes to a formulation close to a multivariate generalized linear model where the main effect is due to a linear combination of the discrete group structure, \(d\) covariates \(\mathbf{x}_i\) and a vector \(\mathbf{o}_i\) of \(p\) offsets in sample \(i\). The latent layer then reads \[\begin{equation} \mathbf{Z}_i \sim \mathcal{N}({\mathbf{o}_i + \mathbf{g}_i^\top \mathbf{M} + \mathbf{x}_i^\top\boldsymbol\Theta},\boldsymbol\Sigma) \end{equation}\] where \(\boldsymbol\Theta\) is a \(d\times p\) matrix of regression parameters.

Given:

- a new observation \(\mathbf{Y}\) with associated offset \(\mathbf{o}\) and covariates \(\mathbf{x}\)
- a model with estimated parameters \(\hat{\boldsymbol{\Sigma}}\), \(\hat{\boldsymbol{\Theta}}\), \(\hat{\mathbf{M}}\) and group counts \((n_1, \dots, n_K)\)

We can predict the observation’s group using Bayes rule as follows: for \(k \in {1, \dots, K}\), compute \[\begin{equation} \begin{aligned} f_k(\mathbf{Y}) & = p(\mathbf{Y} | \mathbf{g} = k, \mathbf{o}, \mathbf{x}, \hat{\boldsymbol{\Theta}}, \hat{\boldsymbol{\Sigma}}) \\ & = \boldsymbol{\Phi}_{PLN}(\mathbf{Y}; \mathbf{o} + \boldsymbol{\mu}_k + \mathbf{x}^\top \hat{\boldsymbol{\Theta}}, \hat{\boldsymbol{\Sigma}}) \\ p_k & = \frac{n_k}{\sum_{k' = 1}^K n_{k'}} \end{aligned} \end{equation}\] where \(\boldsymbol{\Phi}_{PLN}(\bullet; \boldsymbol{\mu}, \boldsymbol{\Sigma})\) is the density function of a PLN distribution with parameters \((\boldsymbol{\mu}, \boldsymbol{\Sigma})\). \(f_k(\mathbf{Y})\) and \(p_k\) are respectively plug-in estimates of (i) the probability of observing counts \(\mathbf{Y}\) in a sample from group \(k\) and (ii) the probability that a sample originates from group \(k\).

The posterior probability \(\hat{\pi}_k(\mathbf{Y})\) that observation \(\mathbf{Y}\) belongs to group \(k\) and most likely group \(\hat{k}(\mathbf{Y})\) can thus be defined as \[\begin{equation} \begin{aligned} \hat{\pi}_k(\mathbf{Y}) & = \frac{p_k f_k(\mathbf{Y})}{\sum_{k' = 1}^K p_{k'} f_{k'}(\mathbf{Y})} \\ \hat{k}(\mathbf{Y}) & = \underset{k \in \{1, \dots, K\}}{\arg\max} \hat{\pi}_k(\mathbf{Y}) \end{aligned} \end{equation}\]

Classification and prediction are the main objectives in (PLN-)LDA. To reach this goal, we first need to estimate the model parameters. Inference in PLN-LDA focuses on the group-specific main effects \(\mathbf{M}\), the regression parameters \(\boldsymbol\Theta\) and the covariance matrix \(\boldsymbol\Sigma\). Technically speaking, we can treat \(\mathbf{g}_i\) as a discrete covariate and estimate \([\mathbf{M}, \boldsymbol{\Theta}]\) using the same strategy as for the standard **PLN** model. Briefly, we adopt a variational strategy to approximate the log-likelihood function and optimize the consecutive variational surrogate of the log-likelihood with a gradient-ascent-based approach. To this end, we rely on the CCSA algorithm of Svanberg (2002) implemented in the C++ library (Johnson 2011), which we link to the package.

In the package, the PLN-LDA model is adjusted with the function `PLNLDA`

, which we review in this section. This function adjusts the model and stores it in a object of class `PLNLDAfit`

which inherits from the class `PLNfit`

, so we strongly recommend the reader to be somehow comfortable with `PLN`

and `PLNfit`

before using `PLNLDA`

(see the PLN vignette).

We start by adjusting the above model to the Trichoptera data set. We use `Group`

, the catching time spans, as a discrete structure and use log as an offset to capture differences in sampling luck.

The model can be fitted with the function `PLNLDA`

as follows:

```
##
## Performing discriminant Analysis...
## DONE!
```

Note that `PLNLDA`

uses the standard `formula`

interface, like every other model in the **PLNmodels** package.

`PLNLDAfit`

The `myLDA_nocov`

variable is an `R6`

object with class `PLNLDAfit`

, which comes with a couple of methods. The most basic is the `show/print`

method, which sends a brief summary of the estimation process and available methods:

```
## Linear Discriminant Analysis for Poisson Lognormal distribution
## ==================================================================
## nb_param loglik BIC ICL R_squared
## 391 -799.525 -1560.376 -1140.4 0.479
## ==================================================================
## * Useful fields
## $model_par, $latent, $var_par, $optim_par
## $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
## * Useful S3 methods
## print(), coef(), sigma(), vcov(), fitted(), predict(), standard_error()
## * Additional fields for LDA
## $percent_var, $corr_map, $scores, $group_means
## * Additional S3 methods for LDA
## plot.PLNLDAfit(), predict.PLNLDAfit()
```

Comprehensive information about `PLNLDAfit`

is available via `?PLNLDAfit`

.

The user can easily access several fields of the `PLNLDAfit`

object using `S3`

methods, the most interesting ones are

- the \(p \times p\) covariance matrix \(\hat{\boldsymbol{\Sigma}}\):

- the regression coefficient matrix \(\hat{\boldsymbol{\Theta}}\) (in this case
`NULL`

as there are no covariates)

`## NULL`

- the \(p \times K\) matrix of group means \(\mathbf{M}\)

grouping1 | grouping2 | grouping3 | grouping4 | grouping5 | grouping6 | grouping7 | grouping8 | grouping9 | grouping10 | grouping11 | grouping12 |
---|---|---|---|---|---|---|---|---|---|---|---|

-23.93 | -6.82 | -23.50 | -22.33 | -25.28 | -25.86 | -24.17 | -23.32 | -23.39 | -5.65 | -3.47 | -22.31 |

-23.92 | -24.66 | -5.67 | -22.31 | -7.44 | -25.84 | -24.15 | -23.30 | -23.37 | -23.44 | -21.28 | -4.47 |

-2.38 | -3.96 | -2.36 | -2.92 | -6.52 | -5.54 | -2.69 | -2.03 | -2.70 | -2.65 | -21.38 | -3.85 |

-23.93 | -24.68 | -5.69 | -4.52 | -25.22 | -25.90 | -24.18 | -5.51 | -5.59 | -4.91 | -21.33 | -22.35 |

-0.28 | -0.26 | -0.62 | -1.09 | -0.61 | -0.11 | -0.66 | -0.80 | -0.39 | -0.45 | -0.61 | -1.16 |

-4.00 | -3.32 | -2.93 | -4.47 | -6.72 | -8.00 | -2.85 | -3.06 | -5.53 | -3.99 | -21.26 | -22.28 |

The `PLNLDAfit`

class also benefits from two important methods: `plot`

and `predict`

.

`plot`

methodThe `plot`

methods provides easy to interpret graphics which reveals here that the groups are well separated:

By default, `plot`

shows the first 3 axis of the LDA when there are 4 or more groups and uses special representations for the edge cases of 3 or less groups.

`ggplot2`

-savvy users who want to make their own representations can extracts the \(n \times (K-1)\) matrix of sample scores from the `PLNLDAfit`

object …

3280815 | -293274.9 | 165207.0 | -58387.12 | 17377.18 | 8675.17 | -1075.38 | -1315.04 | 107.62 | 12.65 | -116.04 |

3281709 | -289098.1 | 165412.1 | -58751.17 | 17240.77 | 8688.21 | -1089.94 | -1354.69 | 103.19 | 14.04 | -117.00 |

3290969 | -250510.2 | 167254.7 | -61123.51 | 16077.11 | 8539.15 | -897.29 | -1313.51 | 124.62 | 17.71 | -110.79 |

3301250 | -206744.8 | 169374.8 | -63673.19 | 14789.74 | 8307.86 | -591.64 | -1131.26 | 156.45 | 19.56 | -103.00 |

3295223 | -232052.5 | 168176.6 | -62152.42 | 15554.44 | 8411.70 | -733.69 | -1176.34 | 141.19 | 17.88 | -107.12 |

3287620 | -264169.2 | 166623.2 | -60246.57 | 16501.61 | 8567.21 | -932.11 | -1274.87 | 120.00 | 15.52 | -112.70 |

…or the \(p \times (K-1)\) matrix of correlations between scores and (latent) variables

Che | -0.26 | -0.14 | -0.61 | 0.23 | -0.60 | -0.32 | -0.15 | -0.39 | 0.84 | -0.48 | 0.04 |

Hyc | -0.03 | 0.51 | 0.13 | 0.39 | -0.01 | -0.49 | 0.64 | 0.92 | -0.09 | -0.17 | 0.59 |

Hym | 0.08 | 0.15 | 0.22 | -0.77 | 0.33 | -0.13 | -0.41 | -0.02 | -0.32 | 0.02 | -0.61 |

Hys | -0.54 | -0.12 | 0.20 | -0.07 | 0.27 | -0.45 | -0.43 | 0.00 | -0.30 | -0.03 | 0.28 |

Psy | 0.24 | -0.32 | -0.19 | -0.26 | -0.40 | 0.24 | -0.16 | -0.04 | 0.50 | -0.60 | -0.35 |

Aga | 0.10 | 0.11 | 0.24 | -0.89 | 0.05 | -0.20 | -0.20 | -0.17 | -0.10 | -0.21 | -0.51 |

`predict`

methodThe `predict`

method has a slightly different behavior than its siblings in other models of the **PLNmodels**. The goal of `predict`

is to predict the discrete class based on observed *species counts* (rather than predicting counts from known covariates).

By default, the `predict`

use the argument `type = "posterior"`

to output the matrix of log-posterior probabilities \(\log(\hat{\pi})_k\)

```
predicted.class <- predict(myLDA_nocov, newdata = trichoptera)
## equivalent to
## predicted.class <- predict(myLDA_nocov, newdata = trichoptera, type = "posterior")
predicted.class %>% head() %>% knitr::kable(digits = 2)
```

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|

-17.45 | -43.44 | -40.48 | -25.47 | -51.64 | -32.77 | -25.91 | -78.34 | -75.40 | -20.60 | -180.29 | -68.74 |

-9.66 | -17.04 | -53.83 | -21.33 | -66.08 | -27.31 | -14.68 | -15.57 | -15.20 | -14.69 | -106.65 | -60.31 |

-12.93 | -22.56 | -102.94 | -40.00 | -107.76 | -30.78 | -22.53 | -47.05 | -41.44 | -22.94 | -130.78 | -108.40 |

-18.43 | -31.03 | -112.91 | -116.36 | -134.87 | -54.03 | -41.96 | -105.56 | -94.14 | -44.70 | -274.09 | -206.36 |

-13.84 | -20.00 | -49.69 | -61.02 | -69.57 | -38.17 | -26.90 | -48.81 | -42.22 | -25.89 | -172.25 | -106.56 |

-9.24 | -13.48 | -34.17 | -24.40 | -45.11 | -23.73 | -14.61 | -15.91 | -15.29 | -14.32 | -88.89 | -61.65 |

You can also show them in the standard (and human-friendly) \([0, 1]\) scale with `scale = "prob"`

to get the matrix \(\hat{\pi}_k\)

```
predicted.class <- predict(myLDA_nocov, newdata = trichoptera, scale = "prob")
predicted.class %>% head() %>% knitr::kable(digits = 3)
```

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|

0.958 | 0.000 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.041 | 0 | 0 |

0.980 | 0.001 | 0 | 0 | 0 | 0 | 0.006 | 0.003 | 0.004 | 0.006 | 0 | 0 |

1.000 | 0.000 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 |

1.000 | 0.000 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 |

0.998 | 0.002 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 |

0.972 | 0.014 | 0 | 0 | 0 | 0 | 0.005 | 0.001 | 0.002 | 0.006 | 0 | 0 |

Setting `type = "response"`

, we can predict the most likely group \(\hat{k}\) instead:

```
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1 1 1 1 1 1 9 2 1 1 1 1 2 3 2 2 2 3 3 3 9 3 4 1 4 4
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 12 5 4 5 6 7 7 7 8 8 8 8 8 1 9 9 9 10 10 10 10 11 12
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12
```

We can assess that the predictions are quite similar to the real group (*this is not a proper validation of the method as we used dataset for both model fitting and prediction and are thus at risk of overfitting*).

```
## true
## predicted 1 2 3 4 5 6 7 8 9 10 11 12
## 1 10 0 0 1 0 0 0 0 1 0 0 0
## 2 1 4 0 0 0 0 0 0 0 0 0 0
## 3 0 1 4 0 0 0 0 0 0 0 0 0
## 4 0 0 0 3 1 0 0 0 0 0 0 0
## 5 0 0 0 0 2 0 0 0 0 0 0 0
## 6 0 0 0 0 0 1 0 0 0 0 0 0
## 7 0 0 0 0 0 0 3 0 0 0 0 0
## 8 0 0 0 0 0 0 0 4 1 0 0 0
## 9 1 0 1 0 0 0 0 0 3 0 0 0
## 10 0 0 0 0 0 0 0 0 0 4 0 0
## 11 0 0 0 0 0 0 0 0 0 0 1 0
## 12 0 0 0 0 1 0 0 0 0 0 0 1
```

Finally, we can get the coordinates of the new data on the same graph at the original ones with `type = "scores"`

. This is done by averaging the latent positions \(\hat{\mathbf{Z}}_i + \boldsymbol{\mu}_k\) (found when the sample is assumed to come from group \(k\)) and weighting them with the \(\hat{\pi}_k\). Some samples, have compositions that put them very far from their group mean.

```
library(ggplot2)
predicted.scores <- predict(myLDA_nocov, newdata = trichoptera, type = "scores")
colnames(predicted.scores) <- paste0("Axis.", 1:ncol(predicted.scores))
predicted.scores <- as.data.frame(predicted.scores)
predicted.scores$group <- trichoptera$Group
plot(myLDA_nocov, map = "individual", nb_axes = 2, plot = FALSE) +
geom_point(data = predicted.scores,
aes(x = Axis.1, y = Axis.2, color = group, label = NULL))
```

It is possible to correct for other covariates before finding the LDA axes that best separate well the groups. In our case ,we’re going to use `Wind`

as a covariate and illustrate the main differences with before :

```
myLDA_cov <- PLNLDA(Abundance ~ Wind + 0 + offset(log(Offset)),
grouping = Group,
data = trichoptera)
```

```
##
## Performing discriminant Analysis...
## DONE!
```

All fields of our new `PLNLDA`

fit can be accessed as before with similar results. The only important difference is the result of `coef`

: since we included a covariate in the model, `coef`

now returns a 1-column matrix for \(\hat{\boldsymbol{\Theta}}\) instead of `NULL`

Wind | |
---|---|

Che | -0.3331789 |

Hyc | 1.1641220 |

Hym | -0.1930568 |

Hys | -0.4543200 |

Psy | 0.0388209 |

Aga | -0.0611778 |

The group-specific main effects can still be accessed with `$group_means`

grouping1 | grouping2 | grouping3 | grouping4 | grouping5 | grouping6 | grouping7 | grouping8 | grouping9 | grouping10 | grouping11 | grouping12 |
---|---|---|---|---|---|---|---|---|---|---|---|

-14.36 | -9.51 | -17.11 | -13.24 | -18.20 | -31.27 | -21.21 | -19.91 | -13.79 | 1.37 | 6.98 | -16.56 |

-17.14 | -31.50 | -2.64 | -11.03 | -10.41 | -35.05 | -23.88 | -21.20 | -10.20 | -12.16 | -3.35 | 0.21 |

-1.25 | -4.39 | -1.81 | -1.87 | -5.80 | -6.77 | -2.66 | -1.85 | -1.65 | -1.64 | -17.16 | -3.35 |

-17.23 | -23.59 | -3.60 | -0.97 | -19.19 | -27.70 | -21.38 | -5.27 | -1.79 | -1.18 | -12.89 | -17.96 |

-0.05 | -0.42 | -0.47 | -0.85 | -0.48 | -0.43 | -0.61 | -0.72 | -0.06 | -0.15 | -0.18 | -1.03 |

-1.97 | -4.03 | -1.81 | -2.56 | -5.55 | -10.33 | -2.72 | -2.76 | -3.47 | -1.92 | -15.62 | -18.76 |

`plot`

methodOnce again, the `plot`

method is very useful to get a quick look at the results.

`predict`

methodWe can again predict the most likely group for each sample :

and check that we recover the correct class in most cases (again, we used the same dataset for model fitting and group prediction only for ease of exposition):

```
## true
## predicted 1 2 3 4 5 6 7 8 9 10 11 12
## 1 11 0 0 1 0 0 0 0 1 0 0 0
## 2 1 4 0 0 0 0 0 0 0 0 0 0
## 3 0 1 4 0 0 0 0 0 0 0 0 0
## 4 0 0 0 3 1 0 0 0 0 0 0 0
## 5 0 0 0 0 2 0 0 0 0 0 0 0
## 6 0 0 0 0 0 1 0 0 0 0 0 0
## 7 0 0 0 0 0 0 3 0 0 0 0 0
## 8 0 0 0 0 0 0 0 4 1 0 0 0
## 9 0 0 1 0 0 0 0 0 3 0 0 0
## 10 0 0 0 0 0 0 0 0 0 4 0 0
## 11 0 0 0 0 0 0 0 0 0 0 1 0
## 12 0 0 0 0 1 0 0 0 0 0 0 1
```

Aitchison, J., and C.H. Ho. 1989. “The Multivariate Poisson-Log Normal Distribution.” *Biometrika* 76 (4). Oxford University Press: 643–53.

Fisher, R. A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” *Annals of Eugenics* 7 (2): 179–88.

Johnson, Steven G. 2011. *The Nlopt Nonlinear-Optimization Package*. http://ab-initio.mit.edu/nlopt.

Rao, C. Radhakrishna. 1948. “The Utilization of Multiple Measurements in Problems of Biological Classification.” *Journal of the Royal Statistical Society. Series B (Methodological)* 10 (2). [Royal Statistical Society, Wiley]: 159–203.

Svanberg, Krister. 2002. “A Class of Globally Convergent Optimization Methods Based on Conservative Convex Separable Approximations.” *SIAM Journal on Optimization* 12 (2). SIAM: 555–73.