# Supervized classification of multivariate count table with the Poisson discriminant Analysis

## Preliminaries

This vignette illustrates the basical use of the PLNLDA function and the methods accompaning the R6 Classes PLNLDA and PLNLDAfit.

### Requirements

The packages required for the analysis are PLNmodels plus some others for data manipulation and representation:

library(PLNmodels)

### Data set

We illustrate our point with the trichoptera data set, a full description of which can be found in the corresponding vignette. Data preparation is also detailed in the specific vignette.

data(trichoptera)
trichoptera <- prepare_data(trichoptera$Abundance, trichoptera$Covariate)

The trichoptera data frame stores a matrix of counts (trichoptera$Abundance), a matrix of offsets (trichoptera$Offset) and some vectors of covariates (trichoptera$Wind, trichoptera$Temperature, etc.) In the following, we’re particularly interested in the trichopteraGroup discrete covariate which corresponds to disjoint time spans during which the catching took place. The correspondance between group label and time spans is: Label Number.of.Consecutive.Nights Date 1 12 June 59 2 5 June 59 3 5 June 59 4 4 June 59 5 4 July 59 6 1 June 59 7 3 June 60 8 4 June 60 9 5 June 60 10 4 June 60 11 1 June 60 12 1 July 60 ### Mathematical background In the vein of Fisher (1936) and Rao (1948), we introduce a multi-class LDA model for multivariate count data which is a variant of the Poisson Lognormal model of Aitchison and Ho (1989) (see the PLN vignette as a reminder). Indeed, it can viewed as a PLN model with a discrete group structure in the latent gaussian space. This PLN-LDA model can be written in a hierachical framework where a sample of $$p$$-dimensional observation vectors $$\mathbf{Y}_i$$ is related to some $$p$$-dimensional vectors of latent variables $$\mathbf{Z}_i$$ and a discrete structure with $$K$$ groups in the following way: $$$\begin{array}{rcl} \text{group structure } & \mathbf{\mu}_i = \mu_{g_i} & g_i \in \{1, \dots, K\} \\ \text{latent space } & \mathbf{Z}_i \quad \text{indep.} & \mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_i, \boldsymbol{\Sigma}) & \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right) \end{array}$$$ where $$g_i$$ denotes the group sample $$i$$ belongs to. The different parameters $${\boldsymbol\mu}_k \in\mathbb{R}^p$$ corresponds to the group-specific main effects and the variance matrix $$\boldsymbol{\Sigma}$$ is shared among groups. An equivalent way of writting this model is the following: $$$\begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_i,\boldsymbol\Sigma) & \boldsymbol{\mu}_i = \mathbf{g}_i^\top \mathbf{M} \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right), \end{array}$$$ where, with a slight abuse of notation, $$\mathbf{g}_i$$ is a group-indicator vector of length $$K$$ ($$g_{ik} = 1 \Leftrightarrow g_i = k$$) and $$\mathbf{M} = [\boldsymbol{\mu}_1^\top, \dots, \boldsymbol{\mu}_K^\top]^\top$$ is a $$K \times p$$ matrix collecting the group-specific main effects. #### Covariates and offsets Just like PLN, PLN-LDA generalizes to a formulation close to a multivariate generalized linear model where the main effect is due to a linear combination of the discrete group structure, $$d$$ covariates $$\mathbf{x}_i$$ and a vector $$\mathbf{o}_i$$ of $$p$$ offsets in sample $$i$$. The latent layer then reads $$$\mathbf{Z}_i \sim \mathcal{N}({\mathbf{o}_i + \mathbf{g}_i^\top \mathbf{M} + \mathbf{x}_i^\top\boldsymbol\Theta},\boldsymbol\Sigma)$$$ where $$\boldsymbol\Theta$$ is a $$d\times p$$ matrix of regression parameters. #### Prediction Given: • a new observation $$\mathbf{Y}$$ with associated offset $$\mathbf{o}$$ and covariates $$\mathbf{x}$$ • a model with estimated parameters $$\hat{\boldsymbol{\Sigma}}$$, $$\hat{\boldsymbol{\Theta}}$$, $$\hat{\mathbf{M}}$$ and group counts $$(n_1, \dots, n_K)$$ We can predict the observation’s group using Bayes rule as follows: for $$k \in {1, \dots, K}$$, compute \begin{aligned} f_k(\mathbf{Y}) & = p(\mathbf{Y} | \mathbf{g} = k, \mathbf{o}, \mathbf{x}, \hat{\boldsymbol{\Theta}}, \hat{\boldsymbol{\Sigma}}) \\ & = \boldsymbol{\Phi}_{PLN}(\mathbf{Y}; \mathbf{o} + \boldsymbol{\mu}_k + \mathbf{x}^\top \hat{\boldsymbol{\Theta}}, \hat{\boldsymbol{\Sigma}}) \\ p_k & = \frac{n_k}{\sum_{k' = 1}^K n_{k'}} \end{aligned} where $$\boldsymbol{\Phi}_{PLN}(\bullet; \boldsymbol{\mu}, \boldsymbol{\Sigma})$$ is the density function of a PLN distribution with parameters $$(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$. $$f_k(\mathbf{Y})$$ and $$p_k$$ are respectively plug-in estimates of (i) the probability of observing counts $$\mathbf{Y}$$ in a sample from group $$k$$ and (ii) the probability that a sample originates from group $$k$$. The posterior probability $$\hat{\pi}_k(\mathbf{Y})$$ that observation $$\mathbf{Y}$$ belongs to group $$k$$ and most likely group $$\hat{k}(\mathbf{Y})$$ can thus be defined as \begin{aligned} \hat{\pi}_k(\mathbf{Y}) & = \frac{p_k f_k(\mathbf{Y})}{\sum_{k' = 1}^K p_{k'} f_{k'}(\mathbf{Y})} \\ \hat{k}(\mathbf{Y}) & = \underset{k \in \{1, \dots, K\}}{\arg\max} \hat{\pi}_k(\mathbf{Y}) \end{aligned} #### Optimization by Variational inference Classification and prediction are the main objectives in (PLN-)LDA. To reach this goal, we first need to estimate the model parameters. Inference in PLN-LDA focuses on the group-specific main effects $$\mathbf{M}$$, the regression parameters $$\boldsymbol\Theta$$ and the covariance matrix $$\boldsymbol\Sigma$$. Technically speaking, we can treat $$\mathbf{g}_i$$ as a discrete covariate and estimate $$[\mathbf{M}, \boldsymbol{\Theta}]$$ using the same strategy as for the standard PLN model. Briefly, we adopt a variational strategy to approximate the log-likelihood function and optimize the consecutive variational surrogate of the log-likelihood with a gradient-ascent-based approach. To this end, we rely on the CCSA algorithm of Svanberg (2002) implemented in the C++ library (Johnson 2011), which we link to the package. ## Analysis of trichoptera data with a PLN-LDA model In the package, the PLN-LDA model is adjusted with the function PLNLDA, which we review in this section. This function adjusts the model and stores it in a object of class PLNLDAfit which inherits from the class PLNfit, so we strongly recommend the reader to be somehow comfortable with PLN and PLNfit before using PLNLDA (see the PLN vignette). ### A model with main effects and no covariates We start by adjusting the above model to the Trichoptera data set. We use Group, the catching time spans, as a discrete structure and use log as an offset to capture differences in sampling luck. The model can be fitted with the function PLNLDA as follows: myLDA_nocov <- PLNLDA(Abundance ~ 0 + offset(log(Offset)), grouping = Group, data = trichoptera) ## ## Performing discriminant Analysis... ## DONE! Note that PLNLDA uses the standard formula interface, like every other model in the PLNmodels package. #### Structure of PLNLDAfit The myLDA_nocov variable is an R6 object with class PLNLDAfit, which comes with a couple of methods. The most basic is the show/print method, which sends a brief summary of the estimation process and available methods: myLDA_nocov ## Linear Discriminant Analysis for Poisson Lognormal distribution ## ================================================================== ## nb_param loglik BIC ICL R_squared ## 391 -799.525 -1560.376 -1140.4 0.479 ## ================================================================== ## * Useful fields ##model_par, $latent,$var_par, $optim_par ##$loglik, $BIC,$ICL, $loglik_vec,$nb_param, $criteria ## * Useful S3 methods ## print(), coef(), sigma(), vcov(), fitted(), predict(), standard_error() ## * Additional fields for LDA ##$percent_var, $corr_map,$scores, $group_means ## * Additional S3 methods for LDA ## plot.PLNLDAfit(), predict.PLNLDAfit() Comprehensive information about PLNLDAfit is available via ?PLNLDAfit. #### Specific fields The user can easily access several fields of the PLNLDAfit object using S3 methods, the most interesting ones are • the $$p \times p$$ covariance matrix $$\hat{\boldsymbol{\Sigma}}$$: sigma(myLDA_nocov) %>% corrplot::corrplot(is.corr = FALSE) • the regression coefficient matrix $$\hat{\boldsymbol{\Theta}}$$ (in this case NULL as there are no covariates) coef(myLDA_nocov) ## NULL • the $$p \times K$$ matrix of group means $$\mathbf{M}$$ myLDA_nocov$group_means %>% head() %>% knitr::kable(digits = 2)
grouping1 grouping2 grouping3 grouping4 grouping5 grouping6 grouping7 grouping8 grouping9 grouping10 grouping11 grouping12
-23.93 -6.82 -23.50 -22.33 -25.28 -25.86 -24.17 -23.32 -23.39 -5.65 -3.47 -22.31
-23.92 -24.66 -5.67 -22.31 -7.44 -25.84 -24.15 -23.30 -23.37 -23.44 -21.28 -4.47
-2.38 -3.96 -2.36 -2.92 -6.52 -5.54 -2.69 -2.03 -2.70 -2.65 -21.38 -3.85
-23.93 -24.68 -5.69 -4.52 -25.22 -25.90 -24.18 -5.51 -5.59 -4.91 -21.33 -22.35
-0.28 -0.26 -0.62 -1.09 -0.61 -0.11 -0.66 -0.80 -0.39 -0.45 -0.61 -1.16
-4.00 -3.32 -2.93 -4.47 -6.72 -8.00 -2.85 -3.06 -5.53 -3.99 -21.26 -22.28

The PLNLDAfit class also benefits from two important methods: plot and predict.

#### plot method

The plot methods provides easy to interpret graphics which reveals here that the groups are well separated:

plot(myLDA_nocov)

By default, plot shows the first 3 axis of the LDA when there are 4 or more groups and uses special representations for the edge cases of 3 or less groups.

ggplot2-savvy users who want to make their own representations can extracts the $$n \times (K-1)$$ matrix of sample scores from the PLNLDAfit object …

myLDA_nocov$scores %>% head %>% knitr::kable(digits = 2)  3280815 -293275 165207 -58387.1 17377.2 8675.17 -1075.38 -1315.04 107.62 12.65 -116.04 3281709 -289098 165412 -58751.2 17240.8 8688.21 -1089.94 -1354.69 103.19 14.04 -117 3290969 -250510 167255 -61123.5 16077.1 8539.15 -897.29 -1313.51 124.62 17.71 -110.79 3301250 -206745 169375 -63673.2 14789.7 8307.86 -591.64 -1131.26 156.45 19.56 -103 3295223 -232052 168177 -62152.4 15554.4 8411.7 -733.69 -1176.34 141.19 17.88 -107.12 3287620 -264169 166623 -60246.6 16501.6 8567.21 -932.11 -1274.87 120 15.52 -112.7 …or the $$p \times (K-1)$$ matrix of correlations between scores and (latent) variables myLDA_nocov$corr_map %>% head %>% knitr::kable(digits = 2)
 Che -0.26 -0.14 -0.61 0.23 -0.6 -0.32 -0.15 -0.39 0.84 -0.48 0.04 Hyc -0.03 0.51 0.13 0.39 -0.01 -0.49 0.64 0.92 -0.09 -0.17 0.59 Hym 0.08 0.15 0.22 -0.77 0.33 -0.13 -0.41 -0.02 -0.32 0.02 -0.61 Hys -0.54 -0.12 0.2 -0.07 0.27 -0.45 -0.43 0 -0.3 -0.03 0.28 Psy 0.24 -0.32 -0.19 -0.26 -0.4 0.24 -0.16 -0.04 0.5 -0.6 -0.35 Aga 0.1 0.11 0.24 -0.89 0.05 -0.2 -0.2 -0.17 -0.1 -0.21 -0.51

#### predict method

The predict method has a slightly different behavior than its siblings in other models of the PLNmodels. The goal of predict is to predict the discrete class based on observed species counts (rather than predicting counts from known covariates).

By default, the predict use the argument type = "posterior" to output the matrix of log-posterior probabilities $$\log(\hat{\pi})_k$$

predicted.class <- predict(myLDA_nocov, newdata = trichoptera)
## equivalent to
## predicted.class <- predict(myLDA_nocov, newdata = trichoptera,  type = "posterior")
predicted.class %>% head() %>% knitr::kable(digits = 2)
1 2 3 4 5 6 7 8 9 10 11 12
-17.45 -43.44 -40.48 -25.47 -51.64 -32.77 -25.91 -78.34 -75.40 -20.60 -180.29 -68.74
-9.66 -17.04 -53.83 -21.33 -66.08 -27.31 -14.68 -15.57 -15.20 -14.69 -106.65 -60.31
-12.93 -22.56 -102.94 -40.00 -107.76 -30.78 -22.53 -47.05 -41.44 -22.94 -130.78 -108.40
-18.43 -31.03 -112.91 -116.36 -134.87 -54.03 -41.96 -105.56 -94.14 -44.70 -274.09 -206.36
-13.84 -20.00 -49.69 -61.02 -69.57 -38.17 -26.90 -48.81 -42.22 -25.89 -172.25 -106.56
-9.24 -13.48 -34.17 -24.40 -45.11 -23.73 -14.61 -15.91 -15.29 -14.32 -88.89 -61.65

You can also show them in the standard (and human-friendly) $$[0, 1]$$ scale with scale = "prob" to get the matrix $$\hat{\pi}_k$$

predicted.class <- predict(myLDA_nocov, newdata = trichoptera, scale = "prob")
predicted.class %>% head() %>% knitr::kable(digits = 3)
1 2 3 4 5 6 7 8 9 10 11 12
0.958 0.000 0 0 0 0 0.000 0.000 0.000 0.041 0 0
0.980 0.001 0 0 0 0 0.006 0.003 0.004 0.006 0 0
1.000 0.000 0 0 0 0 0.000 0.000 0.000 0.000 0 0
1.000 0.000 0 0 0 0 0.000 0.000 0.000 0.000 0 0
0.998 0.002 0 0 0 0 0.000 0.000 0.000 0.000 0 0
0.972 0.014 0 0 0 0 0.005 0.001 0.002 0.006 0 0

Setting type = "response", we can predict the most likely group $$\hat{k}$$ instead:

predicted.class <- predict(myLDA_nocov, newdata = trichoptera,  type = "response")
predicted.class
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
##  1  1  1  1  1  1  9  2  1  1  1  1  2  3  2  2  2  3  3  3  9  3  4  1  4  4
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 12  5  4  5  6  7  7  7  8  8  8  8  8  1  9  9  9 10 10 10 10 11 12
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12

We can assess that the predictions are quite similar to the real group (this is not a proper validation of the method as we used dataset for both model fitting and prediction and are thus at risk of overfitting).

myLDA_cov$group_means %>% head %>% knitr::kable(digits = 2) grouping1 grouping2 grouping3 grouping4 grouping5 grouping6 grouping7 grouping8 grouping9 grouping10 grouping11 grouping12 -14.36 -9.51 -17.11 -13.24 -18.20 -31.27 -21.21 -19.91 -13.79 1.37 6.98 -16.56 -17.14 -31.50 -2.64 -11.03 -10.41 -35.05 -23.88 -21.20 -10.20 -12.16 -3.35 0.21 -1.25 -4.39 -1.81 -1.87 -5.80 -6.77 -2.66 -1.85 -1.65 -1.64 -17.16 -3.35 -17.23 -23.59 -3.60 -0.97 -19.19 -27.70 -21.38 -5.27 -1.79 -1.18 -12.89 -17.96 -0.05 -0.42 -0.47 -0.85 -0.48 -0.43 -0.61 -0.72 -0.06 -0.15 -0.18 -1.03 -1.97 -4.03 -1.81 -2.56 -5.55 -10.33 -2.72 -2.76 -3.47 -1.92 -15.62 -18.76 #### plot method Once again, the plot method is very useful to get a quick look at the results. plot(myLDA_cov) #### predict method We can again predict the most likely group for each sample : predicted.class_cov <- predict(myLDA_cov, newdata = trichoptera, type = "response") and check that we recover the correct class in most cases (again, we used the same dataset for model fitting and group prediction only for ease of exposition): table(predicted.class_cov, trichoptera$Group, dnn = c("predicted", "true"))
##          true
## predicted  1  2  3  4  5  6  7  8  9 10 11 12
##        1  11  0  0  1  0  0  0  0  1  0  0  0
##        2   1  4  0  0  0  0  0  0  0  0  0  0
##        3   0  1  4  0  0  0  0  0  0  0  0  0
##        4   0  0  0  3  1  0  0  0  0  0  0  0
##        5   0  0  0  0  2  0  0  0  0  0  0  0
##        6   0  0  0  0  0  1  0  0  0  0  0  0
##        7   0  0  0  0  0  0  3  0  0  0  0  0
##        8   0  0  0  0  0  0  0  4  1  0  0  0
##        9   0  0  1  0  0  0  0  0  3  0  0  0
##        10  0  0  0  0  0  0  0  0  0  4  0  0
##        11  0  0  0  0  0  0  0  0  0  0  1  0
##        12  0  0  0  0  1  0  0  0  0  0  0  1

## References

Aitchison, J., and C.H. Ho. 1989. “The Multivariate Poisson-Log Normal Distribution.” Biometrika 76 (4). Oxford University Press: 643–53.

Fisher, R. A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88.

Johnson, Steven G. 2011. The Nlopt Nonlinear-Optimization Package. http://ab-initio.mit.edu/nlopt.

Rao, C. Radhakrishna. 1948. “The Utilization of Multiple Measurements in Problems of Biological Classification.” Journal of the Royal Statistical Society. Series B (Methodological) 10 (2). [Royal Statistical Society, Wiley]: 159–203.

Svanberg, Krister. 2002. “A Class of Globally Convergent Optimization Methods Based on Conservative Convex Separable Approximations.” SIAM Journal on Optimization 12 (2). SIAM: 555–73.