# Dimension reduction of multivariate count data with PLN-PCA

## Preliminaries

This vignette illustrates the basic use of the PLNPCA function and the methods accompanying the R6 classes PLNPCAfamily and PLNPCAfit.

### Requirements

The packages required for the analysis are PLNmodels plus some others for data manipulation and representation:

library(PLNmodels)
library(ggplot2)
library(corrplot)

### Data set

We illustrate our point with the trichoptera data set, a full description of which can be found in the corresponding vignette. Data preparation is also detailed in the specific vignette.

data(trichoptera)
trichoptera <- prepare_data(trichoptera$Abundance, trichoptera$Covariate)

The trichoptera data frame stores a matrix of counts (trichoptera$Abundance), a matrix of offsets (trichoptera$Offset) and some vectors of covariates (trichoptera$Wind, trichoptera$Temperature, etc.).

### Mathematical background

In the vein of Tipping and Bishop (1999), we introduce in Chiquet, Mariadassou, and Robin (2018) a probabilistic PCA model for multivariate count data which is a variant of the Poisson Lognormal model of Aitchison and Ho (1989) (see the PLN vignette as a reminder). Indeed, it can be viewed as a PLN model with an additional rank constraint on the covariance matrix $$\boldsymbol\Sigma$$ such that $$\mathrm{rank}(\boldsymbol\Sigma)= q$$.

This PLN-PCA model can be written in a hierarchical framework where a sample of $$p$$-dimensional observation vectors $$\mathbf{Y}_i$$ is related to some $$q$$-dimensional vectors of latent variables $$\mathbf{W}_i$$ as follows: $\begin{equation} \begin{array}{rcl} \text{latent space } & \mathbf{W}_i \quad \text{i.i.d.} & \mathbf{W}_i \sim \mathcal{N}(\mathbf{0}_q, \mathbf{I}_q) \\ \text{parameter space } & \mathbf{Z}_i = {\boldsymbol\mu} + \mathbf{B} \mathbf{W}_i & \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right) \end{array} \end{equation}$

The parameter $${\boldsymbol\mu}\in\mathbb{R}^p$$ corresponds to the main effects, the $$p\times q$$ matrix $$\mathbf{B}$$ to the loadings in the parameter space and $$\mathbf{W}_i$$ to the scores of the $$i$$-th observation in the low-dimensional latent subspace of the parameter space. The dimension of the latent space $$q$$ corresponds to the number of axes in the PCA or, in other words, to the rank of $$\mathbf{B}\mathbf{B}^\top$$. A hopefully more intuitive way of writing this model is the following: $\begin{equation} \begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu},\boldsymbol\Sigma), \qquad \boldsymbol\Sigma = \mathbf{B}\mathbf{B}^\top \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right), \end{array} \end{equation}$ where the interpretation of PLN-PCA as a rank-constrained PLN model is more obvious.
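The generative mechanism can be illustrated with a short simulation. This is only a sketch in base R under assumed values of $$n$$, $$p$$, $$q$$, $${\boldsymbol\mu}$$ and $$\mathbf{B}$$ (none of them come from the trichoptera data); it draws the latent scores, maps them to the parameter space so that the covariance of $$\mathbf{Z}_i$$ is $$\mathbf{B}\mathbf{B}^\top$$, and then draws the counts.

```r
set.seed(42)  # for reproducibility

n <- 200  # number of samples (assumed value)
p <- 6    # number of count variables (assumed value)
q <- 2    # dimension of the latent space (assumed value)

mu <- rep(1, p)                             # main effects, one per variable (assumed)
B  <- matrix(rnorm(p * q, sd = 0.5), p, q)  # p x q loadings (assumed)

# Latent space: W_i ~ N(0_q, I_q), stored as the rows of an n x q matrix
W <- matrix(rnorm(n * q), n, q)

# Parameter space: each row of Z is mu plus a combination of the q columns of B,
# so that Cov(Z_i) = B B'
Z <- sweep(W %*% t(B), 2, mu, "+")

# Observation space: Y_ij | Z_ij ~ Poisson(exp(Z_ij)), independently
Y <- matrix(rpois(n * p, lambda = exp(Z)), n, p)

# The implied covariance matrix is p x p but only has rank q
Sigma <- B %*% t(B)
rank_Sigma <- qr(Sigma)$rank
```

Here `rank_Sigma` equals `q` even though `Sigma` is a $$p \times p$$ matrix, which is the rank constraint that distinguishes PLN-PCA from the unconstrained PLN model.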

#### Covariates and offsets

Just like PLN, PLN-PCA generalizes to a formulation close to a multivariate generalized linear model where the main effect is due to a linear combination of $$d$$ covariates $$\mathbf{x}_i$$ and to a vector $$\mathbf{o}_i$$ of $$p$$ offsets in sample $$i$$. The latent layer then reads $\begin{equation} \mathbf{Z}_i \sim \mathcal{N}({\mathbf{o}_i + \mathbf{x}_i^\top\boldsymbol\Theta},\boldsymbol\Sigma), \qquad \boldsymbol\Sigma = \mathbf{B}\mathbf{B}^\top, \end{equation}$ where $$\boldsymbol\Theta$$ is a $$d\times p$$ matrix of regression parameters.
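A consequence of this formulation is that offsets act multiplicatively on the expected counts, and that the counts are over-dispersed relative to a Poisson distribution. The identities below follow from standard properties of the log-normal distribution and of conditional moments, not from anything specific to the package. The notations $$\boldsymbol\theta_j$$ (the $$j$$-th column of $$\boldsymbol\Theta$$) and $$\sigma_{jj}$$ (the $$j$$-th diagonal entry of $$\boldsymbol\Sigma$$) are introduced here for convenience.

```latex
\begin{align*}
\mathbb{E}[Y_{ij}]
  &= \exp\{o_{ij}\}\,
     \exp\left\{\mathbf{x}_i^\top \boldsymbol\theta_j + \tfrac{1}{2}\sigma_{jj}\right\}, \\
\mathbb{V}[Y_{ij}]
  &= \mathbb{E}[Y_{ij}]
     + \mathbb{E}[Y_{ij}]^2 \left(\exp\{\sigma_{jj}\} - 1\right).
\end{align*}
```

In particular, an offset entered on the log scale, as in offset(log(Offset)), rescales the expected count by Offset.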

#### Optimization by Variational inference

Dimension reduction and visualization are the main objectives of (PLN)-PCA. To reach this goal, we first need to estimate the model parameters. Inference in PLN-PCA focuses on the regression parameters $$\boldsymbol\Theta$$ and on the covariance matrix $$\boldsymbol\Sigma$$. Technically speaking, we adopt a variational strategy to approximate the log-likelihood function and optimize the resulting variational surrogate of the log-likelihood with a gradient-ascent-based approach. To this end, we rely on the CCSA algorithm of Svanberg (2002) implemented in the C++ library NLopt (Johnson 2011), which we link to the package. Technical details can be found in Chiquet, Mariadassou, and Robin (2018).
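As a rough guide to what is being optimized, the variational surrogate can be sketched as follows for the model without covariates and offsets. The Gaussian variational family with mean $$\mathbf{m}_i$$ and diagonal variances $$\mathbf{s}_i^2$$ is an assumption made here for illustration, and $$\mathbf{b}_j^\top$$ denotes the $$j$$-th row of $$\mathbf{B}$$, so that $$Z_{ij} = \mu_j + \mathbf{b}_j^\top \mathbf{W}_i$$. The exact parametrization used by the package is the one described in Chiquet, Mariadassou, and Robin (2018).

```latex
% Assumed variational family: q_i(W_i) = N(m_i, diag(s_i^2))
\begin{align*}
\log p(\mathbf{Y}_i)
  &\geq J_i
   = \mathbb{E}_{q_i}\left[\log p(\mathbf{Y}_i \mid \mathbf{W}_i)\right]
     - \mathrm{KL}\left(q_i \,\|\, \mathcal{N}(\mathbf{0}_q, \mathbf{I}_q)\right), \\
\mathbb{E}_{q_i}\left[\log p(\mathbf{Y}_i \mid \mathbf{W}_i)\right]
  &= \sum_{j=1}^p \left(
       Y_{ij}\left(\mu_j + \mathbf{b}_j^\top \mathbf{m}_i\right)
       - \exp\left\{\mu_j + \mathbf{b}_j^\top \mathbf{m}_i
           + \tfrac{1}{2}\sum_{k=1}^q b_{jk}^2 s_{ik}^2\right\}
       - \log(Y_{ij}!)
     \right), \\
\mathrm{KL}\left(q_i \,\|\, \mathcal{N}(\mathbf{0}_q, \mathbf{I}_q)\right)
  &= \frac{1}{2}\sum_{k=1}^q \left(m_{ik}^2 + s_{ik}^2 - \log s_{ik}^2 - 1\right).
\end{align*}
```

Summing $$J_i$$ over the samples gives a lower bound of the log-likelihood that is available in closed form, which is what makes a gradient-based optimizer applicable.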

## Analysis of trichoptera data with a PLNPCA model

In the package, the PLNPCA model is adjusted with the function PLNPCA, which we review in this section. This function adjusts the model for a series of values of $$q$$ and provides a collection of PLNPCAfit objects stored in an object of class PLNPCAfamily.

The class PLNPCAfit inherits from the class PLNfit, so we strongly recommend that the reader be comfortable with PLN and PLNfit before using PLNPCA (see the PLN vignette).

### A model with latent main effects for the Trichoptera data set

#### Adjusting a collection of fits

We fit a collection of models, one for each rank $$q$$ from 1 to 5, as follows:

PCA_models <- PLNPCA(
  Abundance ~ 1 + offset(log(Offset)),
  data  = trichoptera,
  ranks = 1:5
)
##
##  Initialization...
##
##  Adjusting 5 PLN models for PCA analysis.
##   Rank approximation = 1
##   Rank approximation = 2
##   Rank approximation = 3
##   Rank approximation = 4
##   Rank approximation = 5
##  Post-treatments
##  DONE!

Note the use of the formula object to specify the model, similar to the one used in the function PLN.
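To see how such a formula separates the design matrix from the offset, here is a small sketch that uses only base R's generic formula tools. It does not call PLNPCA and is not meant as a description of the package's internals; the data frame `toy` and its columns are hypothetical.

```r
# Hypothetical toy data: one response, one covariate and a sampling effort
toy <- data.frame(
  y      = c(3, 0, 5),
  Wind   = c(1.2, -0.4, 0.3),
  Offset = c(10, 20, 40)
)

f  <- y ~ 1 + Wind + offset(log(Offset))
mf <- model.frame(f, data = toy)

# Design matrix: the intercept and the covariate, without the offset term
X <- model.matrix(attr(mf, "terms"), mf)

# Offset: evaluated on the log scale, one value per sample
O <- model.offset(mf)
```

The offset term does not appear among the columns of `X`: it is added as is to the linear predictor, with no coefficient to estimate.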

#### Structure of PLNPCAfamily

The PCA_models variable is an R6 object with class PLNPCAfamily, which comes with a couple of methods. The most basic is the show/print method, which sends a brief summary of the estimation process:

PCA_models
## --------------------------------------------------------
## COLLECTION OF 5 POISSON LOGNORMAL MODELS
## --------------------------------------------------------
## ========================================================
##  - Ranks considered: from 1 to 5
##  - Best model (greater BIC): rank = 4 - R2 = 0.86
##  - Best model (greater ICL): rank = 3 - R2 = 0.8

One can also easily access the successive values of the criteria in the collection:

PCA_models$criteria %>% knitr::kable()

| param | nb_param | loglik | BIC | ICL | R_squared |
|------:|---------:|----------:|----------:|----------:|----------:|
| 1 | 34 | -1458.282 | -1524.443 | -1536.694 | 0.4882522 |
| 2 | 51 | -1146.810 | -1246.051 | -1279.686 | 0.7136556 |
| 3 | 68 | -1057.785 | -1190.107 | -1247.081 | 0.8036381 |
| 4 | 85 | -1012.067 | -1177.469 | -1266.955 | 0.8611270 |
| 5 | 102 | -1003.248 | -1201.731 | -1328.160 | 0.8932537 |

A quick diagnostic of the optimization process is available via the convergence field:

PCA_models$convergence %>% knitr::kable()

| param | nb_param | iterations | status | message |
|------:|---------:|-----------:|-------:|:--------|
| 1 | 34 | 543 | 3 | ftol_rel or ftol_abs was reached |
| 2 | 51 | 959 | 3 | ftol_rel or ftol_abs was reached |
| 3 | 68 | 970 | 3 | ftol_rel or ftol_abs was reached |
| 4 | 85 | 2633 | 3 | ftol_rel or ftol_abs was reached |
| 5 | 102 | 1535 | 3 | ftol_rel or ftol_abs was reached |
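The reported criteria can be cross-checked by hand. The sketch below copies the values of nb_param, loglik and BIC reported above and compares them with a parameter count of $$p(d+q)$$ and a penalized log-likelihood of the form loglik minus half of nb_param times $$\log(n)$$. Both formulas are inferred from the numbers rather than taken from the package documentation, and the values $$n = 49$$ samples and $$p = 17$$ species are assumptions about the trichoptera data that are not stated in this vignette.

```r
# Criteria copied from the values reported above
criteria <- data.frame(
  param    = 1:5,
  nb_param = c(34, 51, 68, 85, 102),
  loglik   = c(-1458.282, -1146.810, -1057.785, -1012.067, -1003.248),
  BIC      = c(-1524.443, -1246.051, -1190.107, -1177.469, -1201.731)
)

n <- 49  # assumed number of samples (trapping nights)
p <- 17  # assumed number of species (columns of Abundance)
d <- 1   # intercept-only model: one regression coefficient per species

# Parameter count: p * d regression coefficients plus p * q loadings
nb_param_check <- p * (d + criteria$param)

# Penalized log-likelihood, in a "larger is better" convention
bic_check <- criteria$loglik - 0.5 * criteria$nb_param * log(n)

# Largest discrepancy with the reported BIC (rounding error only)
max_abs_gap <- max(abs(bic_check - criteria$BIC))
```

Under these assumptions the recomputed values agree with the reported ones up to rounding, which also explains why the best model is the one with the greatest BIC rather than the smallest.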

Comprehensive information about PLNPCAfamily is available via ?PLNPCAfamily.

#### Model selection of rank $$q$$

The plot method of PLNPCAfamily displays the evolution of the criteria mentioned above, and is a good starting point for model selection:

plot(PCA_models)

In this case, the variational lower bound of the log-likelihood is, as expected, strictly increasing with the number of axes (or subspace dimension). Also note the (approximated) $$R^2$$, which is displayed for each value of $$q$$ (see Chiquet, Mariadassou, and Robin (2018) for details on its computation).

From this plot, we can see that the best model in terms of BIC or ICL is obtained for a rank $$q=4$$ or $$q=3$$, respectively. We may extract the corresponding model with getBestModel(), here with the "ICL" criterion. A model with a specific rank can be extracted with the getModel() method:

myPCA_ICL <- getBestModel(PCA_models, "ICL")
myPCA_BIC <- getModel(PCA_models, 4) # equivalent to getBestModel(PCA_models, "BIC") here, since BIC selects rank 4
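The selection rule itself is simple to reproduce. The following sketch copies the BIC and ICL values reported earlier into a plain data frame and picks, for each criterion, the rank with the greatest value. It only illustrates the "greater is better" convention and does not call the package.

```r
# BIC and ICL copied from the criteria reported earlier in this vignette
criteria <- data.frame(
  param = 1:5,
  BIC   = c(-1524.443, -1246.051, -1190.107, -1177.469, -1201.731),
  ICL   = c(-1536.694, -1279.686, -1247.081, -1266.955, -1328.160)
)

# Rank maximizing each criterion
best_rank_BIC <- criteria$param[which.max(criteria$BIC)]
best_rank_ICL <- criteria$param[which.max(criteria$ICL)]
```

This gives rank 4 for BIC and rank 3 for ICL, in agreement with the summary printed for the collection of models.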

#### Structure of PLNPCAfit

Objects myPCA_ICL and myPCA_BIC are R6 objects of class PLNPCAfit, which in turn own a couple of methods, some inherited from PLNfit and some specific ones, mostly for visualization purposes. The plot method provides individual maps and correlation circles as in usual PCA. If an additional classification exists for the observations (which is the case here with the available classification of the trapping nights), it can be passed as an argument to the function (see note 1 at the end of this vignette).

plot(myPCA_ICL, ind_cols = trichoptera$Group)

Among other fields and methods (see ?PLNPCAfit for a comprehensive view), the most interesting for the end user in the context of PCA are

• the regression coefficient matrix

coef(myPCA_ICL) %>% head() %>% knitr::kable()

|     | (Intercept) |
|:----|------------:|
| Che | -7.5790768 |
| Hyc | -8.1466088 |
| Hym | -3.0278324 |
| Hys | -6.8968412 |
| Psy | -0.5311005 |
| Aga | -3.8116184 |

• the estimated covariance matrix $$\boldsymbol\Sigma$$ with fixed rank

sigma(myPCA_ICL) %>% corrplot(is.corr = FALSE)

• the rotation matrix (in the latent space)

myPCA_ICL$rotation %>% head() %>% knitr::kable()

|     |           |           |            |
|:----|----------:|----------:|-----------:|
| Che | -0.233327 | 0.32026 | 0.0849517 |
| Hyc | -0.405657 | -0.329103 | 0.138267 |
| Hym | -0.112457 | 0.0414272 | 0.34476 |
| Hys | -0.416779 | 0.199188 | 0.421979 |
| Psy | 0.0445403 | 0.0442388 | -0.0613145 |
| Aga | 0.0659549 | 0.193355 | 0.292976 |

• the principal components values (or scores)

myPCA_ICL$scores %>% head() %>% knitr::kable()

|         |           |           |
|--------:|----------:|----------:|
| -1.6162 | -0.916724 | 0.623688 |
| 3.8569 | 0.117009 | 2.14923 |
| 7.59638 | -0.962596 | -0.193552 |
| 6.24705 | -0.220069 | -2.38914 |
| 4.55689 | 0.677044 | -1.19307 |
| 4.09436 | 0.750704 | 0.546355 |

PLNPCAfit also inherits the methods of PLNfit (see the appropriate vignette). Most are recalled via the show method:

myPCA_ICL
## Poisson Lognormal with rank constrained for PCA (rank = 3)
## ==================================================================
##  nb_param    loglik       BIC       ICL R_squared
##        68 -1057.785 -1190.107 -1247.081     0.804
## ==================================================================
## * Useful fields
##     $model_par, $latent, $var_par, $optim_par
##     $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
## * Useful S3 methods
##     print(), coef(), sigma(), vcov(), fitted(), predict(), standard_error()
## * Additional fields for PCA
##     $percent_var, $corr_circle, $scores, $rotation
## * Additional S3 methods for PCA
##     plot.PLNPCAfit()

### A model accounting for meteorological covariates

A contribution of PLN-PCA is the possibility of taking covariates into account in the parameter space. Such a strategy often completely changes the interpretation of PCA. Indeed, the covariates are often responsible for some strong structure in the data. Their effect should be removed, since it is often quite obvious to the analyst and may hide more important and subtle effects. In the case at hand, the covariates correspond to the meteorological variables. Let us introduce some of them in our model, for instance the temperature, the wind and the cloudiness. This can be done through the model formula:

PCA_models_cov <- PLNPCA(
  Abundance ~ 1 + offset(log(Offset)) + Temperature + Wind + Cloudiness,
  data  = trichoptera,
  ranks = 1:4
)
##
##  Initialization...
##
##  Adjusting 4 PLN models for PCA analysis.
##   Rank approximation = 1
##   Rank approximation = 2
##   Rank approximation = 3
##   Rank approximation = 4
##  Post-treatments
##  DONE!

Again, the best model is obtained for a rank $$q=3$$.

plot(PCA_models_cov)

myPCA_cov <- getBestModel(PCA_models_cov, "ICL")

Suppose that we want to take a closer look at the first two axes. This can be done with the plot method:

gridExtra::grid.arrange(
  plot(myPCA_cov, map = "individual", ind_cols = trichoptera$Group, plot = FALSE),
  plot(myPCA_cov, map = "variable", plot = FALSE),
  ncol = 2
)

We can check that the fitted values of the counts, even with this low-rank covariance matrix, are close to the observed ones:

data.frame(
  fitted   = as.vector(fitted(myPCA_cov)),
  observed = as.vector(trichoptera$Abundance)
) %>%
  ggplot(aes(x = observed, y = fitted)) +
  geom_point(size = .5, alpha = .25) +
  scale_x_log10(limits = c(1, 1000)) +
  scale_y_log10(limits = c(1, 1000)) +
  theme_bw() + annotation_logticks()

Figure: fitted value vs. observation

1. With our PLN-PCA (and any pPCA model for count data, where successive models are not nested), it is important to perform the model selection of $$q$$ prior to visualization, since the model with rank $$q=3$$ is not nested in the model with rank $$q=4$$. Hence, the percentage of variance must be interpreted with care: it sums to 100% but must be put in perspective with the model $$R^2$$, which gives an approximation of the total percentage of variance explained by the current model.