This vignette documents the data format used in **PLNmodel** by `PLN`

and its variants. It also shows how to create an object in the proper format for further analyses from (i) tabular data, (ii) biom-class objects and (iii) phyloseq-class objects.

We illustrate the format using trichoptera data set, a full description of which can be found in the corresponding vignette.

The trichoptera data set is a list made of two data frames: `Abundance`

(hereafter referred to as the *counts*) and `Covariate`

(hereafter the *covariates*).

```
## List of 2
## $ Abundance:'data.frame': 49 obs. of 17 variables:
## $ Covariate:'data.frame': 49 obs. of 7 variables:
```

The covariates include, among others, the wind, pressure and humidity.

```
## [1] "Temperature" "Wind" "Pressure" "Humidity"
## [5] "Cloudiness" "Precipitation" "Group"
```

In the PLN framework, we model the counts from the covariates, let’s say wind and pressure, using a Poisson Log-Normal model. Most models in R use the so-called *formula interface* and it would thus be naturel to write something like

Unfortunately and unlike many generalized linear models, the response in PLN is intrinsically **multivariate**: it has 17 dimensions in our example. The left hand side (LHS) must encode a multivariate response across multiple samples, using a 2D-array (e.g. a matrix or a data frame).

We must therefore prepare a data structure where `Abundance`

refers to a count *matrix* whereas `Wind`

and `Pressure`

refer to *vectors* before feeding it to `PLN`

. That’s the purpose of `prepare_data`

.

```
trichoptera2 <- prepare_data(counts = trichoptera$Abundance,
covariates = trichoptera$Covariate)
str(trichoptera2)
```

```
## 'data.frame': 49 obs. of 9 variables:
## $ Abundance : num [1:49, 1:17] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "1" "2" "3" "4" ...
## .. ..$ : chr "Che" "Hyc" "Hym" "Hys" ...
## $ Temperature : num 18.7 19.8 22 23 22.5 23.9 15 17.2 15.4 14.1 ...
## $ Wind : num -2.3 -2.7 -0.7 2.3 2.3 -2 -4.7 -1 -2.7 -3.7 ...
## $ Pressure : num 998 1000 997 991 990 ...
## $ Humidity : num 60 63 73 71 62 64 93 84 88 75 ...
## $ Cloudiness : num 19 0 6 81 50 50 100 19 69 6 ...
## $ Precipitation: num 0 0 0 0 0 0 1.6 0 1.6 0 ...
## $ Group : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Offset : num 29 13 38 192 79 18 8 34 12 4 ...
```

If you look carefully, you can notice a few difference between `trichoptera`

and `trichoptera2`

:

- the first is a
`list`

whereas the second is a`data.frame`

^{1}; `Abundance`

is a matrix-column of`trichoptera2`

that you can extract using the usual functions`[`

and`[[`

to retrieve the count matrix;`trichoptera2`

has an additional`Offset`

column (more on that later).

It is common practice when modeling count data to introduce an offset term to control for different sampling efforts, exposures, baselines, etc. The *proper way* to compute sample-specific offsets in still debated and may vary depending on the field. There are nevertheless a few popular methods:

- Total Sum Scaling (TSS), where the offset of a sample is the total count in that sample
- Cumulative Sum Scaling (CSS), introduced in (Paulson et al. 2013), where the offset of a sample if the cumulative sum of counts in that sample, up to a quantile determined in a data driven way.
- Relative Log-Expression (RLE), implemented in (Anders and Huber 2010), where all samples are used to compute a reference sample, each sample is compared to the reference sample using log-ratios and the offset is the median log-ratio.
- Geometric Mean of Pairwise Ratio (GMPR), introduced in (Chen et al. 2018) where each sample is compared to each other to compute a median log-ratio and the offset of a sample is the geometric means of those pairwise ratios.

Each of these offset be computed from a counts matrix using the `compute_offset`

function and changing its `offset`

argument:

```
## same as compute_offset(trichoptera$Abundance, offset = "TSS")
compute_offset(trichoptera$Abundance)
```

```
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 29 13 38 192 79 18 8 34 12 4 4 3 49 33 600 172
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 58 51 56 127 35 13 17 3 27 40 44 8 9 1599 2980 88
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 135 327 66 90 63 15 14 20 70 53 95 43 62 149 16 31
## 49
## 86
```

In this particular example, the counts are too sparse and sophisticated offset methods all fail (numeric output hidden)

```
## Warning in offset_function(counts, ...): Some samples only have 1 positive
## values. Can't compute quantiles and fall back to TSS normalization
```

```
## Warning in offset_function(counts, ...): Because of high sparsity, some samples
## have null or infinite offset.
```

We can mitigate this problem for the RLE offset by adding pseudocounts to the counts although doing so has its own drawbacks.

```
## 1 2 3 4 5 6 7 8
## 0.9186270 0.8349121 0.8570257 0.9186270 0.9186270 0.8349121 0.8192245 0.8570257
## 9 10 11 12 13 14 15 16
## 0.7322797 0.7322797 0.6321923 0.6321923 0.9240361 0.9186270 1.3788037 0.9240361
## 17 18 19 20 21 22 23 24
## 0.9186270 0.9240361 0.9240361 1.7140514 0.8908577 0.8570257 0.8349121 0.6321923
## 25 26 27 28 29 30 31 32
## 0.9240361 0.9240361 0.9186270 0.8570257 0.8349121 2.7721084 3.2934218 0.9584503
## 33 34 35 36 37 38 39 40
## 1.0406547 0.9584503 0.9584503 0.9186270 0.9584503 0.8908577 0.8349121 0.7322797
## 41 42 43 44 45 46 47 48
## 0.9240361 0.9240361 0.9584503 0.9584503 1.2643846 1.7140514 0.8233555 0.8908577
## 49
## 0.9186270
```

`prepare_data`

We’ll already learned that `prepare_data`

can join counts and covariates into a single data.frame. It can also compute offset through `compute_offset`

and does so by default with `offset = "TSS"`

, hence the `Offset`

column in `trichoptera2`

. You can change the offset method and provide additional arguments that will passed on to `compute_offset`

.

```
## 'data.frame': 49 obs. of 9 variables:
## $ Abundance : num [1:49, 1:17] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "1" "2" "3" "4" ...
## .. ..$ : chr "Che" "Hyc" "Hym" "Hys" ...
## $ Temperature : num 18.7 19.8 22 23 22.5 23.9 15 17.2 15.4 14.1 ...
## $ Wind : num -2.3 -2.7 -0.7 2.3 2.3 -2 -4.7 -1 -2.7 -3.7 ...
## $ Pressure : num 998 1000 997 991 990 ...
## $ Humidity : num 60 63 73 71 62 64 93 84 88 75 ...
## $ Cloudiness : num 19 0 6 81 50 50 100 19 69 6 ...
## $ Precipitation: num 0 0 0 0 0 0 1.6 0 1.6 0 ...
## $ Group : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Offset : num 0.919 0.835 0.857 0.919 0.919 ...
```

Different communities use different standard for the count data where samples are either or columns of the counts matrix. `prepare_data`

uses heuristics to guess the direction of the counts matrix (or fail informatively doing so) and automatically transpose it if needed.

Finally, `prepare_data`

enforces sample-consistency between the counts and the covariates and automatically trims away: - samples for which only covariates or only counts are available; - samples with no positive counts

For example, if we remove the first sample from the counts and the last one from the covariates, we end up with 49 - 2 = 47 samples left, as expected.

```
nrow(prepare_data(trichoptera$Abundance[-1, ], ## remove first sample
trichoptera$Covariate[-49,] ## remove last sample
))
```

`## [1] 47`

`prepare_data_from_[phyloseq|biom]`

Community composition data are quite popular in microbial ecology and usually stored in flat files using the biom format and/or imported in R as phyloseq-class objects (McMurdie 2013) using the Bioconductor phyloseq package.

We show here how to import data from a biom file (or biom-class object) and form a phyloseq-class object.

Reading from a biom file requires the bioconductor package biomformat. This package is **not** a standard dependency of PLNmodels and needs to be installed separately.

You can easily prepare your data from a biom file using the following steps:

- read your biom file with
`biomformat::read_biom()`

- extract the count table with
`biomformat::biom_data()`

- extract the covariates with
`biomformat::sample_metadata()`

(or build your own) - feed them to
`prepare_data`

as illustrated below:

```
## If biomformat is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
# install.packages("BiocManager")
# }
# BiocManager::install("biomformat")
library(biomformat)
biomfile <- system.file("extdata", "rich_dense_otu_table.biom", package = "biomformat")
biom <- biomformat::read_biom(biomfile)
## extract counts
counts <- as(biomformat::biom_data(biom), "matrix")
## extract covariates (or prepare your own)
covariates <- biomformat::sample_metadata(biom)
## prepare data
my_data <- prepare_data(counts = counts, covariates = covariates)
str(my_data)
```

Likewise, preparing data from a phyloseq-class object requires the bioconductor package phyloseq. This package is **not** a standard dependency of PLNmodels and needs to be installed separately.

You can easily prepare your data from a phyloseq object using the following steps:

- extract the count table with
`phyloseq::otu_table()`

- extract the covariates with
`phyloseq::sample_data()`

(or build your own) - feed them to
`prepare_data`

as illustrated below:

```
## If biomformat is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
# install.packages("BiocManager")
# }
# BiocManager::install("phyloseq")
library(phyloseq)
data("enterotype")
## extract counts
counts <- as(phyloseq::otu_table(enterotype), "matrix")
## extract covariates (or prepare your own)
covariates <- phyloseq::sample_data(enterotype)
## prepare data
my_data <- prepare_data(counts = counts, covariates = covariates)
str(my_data)
```

We detail here the mathematical background behind the various offsets and the way they are computed. Note \(\mathbf{C} = (c_{ij})\) the counts matrix where \(c_{ij}\) is the count of species \(j\) in sample \(i\). Assume that there are \(J\) species in total. The offset of sample \(i\) is noted \(O_i\) and computed in the following way.

Offsets are simply the total counts of a sample: \[ O_i = \sum_{j=1}^J c_{ij} \]

Positive counts are used to compute sample-specific quantiles \(q_i^l\) and cumulative sums \(s_i^l\) defined as \[
q_i^l = \min \{q \text{ such that } \sum_j 1_{c_{ij} \leq q} \geq l \sum_j 1_{c_{ij} > 0} \} \qquad s_i^l = \sum_{j: c_{ij} \leq q_i^l} c_{ij}
\] The sample-specific quantiles are then used to compute reference quantiles defined as \(q^l = \text{median} \{q^i_l\}\) and median average deviation around the quantile \(q^l\) as \(d^l = \text{median} |q_i^l - q^l|\). The method then searches for the smallest quantile \(l\) for which it detects instability, defined as large relative increase in the \(d^l\). Formally, \(\hat{l}\) is the smallest \(l\) satisfying \(\frac{d^{l+1} - d^l}{d^l} \geq 0.1\). The scaling sample-specific offset are then chosen as: \[
O_i = s_i^{\hat{l}} / \text{median}_i \{ s_i^{\hat{l}} \}
\] Dividing by the median of the \(s_i^{\hat{l}}\) ensures that offsets are centered around \(1\) and compare sizes differences with respect to the reference sample. Note also that the reference quantiles \(q^l\) can be computed using either the median (default, as in the original Paulson et al. (2013) paper) or the mean, by specifying `reference = mean`

, as implemented in `metagenomeseq`

.

A reference sample \((q_j)_j\) is first built by computing the geometric means of each species count: \[
q_j = \exp \left( \frac{1}{n} \sum_{i} \log(c_{ij})\right)
\] Each sample is then compared to the reference sample to compute one ratio per species and the final offset \(O_i\) is the median of those ratios: \[
O_i = \text{median}_j \frac{c_{ij}}{q_j}
\] The method fails when no species is shared across all sample (as all \(q_j\) are then \(0\)) or when a sample shares less than 50% of species with the reference (in which case the median of the ratios may be null or infinite). The problem can be alleviated by adding pseudocounts to the \(c_{ij}\) with `pseudocounts = 1`

.

This method is similar to RLE but does create a reference sample. Instead, each sample is compared to each other to compute a median ratio (similar to RLE) \[ r_{ii'} = {\text{median}}_{j: c_{ij}.c_{i'j} > 0} \frac{c_{ij}}{c_{i'j}} \] The offset is then taken as the median of all the \(r_{ii'}\): \[ O_i = \text{median}_{i' != i} r_{ii'} \] The method fails when there is only one sample in the data set or when a sample shares no species with any other.

Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” *Genome Biology* 11 (10): R106. https://doi.org/10.1186/gb-2010-11-10-r106.

Chen, Li, James Reeve, Lujun Zhang, Shengbing Huang, Xuefeng Wang, and Jun Chen. 2018. “GMPR: A Robust Normalization Method for Zero-Inflated Count Data with Application to Microbiome Sequencing Data.” *PeerJ* 6 (April): e4600. https://doi.org/10.7717/peerj.4600.

McMurdie, Paul J. AND Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” *PLoS ONE* 8 (4). Public Library of Science: e61217. https://doi.org/10.1371/journal.pone.0061217.

Paulson, Joseph N, O. Colin Stine, Héctor Corrada Bravo, and Mihai Pop. 2013. “Differential Abundance Analysis for Microbial Marker-Gene Surveys.” *Nat Methods* 10 (September): 1200–1202. https://doi.org/10.1038/nmeth.2658.

although a

`data.frame`

is technically a`list`

↩