cytoftree: extension of cytometree to analyze mass cytometry data

Anthony Devaux, Boris Hejblum

2019-08-19

Introduction to cytoftree

cytoftree is an extension to cytometree function to analyze mass cytometry data. These data are specific due to a high number of zero and the high number of markers (up to 100 potentially). cytoftree is based on cytometree’s algorithm which is the construction of binary tree, whose nodes represents cell sub-populations, and slighly modified to take into account the specification of mass cytometry data.

Data transformation

According to the literature, mass cytometry data must be transform to get best partitions. We propose different transformations: asinh (as default), biexp, log10 or none (without transformation).

Binary tree construction

  1. At each node, for each marker, the cells with zero values are temporarily set aside from the other cells.

  2. The remaining observed cells (or “events”) and markers are modeled by both a normal distribution (so unimodal), and a mixture of 2 normal distributions (so bimodal).

  3. If the AIC normalized differences \(D\) are significant, the cells are split into 2 groups according to the bimodal distribution. Cells with low values are annotated - (no marker) while cells with high values are annotated + (with marker). The cells with zero values are also annotated - (no marker).

  4. The binary tree is constructed until the cells can no longer be split into 2 groups.

Post-hoc annotation

Given the unsupervised nature of the binary tree, some of the available markers may not be used to find the different cell populations present in a given sample. To recover a complete annotation, we defined, as a post processing procedure, an annotation method which allows the user to distinguish two (or three) expression levels per marker.

Influenza vaccine response dataset analysis with cytoftree

In this example, we will use an influenza vaccine response dataset (from ImmuneSpace), with 39 markers. To speed-up the computation, we sampled 10 000 cells from this dataset.

Data preparation

First, we can look the structure and the markers of the data.

library(cytometree)
data(IMdata)
dim(IMdata)
## [1] 10000    39
colnames(IMdata)
##  [1] "Time"             "Cell_length"      "(In113)Dd_CD57"  
##  [4] "(In115)Dd_Dead"   "(Ce140)Dd_Bead"   "(Nd142)Dd_CD19"  
##  [7] "(Nd143)Dd_CD4"    "(Nd144)Dd_CD8"    "(Nd146)Dd_IgD"   
## [10] "(Sm147)Dd_CD85j"  "(Nd148)Dd_CD11c"  "(Sm149)Dd_CD16"  
## [13] "(Nd150)Dd_CD3"    "(Eu151)Dd_CD38"   "(Sm152)Dd_CD27"  
## [16] "(Eu153)Dd_CD11b"  "(Sm154)Dd_CD14"   "(Gd155)Dd_CCR6"  
## [19] "(Gd156)Dd_CD94"   "(Gd157)Dd_CD86"   "(Gd158)Dd_CXCR5" 
## [22] "(Tb159)Dd_CXCR3"  "(Gd160)Dd_CCR7"   "(Dy162)Dd_CD45RA"
## [25] "(Dy164)Dd_CD20"   "(Ho165)Dd_CD127"  "(Er166)Dd_CD33"  
## [28] "(Er167)Dd_CD28"   "(Er168)Dd_CD24"   "(Tm169)Dd_ICOS"  
## [31] "(Er170)Dd_CD161"  "(Yb171)Dd_TCRgd"  "(Yb172)Dd_PD-1"  
## [34] "(Yb173)Dd_CD123"  "(Yb174)Dd_CD56"   "(Lu175)Dd_HLADR" 
## [37] "(Yb176)Dd_CD25"   "(Ir191)Dd_DNA1"   "(Ir193)Dd_DNA2"

Then, we also check the proportion of zero for each marker, particularity of mass cytometry data.

zero_proportion <- apply(IMdata[, -c(1, 2)], MARGIN = 2, FUN = function(x) {
    round(prop.table(table(x == 0))["TRUE"] * 100, 2)
})
zero_proportion
##   (In113)Dd_CD57   (In115)Dd_Dead   (Ce140)Dd_Bead   (Nd142)Dd_CD19 
##            76.26            60.55            91.00            70.34 
##    (Nd143)Dd_CD4    (Nd144)Dd_CD8    (Nd146)Dd_IgD  (Sm147)Dd_CD85j 
##            38.92            45.04            70.61            55.14 
##  (Nd148)Dd_CD11c   (Sm149)Dd_CD16    (Nd150)Dd_CD3   (Eu151)Dd_CD38 
##            54.75            63.78            18.75            16.30 
##   (Sm152)Dd_CD27  (Eu153)Dd_CD11b   (Sm154)Dd_CD14   (Gd155)Dd_CCR6 
##            23.19            33.66            56.47            68.81 
##   (Gd156)Dd_CD94   (Gd157)Dd_CD86  (Gd158)Dd_CXCR5  (Tb159)Dd_CXCR3 
##            63.89            57.71            56.91            52.07 
##   (Gd160)Dd_CCR7 (Dy162)Dd_CD45RA   (Dy164)Dd_CD20  (Ho165)Dd_CD127 
##            29.54             8.67            49.11            34.57 
##   (Er166)Dd_CD33   (Er167)Dd_CD28   (Er168)Dd_CD24   (Tm169)Dd_ICOS 
##            39.43            31.08            64.00            74.00 
##  (Er170)Dd_CD161  (Yb171)Dd_TCRgd   (Yb172)Dd_PD-1  (Yb173)Dd_CD123 
##            76.15            90.25            81.13            75.48 
##   (Yb174)Dd_CD56  (Lu175)Dd_HLADR   (Yb176)Dd_CD25   (Ir191)Dd_DNA1 
##            66.72            45.16            56.40             1.42 
##   (Ir193)Dd_DNA2 
##             2.63

CytofTree function

According to the available markers, a gating strategy may be considered. In this example, we have a gating strategy to conserve only viable cells by splitting on the following markers : DNA1, DNA2, Cell_length, Bead and Dead. This way, we can be as close as possible to manual gating. To do this, we have to force the markers with the force_first_marker option (semi-supervised gating).

Then, to improve the performance of automating gating, we decided to transform data with asinh transformation (default transformation). Then, we have to choose which markers should be transformed using num_col argument. The columns Time et Cell_length are not mass cytometry measure and shouldn’t be transformed.

num_col <- c(3:ncol(IMdata))

tree <- CytofTree(M = IMdata, minleaf = 1, t = 0.1, verbose = FALSE, force_first_markers = c("(Ir191)Dd_DNA1", 
    "(Ir193)Dd_DNA2", "Cell_length", "(Ce140)Dd_Bead", "(In115)Dd_Dead"), transformation = "asinh", 
    num_col = num_col)
## Unable to force split on (In115)Dd_Dead for some node at level5
## Unable to force split on (In115)Dd_Dead for some node at level5
## It works !
max(tree$labels)
## [1] 823

High dimensional issues

Due to the high number of markers, cytoftree provides high number of sub-populations. minleaf value for the minimum of cells by sub-population and t threshold for the depth of the binary tree can be modified to get more or less sub-populations. The plot_graph function provides a look on the binary tree, but should be unreadable due to the high number of sub-populations.

Annotation function

The annotation function allows to recover the incomplete annotation on sub-populations. combinations option provides the complete annotation on each sub-population.

annot <- Annotation(tree, plot = FALSE, K2markers = colnames(IMdata))
annot$combinations[1:5, ]
##     Time Cell_length (In113)Dd_CD57 (In115)Dd_Dead (Ce140)Dd_Bead
## 786    1           0              0              0              0
## 592    1           0              0              0              0
## 621    1           0              0              0              0
## 817    1           0              0              0              0
## 635    1           0              0              0              0
##     (Nd142)Dd_CD19 (Nd143)Dd_CD4 (Nd144)Dd_CD8 (Nd146)Dd_IgD
## 786              0             1             0             0
## 592              0             0             1             0
## 621              1             0             0             1
## 817              0             1             0             0
## 635              0             0             0             0
##     (Sm147)Dd_CD85j (Nd148)Dd_CD11c (Sm149)Dd_CD16 (Nd150)Dd_CD3
## 786               0               0              0             1
## 592               0               0              0             1
## 621               1               0              0             0
## 817               0               0              0             1
## 635               0               0              0             0
##     (Eu151)Dd_CD38 (Sm152)Dd_CD27 (Eu153)Dd_CD11b (Sm154)Dd_CD14
## 786              0              1               0              0
## 592              0              1               0              0
## 621              0              0               0              0
## 817              0              1               0              0
## 635              0              0               0              0
##     (Gd155)Dd_CCR6 (Gd156)Dd_CD94 (Gd157)Dd_CD86 (Gd158)Dd_CXCR5
## 786              0              0              0               0
## 592              0              0              0               0
## 621              1              0              0               1
## 817              0              0              0               0
## 635              0              0              0               0
##     (Tb159)Dd_CXCR3 (Gd160)Dd_CCR7 (Dy162)Dd_CD45RA (Dy164)Dd_CD20
## 786               0              1                1              0
## 592               0              1                1              0
## 621               0              0                1              1
## 817               0              1                0              0
## 635               0              0                0              0
##     (Ho165)Dd_CD127 (Er166)Dd_CD33 (Er167)Dd_CD28 (Er168)Dd_CD24
## 786               0              0              1              0
## 592               0              0              1              0
## 621               0              0              0              0
## 817               0              0              1              0
## 635               0              0              0              0
##     (Tm169)Dd_ICOS (Er170)Dd_CD161 (Yb171)Dd_TCRgd (Yb172)Dd_PD-1
## 786              0               0               0              0
## 592              0               0               0              0
## 621              0               0               0              0
## 817              0               0               0              0
## 635              0               0               0              0
##     (Yb173)Dd_CD123 (Yb174)Dd_CD56 (Lu175)Dd_HLADR (Yb176)Dd_CD25
## 786               0              0               0              0
## 592               0              0               0              0
## 621               0              0               1              0
## 817               0              0               0              0
## 635               0              0               0              0
##     (Ir191)Dd_DNA1 (Ir193)Dd_DNA2 leaves count   prop
## 786              1              1    786   357 0.0357
## 592              1              1    592   182 0.0182
## 621              1              1    621   119 0.0119
## 817              1              1    817    96 0.0096
## 635              0              0    635    92 0.0092

Due to the high number of sub-populations, it’s recommended to use RetrievePops function which provide informations for particular sub-populations.

RetrievePops : providing informations for particular sub-populations

RetrievePops provides several informations on specific sub-populations, in particular the proportions and the sub-populations merged.

phenotypes <- list()
phenotypes[["CD4+"]] <- rbind(c("(Ir191)Dd_DNA1", 1), c("(Ir193)Dd_DNA2", 1), 
    c("Cell_length", 0), c("(Ce140)Dd_Bead", 0), c("(In115)Dd_Dead", 0), c("(Sm154)Dd_CD14", 
        0), c("(Er166)Dd_CD33", 0), c("(Nd150)Dd_CD3", 1), c("(Nd143)Dd_CD4", 
        1))

phenotypes[["CD8+"]] <- rbind(c("(Ir191)Dd_DNA1", 1), c("(Ir193)Dd_DNA2", 1), 
    c("Cell_length", 0), c("(Ce140)Dd_Bead", 0), c("(In115)Dd_Dead", 0), c("(Sm154)Dd_CD14", 
        0), c("(Er166)Dd_CD33", 0), c("(Nd150)Dd_CD3", 1), c("(Nd144)Dd_CD8", 
        1))

pheno_result <- RetrievePops(annot, phenotypes = phenotypes)

# CD4+
pheno_result$phenotypesinfo[[1]]
## $phenotype
## [1] "(Ir191)Dd_DNA1-1" "(Ir193)Dd_DNA2-1" "Cell_length-0"   
## [4] "(Ce140)Dd_Bead-0" "(In115)Dd_Dead-0" "(Sm154)Dd_CD14-0"
## [7] "(Er166)Dd_CD33-0" "(Nd150)Dd_CD3-1"  "(Nd143)Dd_CD4-1" 
## 
## $proportion
## [1] 0.1953
## 
## $Mergedlabels
##   [1] 786 817 785 816 738 772 384 696 470 713 813 823 800 468 805 801 617
##  [18] 639 642 788 807 773 821 561 669 220 820 325 803 819 471 555 641 711
##  [35]  66 527 556 614 714 769 171 469 712 810 818 215 616 618 645 740 248
##  [52] 354 381 500 667 695 815 176 216 350 557 558 665 668 670 697 698 716
##  [69] 717 771 806 326 352 585 646 664 671 766 811 814 174 181 183 351 355
##  [86] 386 499 559 560 619 643 644 647 715 770 804 113 170 172 175 182 219
## [103] 221 247 283 385 465 528 554 666 787
## 
## $Newlabel
## [1] 824
# CD8+
pheno_result$phenotypesinfo[[2]]
## $phenotype
## [1] "(Ir191)Dd_DNA1-1" "(Ir193)Dd_DNA2-1" "Cell_length-0"   
## [4] "(Ce140)Dd_Bead-0" "(In115)Dd_Dead-0" "(Sm154)Dd_CD14-0"
## [7] "(Er166)Dd_CD33-0" "(Nd150)Dd_CD3-1"  "(Nd144)Dd_CD8-1" 
## 
## $proportion
## [1] 0.0752
## 
## $Mergedlabels
##  [1] 592 672 699 620 473 534 590 186 220 587 288 506 562 428 531 586 589
## [18] 674 180 215 254 328 388 505 533 359 390 564 565  68 114 179 187 216
## [35] 290 361 387 472 563 718 741 742 329 358 360 389 529 178 181 183 289
## [52] 391 532 182 185 474 530 591
## 
## $Newlabel
## [1] 825

Proportions comparison between manual and automatic gating

We can compare proportions providing by automatic gating (cytoftree) and manual gating for the selected sub-populations.

CD4+ CD8+
Manual Gating 0.182 0.065
Automating Gating 0.195 0.075

cytoftree provides good results, close to the proportions getting by manual gating.