Introduction

The tcR package designed to help researchers in the immunology field to analyse T cell receptor (TCR) and immunoglobulin (Ig) repertoires. In this vignette, I will cover procedures for immune receptor repertoire analysis provided with the package.

Terms:

Package features

  • Parsers for outputs of various tools for CDR3 extraction and genes alignment (currently implemented parsers for MiTCR, MiGEC, VDJtools, ImmunoSEQ, IMSEQ and MiXCR)
  • Data manipulation (in-frame / out-of-frame sequences subsetting, clonotype motif search)
  • Descriptive statistics (number of reads, number of clonotypes, gene segment usage)
  • Shared clonotypes statistics (number of shared clonotypes, using V genes or not; sequential intersection among the most abundant clonotype (“top-cross”))
  • Repertoire comparison (Jaccard index, Morisita’s overlap index, Horn’s index, Tversky index, overlap coefficient)
  • V- and J genes usage and it’s analysis (PCA, Shannon Entropy, Jensen-Shannon Divergence)
  • Diversity evaluation (ecological diversity index, Gini index, inverse Simpson index, rarefaction analysis)
  • Artificial repertoire generation (beta chain only, for now)
  • Spectratyping
  • Various visualisation procedures
  • Mutation networks (graphs, in which vertices represent CDR3 nucleotide / amino acid sequences and edges are connecting similar sequences with low hamming or edit distance between them)

Data in the package

There are two datasets provided with the package - twins data and V(D)J recombination genes data.

Downsampled twins data

twa.rda, twb.rda - two lists with 4 data frames in each list. Every data frame is a sample downsampled to the 10000 most abundant clonotypes of twins data (alpha and beta chains). Full data is available here:

Twins TCR data at Laboratory of Comparative and Functional Genomics

Explore the data:

# Load the package and load the data.
library(tcR)
data(twa)  # "twa" - list of length 4
data(twb)  # "twb" - list of length 4

# Explore the data.
head(twa[[1]])
head(twb[[1]])

Gene alphabets

Gene alphabets - character vectors with names of genes for TCR and Ig.

# Run help to see available alphabets.
?genealphabets
?genesegments
data(genesegments)

Quick start / automatic report generation

For the exploratory analysis of a single repertoire, use the RMarkdown report file at

"<path to the tcR package>/inst/library.report.Rmd"

Analysis in the file include statistics and visualisation of number of clones, clonotypes, in- and out-of-frame sequences, unique amino acid CDR3 sequences, V- and J-usage, most frequent k-mers, rarefaction analysis.

For the analysis of a group of repertoires (“cross-analysis”), use the RMarkdown report file at:

"<path to the tcR package>/inst/crossanalysis.report.Rmd}"

Analysis in this file include statistics and visualisation of number of shared clones and clonotypes, V-usage for individuals and groups, J-usage for individuals, Jensen-Shannon divergence among V-usages of repertoires and top-cross.

You need the knitr package installed in order to generate reports from default pipelines. In RStudio you can run a pipeline file as follows:

Run RStudio -> load the pipeline .Rmd files -> press the knitr button

Input parsing

Currently in tcR there are implemented parser for the next software:

  • MiTCR - parse.mitcr;

  • MiTCR w/ UMIs - parse.mitcrbc;

  • MiGEC - parse.migec;

  • VDJtools - parse.vdjtools;

  • ImmunoSEQ - parse.immunoseq;

  • MiXCR - parse.mixcr;

  • IMSEQ - parse.imseq.

Also a general parser parse.cloneset for a text table files is implemented. General wrapper for parsers is parse.file. User can also parse a list of files or the entire folder. Run ?parse.folder to see a help on parsing input files and a list of functions for parsing a specific input format.

# Parse file in "~/mitcr/immdata1.txt" as a MiTCR file.
immdata1 <- parse.file("~/mitcr_data/immdata1.txt", 'mitcr')
# equivalent to
immdata1.eq <- parse.mitcr("~/mitcr_data/immdata1.txt")

# Parse folder with MiGEC files.
immdata <- parse.folder("~/migec_data/", 'migec')

Cloneset representation

Clonesets represented in tcR as data frames with each row corresponding to the one nucleotide clonotype and with specific column names:

  • Umi.count - number of UMIs;
  • Umi.proportion - proportion of UMIs;
  • Read.count - number of reads;
  • Read.proportion - proportion of reads;
  • CDR3.nucleotide.sequence - CDR3 nucleotide sequence;
  • CDR3.amino.acid.sequence - CDR3 amino acid sequence;
  • V.gene - names of aligned Variable genes;
  • J.gene - names of aligned Joining genes;
  • D.gene - names of aligned Diversity genes;
  • V.end - last positions of aligned V genes (1-based);
  • J.start - first positions of aligned J genes (1-based);
  • D5.end - positions of D’5 end of aligned D genes (1-based);
  • D3.end - positions of D’3 end of aligned D genes (1-based);
  • VD.insertions - number of inserted nucleotides (N-nucleotides) at V-D junction (-1 for receptors with VJ recombination);
  • DJ.insertions - number of inserted nucleotides (N-nucleotides) at D-J junction (-1 for receptors with VJ recombination);
  • Total.insertions - total number of inserted nucleotides (number of N-nucleotides at V-J junction for receptors with VJ recombination).

Any data frame with this columns and of this class is suitable for processing with the package, hence user can generate their own table files and load them for the further analysis using read.csv, read.table and other base R functions. Please note that tcR internally expects all strings to be of class “character”, not “factor”. Therefore you should use R parsing functions with parameter stringsAsFactors=FALSE.

# No D genes is available here hence "" at "D.genes" and "-1" at positions.
str(twa[[1]])
## 'data.frame':    10000 obs. of  16 variables:
##  $ Umi.count               : logi  NA NA NA NA NA NA ...
##  $ Umi.proportion          : logi  NA NA NA NA NA NA ...
##  $ Read.count              : int  147158 58018 55223 41341 31525 30682 20913 16695 14757 13745 ...
##  $ Read.proportion         : num  0.0857 0.0338 0.0322 0.0241 0.0184 ...
##  $ CDR3.nucleotide.sequence: chr  "TGTGCTGTGATGGATAGCAACTATCAGTTAATCTGG" "TGTGCAGAGAAGTCTAGCAACACAGGCAAACTAATCTTT" "TGTGTGGTGAACTTTACTGGAGGCTTCAAAACTATCTTT" "TGTGCAGCAAGTATAGGGCTTGTATCTAACTTTGGAAATGAGAAATTAACCTTT" ...
##  $ CDR3.amino.acid.sequence: chr  "CAVMDSNYQLIW" "CAEKSSNTGKLIF" "CVVNFTGGFKTIF" "CAASIGLVSNFGNEKLTF" ...
##  $ V.gene                  : chr  "TRAV1-2" "TRAV5" "TRAV12-1" "TRAV13-1" ...
##  $ J.gene                  : chr  "TRAJ33" "TRAJ37" "TRAJ9" "TRAJ48" ...
##  $ D.gene                  : chr  "" "" "" "" ...
##  $ V.end                   : int  9 9 11 12 9 8 4 12 9 12 ...
##  $ J.start                 : int  10 12 14 22 19 9 15 13 15 14 ...
##  $ D5.end                  : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ D3.end                  : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ VD.insertions           : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ DJ.insertions           : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ Total.insertions        : int  0 2 2 9 9 0 10 0 5 1 ...
str(twb[[1]])
## 'data.frame':    10000 obs. of  16 variables:
##  $ Umi.count               : logi  NA NA NA NA NA NA ...
##  $ Umi.proportion          : logi  NA NA NA NA NA NA ...
##  $ Read.count              : int  81516 46158 32476 30356 27321 23760 22232 20968 20603 17429 ...
##  $ Read.proportion         : num  0.0578 0.0327 0.023 0.0215 0.0194 ...
##  $ CDR3.nucleotide.sequence: chr  "TGTGCCAGCAGCCAAGCTCTAGCGGGAGCAGATACGCAGTATTTT" "TGTGCCAGCAGCTTAGGCCCCAGGAACACCGGGGAGCTGTTTTTT" "TGTGCCAGCAGTTATGGAGGGGCGGCAGATACGCAGTATTTT" "TGCAGTGCTGGAGGGATTGAAACCTCCTACAATGAGCAGTTCTTC" ...
##  $ CDR3.amino.acid.sequence: chr  "CASSQALAGADTQYF" "CASSLGPRNTGELFF" "CASSYGGAADTQYF" "CSAGGIETSYNEQFF" ...
##  $ V.gene                  : chr  "TRBV4-2" "TRBV13" "TRBV12-4, TRBV12-3" "TRBV20-1" ...
##  $ J.gene                  : chr  "TRBJ2-3" "TRBJ2-2" "TRBJ2-3" "TRBJ2-1" ...
##  $ D.gene                  : chr  "TRBD2" "TRBD1, TRBD2" "TRBD2" "TRBD1, TRBD2" ...
##  $ V.end                   : int  15 16 12 12 13 9 11 14 12 14 ...
##  $ J.start                 : int  18 17 15 13 20 15 14 16 15 17 ...
##  $ D5.end                  : int  27 20 20 15 23 21 19 18 17 21 ...
##  $ D3.end                  : int  28 23 25 23 24 22 21 23 20 28 ...
##  $ VD.insertions           : int  2 0 2 0 6 5 2 1 2 2 ...
##  $ DJ.insertions           : int  0 2 4 7 0 0 1 4 2 6 ...
##  $ Total.insertions        : int  2 2 6 7 6 5 3 5 4 8 ...

Repertoire descriptive statistics

For the exploratory analysis tcR provides various functions for computing descriptive statistics.

Cloneset summary

To get a general view of a subject’s repertoire (overall count of sequences, in- and out-of-frames numbers and proportions) use the cloneset.stats function. It returns a summary of counts of nucleotide and amino acid clonotypes, as well as summary of read counts:

cloneset.stats(twb)
##        #Nucleotide clones #Aminoacid clonotypes %Aminoacid clonotypes
## Subj.A              10000                  9850                0.9850
## Subj.B              10000                  9838                0.9838
## Subj.C              10000                  9775                0.9775
## Subj.D              10000                  9872                0.9872
##        #In-frames %In-frames #Out-of-frames %Out-of-frames Sum.reads
## Subj.A       9622     0.9622            346         0.0346   1410263
## Subj.B       9564     0.9564            400         0.0400   2251408
## Subj.C       9791     0.9791            192         0.0192    969949
## Subj.D       9225     0.9225            712         0.0712   1419130
##        Min.reads 1st Qu.reads Median.reads Mean.reads 3rd Qu.reads
## Subj.A        22           26           33     141.00           57
## Subj.B        20           24           31     225.10           55
## Subj.C        23           28           39      96.99           68
## Subj.D        32           37           48     141.90           83
##        Max.reads
## Subj.A     81520
## Subj.B    171200
## Subj.C    104600
## Subj.D     33590

For characterisation of a library use the repseq.stats function:

repseq.stats(twb)
##        Clones Sum.reads Reads.per.clone
## Subj.A  10000   1410263          141.03
## Subj.B  10000   2251408          225.14
## Subj.C  10000    969949           96.99
## Subj.D  10000   1419130          141.91

Most abundant clonotypes statistics

Function clonal.proportion is used to get the number of most abundant by the count of reads clonotypes. E.g., compute number of clonotypes which fill up (approx.) the 25% from total repertoire’s “Read.count”:

                            # How many clonotypes fill up approximately
clonal.proportion(twb, 25)  # the 25% of the sum of values in 'Read.count'?
##        Clones Percentage Clonal.count.prop
## Subj.A     12       25.1            0.0012
## Subj.B      6       26.5            0.0006
## Subj.C      7       25.2            0.0007
## Subj.D     38       25.2            0.0038

To get a proportion of the most abundant clonotypes’ sum of reads to the overall number of reads in a repertoire, use top.proportion, i.e. get

(\(\sum\) reads of top clonotypes)\(/\)(\(\sum\) reads for all clonotypes).

E.g., get a proportion of the top-10 clonotypes’ reads to the overall number of reads:

                          # What accounts a proportion of the top-10 clonotypes' reads
top.proportion(twb, 10)   # to the overall number of reads?
##    Subj.A    Subj.B    Subj.C    Subj.D 
## 0.2289069 0.3648699 0.2620158 0.1305398
vis.top.proportions(twb)  # Plot this proportions.