The tcR package designed to help researchers in the immunology field to analyse T cell receptor (TCR
) and immunoglobulin (Ig
) repertoires. In this vignette, I will cover procedures for immune receptor repertoire analysis provided with the package.
Terms:
Clonotype: a group of T / B cell clones with equal CDR3 nucleotide sequences and equal Variable genes.
Cloneset / repertoire: a set of clonotypes. Represented as a data frame in which each row corresponds to a unique clonotype.
UMI: Unique Molecular Identifier (see this paper for details)
There are two datasets provided with the package - twins data and V(D)J recombination genes data.
twa.rda
, twb.rda
- two lists with 4 data frames in each list. Every data frame is a sample downsampled to the 10000 most abundant clonotypes of twins data (alpha and beta chains). Full data is available here:
Twins TCR data at Laboratory of Comparative and Functional Genomics
Explore the data:
# Load the package and load the data.
library(tcR)
data(twa) # "twa" - list of length 4
data(twb) # "twb" - list of length 4
# Explore the data.
head(twa[[1]])
head(twb[[1]])
Gene alphabets - character vectors with names of genes for TCR and Ig.
# Run help to see available alphabets.
?genealphabets
?genesegments
data(genesegments)
For the exploratory analysis of a single repertoire, use the RMarkdown report file at
"<path to the tcR package>/inst/library.report.Rmd"
Analysis in the file include statistics and visualisation of number of clones, clonotypes, in- and out-of-frame sequences, unique amino acid CDR3 sequences, V- and J-usage, most frequent k-mers, rarefaction analysis.
For the analysis of a group of repertoires (“cross-analysis”), use the RMarkdown report file at:
"<path to the tcR package>/inst/crossanalysis.report.Rmd}"
Analysis in this file include statistics and visualisation of number of shared clones and clonotypes, V-usage for individuals and groups, J-usage for individuals, Jensen-Shannon divergence among V-usages of repertoires and top-cross.
You need the knitr package installed in order to generate reports from default pipelines. In RStudio you can run a pipeline file as follows:
Run RStudio -> load the pipeline .Rmd files -> press the knitr button
Currently in tcR there are implemented parser for the next software:
MiTCR - parse.mitcr
;
MiTCR w/ UMIs - parse.mitcrbc
;
MiGEC - parse.migec
;
VDJtools - parse.vdjtools
;
ImmunoSEQ - parse.immunoseq
;
MiXCR - parse.mixcr
;
IMSEQ - parse.imseq
.
Also a general parser parse.cloneset
for a text table files is implemented. General wrapper for parsers is parse.file
. User can also parse a list of files or the entire folder. Run ?parse.folder
to see a help on parsing input files and a list of functions for parsing a specific input format.
# Parse file in "~/mitcr/immdata1.txt" as a MiTCR file.
immdata1 <- parse.file("~/mitcr_data/immdata1.txt", 'mitcr')
# equivalent to
immdata1.eq <- parse.mitcr("~/mitcr_data/immdata1.txt")
# Parse folder with MiGEC files.
immdata <- parse.folder("~/migec_data/", 'migec')
Clonesets represented in tcR as data frames with each row corresponding to the one nucleotide clonotype and with specific column names:
Any data frame with this columns and of this class is suitable for processing with the package, hence user can generate their own table files and load them for the further analysis using read.csv
, read.table
and other base
R functions. Please note that tcR internally expects all strings to be of class “character”, not “factor”. Therefore you should use R parsing functions with parameter stringsAsFactors=FALSE.
# No D genes is available here hence "" at "D.genes" and "-1" at positions.
str(twa[[1]])
## 'data.frame': 10000 obs. of 16 variables:
## $ Umi.count : logi NA NA NA NA NA NA ...
## $ Umi.proportion : logi NA NA NA NA NA NA ...
## $ Read.count : int 147158 58018 55223 41341 31525 30682 20913 16695 14757 13745 ...
## $ Read.proportion : num 0.0857 0.0338 0.0322 0.0241 0.0184 ...
## $ CDR3.nucleotide.sequence: chr "TGTGCTGTGATGGATAGCAACTATCAGTTAATCTGG" "TGTGCAGAGAAGTCTAGCAACACAGGCAAACTAATCTTT" "TGTGTGGTGAACTTTACTGGAGGCTTCAAAACTATCTTT" "TGTGCAGCAAGTATAGGGCTTGTATCTAACTTTGGAAATGAGAAATTAACCTTT" ...
## $ CDR3.amino.acid.sequence: chr "CAVMDSNYQLIW" "CAEKSSNTGKLIF" "CVVNFTGGFKTIF" "CAASIGLVSNFGNEKLTF" ...
## $ V.gene : chr "TRAV1-2" "TRAV5" "TRAV12-1" "TRAV13-1" ...
## $ J.gene : chr "TRAJ33" "TRAJ37" "TRAJ9" "TRAJ48" ...
## $ D.gene : chr "" "" "" "" ...
## $ V.end : int 9 9 11 12 9 8 4 12 9 12 ...
## $ J.start : int 10 12 14 22 19 9 15 13 15 14 ...
## $ D5.end : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ D3.end : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ VD.insertions : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ DJ.insertions : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ Total.insertions : int 0 2 2 9 9 0 10 0 5 1 ...
str(twb[[1]])
## 'data.frame': 10000 obs. of 16 variables:
## $ Umi.count : logi NA NA NA NA NA NA ...
## $ Umi.proportion : logi NA NA NA NA NA NA ...
## $ Read.count : int 81516 46158 32476 30356 27321 23760 22232 20968 20603 17429 ...
## $ Read.proportion : num 0.0578 0.0327 0.023 0.0215 0.0194 ...
## $ CDR3.nucleotide.sequence: chr "TGTGCCAGCAGCCAAGCTCTAGCGGGAGCAGATACGCAGTATTTT" "TGTGCCAGCAGCTTAGGCCCCAGGAACACCGGGGAGCTGTTTTTT" "TGTGCCAGCAGTTATGGAGGGGCGGCAGATACGCAGTATTTT" "TGCAGTGCTGGAGGGATTGAAACCTCCTACAATGAGCAGTTCTTC" ...
## $ CDR3.amino.acid.sequence: chr "CASSQALAGADTQYF" "CASSLGPRNTGELFF" "CASSYGGAADTQYF" "CSAGGIETSYNEQFF" ...
## $ V.gene : chr "TRBV4-2" "TRBV13" "TRBV12-4, TRBV12-3" "TRBV20-1" ...
## $ J.gene : chr "TRBJ2-3" "TRBJ2-2" "TRBJ2-3" "TRBJ2-1" ...
## $ D.gene : chr "TRBD2" "TRBD1, TRBD2" "TRBD2" "TRBD1, TRBD2" ...
## $ V.end : int 15 16 12 12 13 9 11 14 12 14 ...
## $ J.start : int 18 17 15 13 20 15 14 16 15 17 ...
## $ D5.end : int 27 20 20 15 23 21 19 18 17 21 ...
## $ D3.end : int 28 23 25 23 24 22 21 23 20 28 ...
## $ VD.insertions : int 2 0 2 0 6 5 2 1 2 2 ...
## $ DJ.insertions : int 0 2 4 7 0 0 1 4 2 6 ...
## $ Total.insertions : int 2 2 6 7 6 5 3 5 4 8 ...
For the exploratory analysis tcR provides various functions for computing descriptive statistics.
To get a general view of a subject’s repertoire (overall count of sequences, in- and out-of-frames numbers and proportions) use the cloneset.stats
function. It returns a summary
of counts of nucleotide and amino acid clonotypes, as well as summary of read counts:
cloneset.stats(twb)
## #Nucleotide clones #Aminoacid clonotypes %Aminoacid clonotypes
## Subj.A 10000 9850 0.9850
## Subj.B 10000 9838 0.9838
## Subj.C 10000 9775 0.9775
## Subj.D 10000 9872 0.9872
## #In-frames %In-frames #Out-of-frames %Out-of-frames Sum.reads
## Subj.A 9622 0.9622 346 0.0346 1410263
## Subj.B 9564 0.9564 400 0.0400 2251408
## Subj.C 9791 0.9791 192 0.0192 969949
## Subj.D 9225 0.9225 712 0.0712 1419130
## Min.reads 1st Qu.reads Median.reads Mean.reads 3rd Qu.reads
## Subj.A 22 26 33 141.00 57
## Subj.B 20 24 31 225.10 55
## Subj.C 23 28 39 96.99 68
## Subj.D 32 37 48 141.90 83
## Max.reads
## Subj.A 81520
## Subj.B 171200
## Subj.C 104600
## Subj.D 33590
For characterisation of a library use the repseq.stats
function:
repseq.stats(twb)
## Clones Sum.reads Reads.per.clone
## Subj.A 10000 1410263 141.03
## Subj.B 10000 2251408 225.14
## Subj.C 10000 969949 96.99
## Subj.D 10000 1419130 141.91
Function clonal.proportion
is used to get the number of most abundant by the count of reads clonotypes. E.g., compute number of clonotypes which fill up (approx.) the 25% from total repertoire’s “Read.count”:
# How many clonotypes fill up approximately
clonal.proportion(twb, 25) # the 25% of the sum of values in 'Read.count'?
## Clones Percentage Clonal.count.prop
## Subj.A 12 25.1 0.0012
## Subj.B 6 26.5 0.0006
## Subj.C 7 25.2 0.0007
## Subj.D 38 25.2 0.0038
To get a proportion of the most abundant clonotypes’ sum of reads to the overall number of reads in a repertoire, use top.proportion
, i.e. get
(\(\sum\) reads of top clonotypes)\(/\)(\(\sum\) reads for all clonotypes).
E.g., get a proportion of the top-10 clonotypes’ reads to the overall number of reads:
# What accounts a proportion of the top-10 clonotypes' reads
top.proportion(twb, 10) # to the overall number of reads?
## Subj.A Subj.B Subj.C Subj.D
## 0.2289069 0.3648699 0.2620158 0.1305398
vis.top.proportions(twb) # Plot this proportions.