The bcRep package helps to analyze IMGT/HighV-QUEST* output in more detail. It functions well with B cells, but can also be used for T cell data, in some cases.
Using this package IMGT/HighV-QUEST output files can be analysed, f.e. functionality, junction frames, gene usage and mutations. The package provides functions to analyze clones out of the IMGT/HighV-QUEST output, but also methods to compare sequences and clones, are provided. Further a function to do principal coordinate analysis on dissimilarity/distance measurements is implemented. Almost all results can be visualized via implemented plotting functions.
*: Alamyar, E. et al., IMGT/HighV-QUEST: A High-Throughput System and Web Portal for the Analysis of Rearranged Nucleotide Sequences of Antigen Receptors, JOBIM 2010, Paper 63 (2010). Website: http://www.imgt.org/
bcRep depends on following packages: vegan, gplots, ineq, parallel, doParallel, foreach, ape, stringdist
.
Parallel processing is possible for functions clones(), clones.shared(), sequences.geneComb(), compare.aaDistribution(), compare.geneUsage()
and compare.trueDiversity()
. By default, 1 core is used.
There are several datasets provided in the package. Most of them are examples of IMGT/HighV-QUEST output; two additional files represent clonotype files and one gene usage data:
library(bcRep)
data(summarytab) # An extract from IMGT/HighV-QUEST output file 1_Summary(...).txt
data(ntseqtab) # An extract from IMGT/HighV-QUEST output file 3_Nt-sequences(...).txt
data(aaseqtab) # An extract from IMGT/HighV-QUEST output file 5_AA-sequences(...).txt
data(mutationtab) # An extract from IMGT/HighV-QUEST output file
# 7_V-REGION-mutation-and-AA-change-table(...).txt
data(clones.ind) # Clonotypes of one individual
data(clones.allind) # Clonotypes of eight individuals
data(vgenes) # V gene usage of 10 individuals (rows) and 30 genes (columns)
IMGT/HighV-QUEST can process datasets with up to 500.000 sequences. For larger data sets, FASTA files have to be splitted into smaller datasets and be uploaded individually to IMGT. Afterwards the function combineIMGT()
can be used to combine several folders.
# Example: combine folders IMGT1a, IMGT1b and IMGT1c to a new datset NewProject
combineIMGT(folders = c("pathTo/IMGT1a", "pathTo/IMGT1b", "pathTo/IMGT1c"),
name = "NewProject")
To read IMGT/HighV-QUEST output files, use readIMGT("PathTo/file.txt",filterNoResults=TRUE)
. You can choose between including or excluding sequences without any information (lines with a sequence ID, but “No results”). Spaces and “-” in headers will be replaced by “_“. If there is a particular IMGT table required as input for a function, this is mentioned in the corresponding help file.
Amino acid distribution can be analyzed using aaDistribution()
and visualized with plotAADistribution()
.
This function returns a list containing proportions of all amino acids (including stop codons “*“) for each analyzed sequence length. Optionally, also the number used for analysis can be given.
Figures can be saved as PDF in the working directory.
##
## Attaching package: 'bcRep'
## The following objects are masked _by_ '.GlobalEnv':
##
## clones, clones.IDlist, compare.geneUsage, geneUsage
## The following object is masked from 'package:tcR':
##
## geneUsage
# Example:
data(aaseqtab)
aadistr<-aaDistribution(sequences = aaseqtab$CDR3_IMGT, numberSeq = TRUE)
# First 4 columns of Amino acid distribution table:
# for a sequence length of 13 AA (* = stop codon)
# (aadistr$Amino_acid_distribution$`sequence length = 13`)
Position1 | Position2 | Position3 | Position4 | |
---|---|---|---|---|
F | 0 | 0 | 0.007018 | 0.01053 |
L | 0 | 0.003509 | 0.04561 | 0.08772 |
I | 0 | 0.02105 | 0.02456 | 0.03158 |
M | 0 | 0 | 0.01404 | 0.01754 |
V | 0.003509 | 0.007018 | 0.09123 | 0.04561 |
S | 0 | 0.07719 | 0.08421 | 0.07368 |
P | 0.003509 | 0 | 0.02105 | 0.07018 |
T | 0.02105 | 0.05614 | 0 | 0.03158 |
A | 0.9614 | 0.003509 | 0.05965 | 0.05614 |
Y | 0 | 0 | 0.01754 | 0.05263 |
H | 0 | 0.01404 | 0.01053 | 0.007018 |
C | 0 | 0 | 0.02105 | 0.007018 |
W | 0 | 0 | 0.007018 | 0.02807 |
N | 0 | 0.01404 | 0.007018 | 0.04211 |
D | 0 | 0 | 0.2281 | 0.06316 |
G | 0.003509 | 0.01053 | 0.1825 | 0.207 |
Q | 0 | 0 | 0.01053 | 0.01404 |
R | 0.007018 | 0.607 | 0.06316 | 0.0807 |
K | 0 | 0.186 | 0.007018 | 0.03509 |
E | 0 | 0 | 0.09825 | 0.0386 |
***** | 0 | 0 | 0 | 0 |
# Plot example for sequence lengths of 14-17 amino acids:
aadistr.part<-list(aadistr$Amino_acid_distribution[13:16],
data.frame(aadistr$Number_of_sequences_per_length[13:16,]))
names(aadistr.part)<-names(aadistr)
plotAADistribution(aaDistribution.tab=aadistr.part, plotAADistr=TRUE,
plotSeqN=TRUE, PDF=NULL)
Diversity of sequences and clones can be analyzed using trueDiversity()
or clones.giniIndex()
.
Using trueDiversity()
richness or diversity of sequences with the same length can be analyzed. Basically diversity of amino acids at each position is calculated. True diversity can be measured for orders q = 0, 1 or 2. plotTrueDiversity()
can be used for visualization. If mean.plot = F
, a figure for each sequence length including diversity per position will be returned. If mean.plot = T
only one figure with mean diversity values for all sequences with the same length will be returned.
Order 0: Richness (in this case it represents number of different amino acids per position).
Order 1: Exponential function of Shannon entropy using the natural logarithm (weights all amino acids by their frequency).
Order 2: Inverse Simpson entropy (weights all amino acids by their frequency, but weights are given more to abundant amino acids).
These indices are very similar (Hill, 1973). For example the exponential function of Shannon index is linearly related to inverse Simpson.
Amino acid sequences can be found in IMGT output table 5_AA-sequences(...).txt
.
# Example:
data(aaseqtab)
trueDiv<-trueDiversity(sequences = aaseqtab$CDR3_IMGT, order = 1)
# using exponent of Shannon entropy
# True diversity of order 1 for amino acid length of 5 AA
# (trueDiv$True_diversity$'sequence length = 5')
Position1 | Position2 | Position3 | Position4 | Position5 |
---|---|---|---|---|
1 | 3 | 3 | 1.89 | 3 |
# True diversity for sequences of amino acid length 14-17:
trueDiv.part<-list(trueDiv$True_diversity_order, trueDiv$True_diversity[13:16])
names(trueDiv.part)<-names(trueDiv)
plotTrueDiversity(trueDiversity.tab=trueDiv.part, mean.plot = F,color="darkblue", PDF=NULL)