Quality of genotypes and sequences can be assessed with several functions. For genotypic data, by-locus summaries can be conducted with the summarizeLoci function, which will also produce summaries for each strata:
data(msats.g)
msats <- msats.g
smry <- summarizeLoci(msats)
head(smry)
## num.genotyped prop.genotyped num.alleles allelic.richness
## D11t 125 0.99 12 0.096
## EV37 119 0.94 22 0.185
## EV94 125 0.99 15 0.120
## Ttr11 125 0.99 9 0.072
## Ttr34 126 1.00 10 0.079
## prop.unique.alleles exptd.heterozygosity obsvd.heterozygosity
## D11t 0.250 0.75 0.70
## EV37 0.136 0.83 0.70
## EV94 0.067 0.83 0.78
## Ttr11 0.222 0.80 0.70
## Ttr34 0.200 0.81 0.70
The dupGenotypes function identifies samples that have the same or nearly the same genotypes. The number (or percent) of loci that must be shared in order for it to be considered a duplicate can be set by the num.shared argument. The return data.frame provides which loci the two samples show mismatches at so they can be reviewed.
# Find samples that share alleles at 2/3rds of the loci
dupGenotypes(msats, num.shared = 0.66)
## ids.1 ids.2 strata.1 strata.2 num.loci.genotyped
## 1 41579 45237 Coastal Coastal 4
## 2 23945 78065 Coastal Coastal 5
## 3 25503 78053 Coastal Coastal 5
## 4 25509 41822 Coastal Coastal 5
## 5 41540 78040 Coastal Coastal 5
## 6 41578 45233 Coastal Coastal 5
## 7 44720 78058 Coastal Coastal 5
## 8 45230 78040 Coastal Coastal 5
## 9 78034 78040 Coastal Coastal 5
## 10 78034 78043 Coastal Coastal 5
## 11 78035 78053 Coastal Coastal 5
## 12 78045 78058 Coastal Coastal 5
## 13 40916 78038 Coastal Coastal 4
## 14 41820 42193 Coastal Coastal 4
## 15 42193 44719 Coastal Coastal 4
## 16 42193 45230 Coastal Coastal 4
## 17 42193 78035 Coastal Coastal 4
## 18 42193 78045 Coastal Coastal 4
## 19 41821 4495 Coastal Offshore.North 3
## 20 4495 78041 Offshore.North Coastal 3
## 21 4495 78054 Offshore.North Coastal 3
## 22 4495 78056 Offshore.North Coastal 3
## num.loci.shared prop.loci.shared mismatch.loci
## 1 4 1.00
## 2 4 0.80 EV37
## 3 4 0.80 Ttr11
## 4 4 0.80 Ttr34
## 5 4 0.80 D11t
## 6 4 0.80 EV94
## 7 4 0.80 EV94
## 8 4 0.80 Ttr11
## 9 4 0.80 EV94
## 10 4 0.80 EV37
## 11 4 0.80 D11t
## 12 4 0.80 Ttr34
## 13 3 0.75 Ttr34
## 14 3 0.75 Ttr11
## 15 3 0.75 EV94
## 16 3 0.75 EV94
## 17 3 0.75 Ttr34
## 18 3 0.75 EV94
## 19 2 0.67 Ttr11
## 20 2 0.67 Ttr11
## 21 2 0.67 Ttr11
## 22 2 0.67 Ttr11
The start and end positions and number of N’s and indels can be generated with the summarizeSeqs function:
library(ape)
data(dolph.seqs)
seq.smry <- summarizeSeqs(as.DNAbin(dolph.seqs))
head(seq.smry)
## start end length num.ns num.indels
## 4495 1 402 402 0 2
## 4496 1 402 402 0 2
## 4498 1 402 402 0 1
## 5814 1 402 402 0 2
## 5815 1 402 402 0 2
## 5816 1 402 402 0 2
Base frequencies can be generated with baseFreqs:
bf <- baseFreqs(as.DNAbin(dolph.seqs))
# nucleotide frequencies by site
bf$site.freq[, 1:15]
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## a 0 126 126 126 126 126 5 0 0 0 0 126 0 0 0
## c 0 0 0 0 0 0 0 0 126 0 0 0 0 0 0
## g 126 0 0 0 0 0 0 126 0 0 0 0 0 0 126
## t 0 0 0 0 0 0 0 0 0 126 126 0 126 126 0
## u 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## r 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## b 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## d 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## h 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## v 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## . 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# overall nucleotide frequencies
bf$base.freqs
##
## a c g t u r y m k w
## 0.2997 0.2282 0.1283 0.3389 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## s b d h v n x - .
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0048 0.0000
Sequences can be scanned for low-frequency substitutions with lowFreqSubs:
lowFreqSubs(as.DNAbin(dolph.seqs), min.freq = 2)
## id site base motif
## 1 23792 274 t cctattgatcc
## 2 23794 287 g cctccgttata
## 3 26304 274 a cctataaatcc
## 4 26304 394 t taccttgtggg
## 5 74962 57 a taaaaataatt
## 6 74962 104 g catacgcatgt
## 7 74962 392 t catgctccgtg
## 8 74962 393 c atgctccgtgg
Unusual sequences can be identified by plotting likelihoods based on pairwise distances:
data(dolph.haps)
sequenceLikelihoods(as.DNAbin(dolph.haps))
## id mean.dist negLogLik deltaLogLik
## 1 Hap.01 8.0 85 1.7
## 2 Hap.02 6.3 99 16.0
## 3 Hap.03 7.1 92 8.9
## 4 Hap.04 8.1 92 8.8
## 5 Hap.05 7.9 86 2.9
## 6 Hap.06 13.1 102 19.2
## 7 Hap.07 7.2 90 6.8
## 8 Hap.08 8.9 85 1.4
## 9 Hap.09 7.3 91 7.5
## 10 Hap.10 7.8 94 11.1
## 11 Hap.11 7.7 83 0.0
## 12 Hap.12 11.6 91 7.4
## 13 Hap.13 8.6 88 4.3
## 14 Hap.14 8.2 91 7.6
## 15 Hap.15 7.4 98 14.8
## 16 Hap.16 8.3 86 2.9
## 17 Hap.17 7.2 86 2.2
## 18 Hap.18 11.7 90 7.2
## 19 Hap.19 11.3 90 7.0
## 20 Hap.20 8.1 87 3.3
## 21 Hap.21 8.5 88 4.3
## 22 Hap.22 12.7 104 20.7
## 23 Hap.23 9.7 93 9.2
## 24 Hap.24 8.8 86 2.7
## 25 Hap.25 11.2 85 1.8
## 26 Hap.26 8.7 87 3.3
## 27 Hap.27 7.2 86 3.2
## 28 Hap.28 11.0 84 1.2
## 29 Hap.29 11.8 97 13.4
## 30 Hap.30 8.9 93 9.4
## 31 Hap.31 7.2 91 8.1
## 32 Hap.32 13.7 110 26.3
## 33 Hap.33 7.9 92 8.8
All of the above functions can be conducted at once with the qaqc function. Only those functions appropriate to the data type contained (haploid or diploid) will be run. Files are written for each output that are labelled either by the @description slot of the gtypes object or the optional label argument of the function.