It is easiest to load raw genotype data from the disk if it is available as a text file, usually in comma-delimited (.csv) format. The standard R functions read.table or read.csv can be used to accomplish this. However, in strataG, the readGenData function has been provided for .csv files, which is a wrapper for read.csv that sets commonly used values for missing data and removes blank lines.
gen.data <- readGenData("msats.csv")
str(gen.data)
## 'data.frame': 126 obs. of 12 variables:
## $ ids : chr "4495" "4496" "4498" "5814" ...
## $ strata : chr "Offshore.North" "Offshore.North" "Offshore.North" "Offshore.North" ...
## $ D11t.1 : chr "133" "137" "133" "135" ...
## $ D11t.2 : chr "133" "141" "133" "139" ...
## $ EV37.1 : chr NA NA "220" "210" ...
## $ EV37.2 : chr NA NA "220" "226" ...
## $ EV94.1 : chr NA "243" "245" "251" ...
## $ EV94.2 : chr NA "251" "265" "265" ...
## $ Ttr11.1: chr "211" NA "209" "209" ...
## $ Ttr11.2: chr "211" NA "211" "209" ...
## $ Ttr34.1: chr "185" "189" "189" "185" ...
## $ Ttr34.2: chr "191" "189" "189" "187" ...
For sequence data stored in FASTA format, the read.fasta function is available, which is a wrapper for the read.dna function in the ape package with standard FASTA arguments set. This will create a DNAbin object in the workspace:
fname <- system.file("extdata/dolph.seqs.fasta", package = "strataG")
x <- read.fasta(fname)
x
## 126 DNA sequences in binary format stored in a list.
##
## All sequences of same length: 402
##
## Labels:
## 4495
## 4496
## 4498
## 5814
## 5815
## 5816
## ...
##
## Base composition:
## a c g t
## 0.301 0.229 0.129 0.341
For sequences stored in other formats, read.dna should be used directly.
For most functions in strataG, you will need to load your data into a gtypes object. A gtypes object is an R S4 class with the following slots:
ids
.A gtypes object can be created with the standard S4 new constructor function, like this:
#--- create a diploid (microsatellite) gtypes object
data(dolph.msats)
data(dolph.strata)
strata.schemes <- dolph.strata[, c("broad", "fine")]
rownames(strata.schemes) <- dolph.strata$id
msats <- new("gtypes", gen.data = dolph.msats[, -1], ploidy = 2,
ind.names = dolph.msats[, 1], schemes = strata.schemes)
## Warning in .local(.Object, ...): the length of 'strata' is not the same as
## the number of individuals. strata will be recycled.
#--- create a haploid sequence (mtDNA) gtypes object
data(dolph.seqs)
dloop.haps <- cbind(haps = dolph.strata$id)
rownames(dloop.haps) <- dolph.strata$id
dloop <- new("gtypes", gen.data = dloop.haps, ploidy = 1,
sequences = dolph.seqs)
## Warning in .local(.Object, ...): the length of 'strata' is not the same as
## the number of individuals. strata will be recycled.
Both of these examples create a gtypes where every individual is in a single default stratum. To stratify samples when created, supply either a vector, factor, or stratification scheme name to the ‘strata’ argument in new, like so:
msats.fine <- new("gtypes", gen.data = dolph.msats[, -1], ploidy = 2,
ind.names = dolph.msats[, 1], schemes = strata.schemes, strata = "fine")
## Warning in .local(.Object, ...): the length of 'strata' is not the same as
## the number of individuals. strata will be recycled.
The df2gtypes function assumes that you have a matrix or data.frame with columns for individual ids, stratification, and locus data. You then specify the columns in the data.frame where this information can be found. df2gtypes can be used for data with multiple alleles per locus, like this:
# create a single data.frame with the msat data and stratification
msats.merge <- merge(dolph.strata, dolph.msats, all.y = TRUE, description = date())
str(msats.merge)
## 'data.frame': 126 obs. of 14 variables:
## $ id : chr "18650" "18651" "18652" "18653" ...
## $ dLoop : chr "Hap.06" "Hap.14" "Hap.23" "Hap.24" ...
## $ broad : chr "Offshore" "Offshore" "Offshore" "Offshore" ...
## $ fine : chr "Offshore.South" "Offshore.South" "Offshore.South" "Offshore.South" ...
## $ D11t.1 : chr "131" "137" "131" "133" ...
## $ D11t.2 : chr "133" "143" "133" "139" ...
## $ EV37.1 : chr "212" "200" "212" "202" ...
## $ EV37.2 : chr "222" "228" "218" "220" ...
## $ EV94.1 : chr "263" "249" "249" "229" ...
## $ EV94.2 : chr "265" "251" "251" "245" ...
## $ Ttr11.1: chr "197" "197" "211" "197" ...
## $ Ttr11.2: chr "209" "197" "213" "215" ...
## $ Ttr34.1: chr "185" "185" "185" "189" ...
## $ Ttr34.2: chr "187" "187" "191" "195" ...
# create the gtypes object
msats.fine <- df2gtypes(msats.merge, ploidy = 2, id.col = 1, strata.col = 3, loc.col = 5)
…or for haploid data, like this:
data(dolph.seqs)
seq.df <- dolph.strata[ c("id", "broad", "id")]
colnames(seq.df)[3] <- "D-loop"
dl.g <- df2gtypes(seq.df, ploidy = 1, sequences = dolph.seqs)
dl.g
##
## <<< gtypes created on 2017-04-10 20:52:02 >>>
##
## Contents: 126 samples, 1 locus, 2 strata
##
## Strata summary:
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 68 0 68 1
## Offshore 58 0 58 1
## heterozygosity
## Coastal 1
## Offshore 1
Note that since each sequence in dolph.seqs is for a given individual, the pct.unique.alleles and heterozygosity values are for every stratum 1 and num.samples equals num.alleles. In order to convert the sequences to unique haplotypes, use the labelHaplotypes function:
dl.haps <- labelHaplotypes(dl.g)
dl.haps$gtypes
##
## <<< gtypes created on 2017-04-10 20:52:02 >>>
##
## Contents: 126 samples, 1 locus, 2 strata
##
## Strata summary:
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 68 0 5 0.0000000
## Offshore 58 0 29 0.4827586
## heterozygosity
## Coastal 0.7427568
## Offshore 0.9594676
The sequence2gtypes function creates an unstratified gtype object with just a set of DNA sequences:
data(dolph.haps)
haps.g <- sequence2gtypes(dolph.haps)
## Warning in .local(.Object, ...): the length of 'strata' is not the same as
## the number of individuals. strata will be recycled.
haps.g
##
## <<< gtypes created on 2017-04-10 20:52:02 >>>
##
## Contents: 33 samples, 1 locus, 1 stratum
##
## Strata summary:
## num.samples num.missing num.alleles prop.unique.alleles
## Default 33 0 33 1
## heterozygosity
## Default 1
If you have a vector that identifies strata designations for the sequences, that can be supplied as well:
# extract and name the stratification scheme
strata <- dolph.strata$fine
names(strata) <- dolph.strata$ids
# create the gtypes object
dloop.fine <- sequence2gtypes(dolph.seqs, strata, seq.names = "dLoop",
description = "dLoop: fine-scale stratification")
dloop.fine
##
## <<< dLoop: fine-scale stratification >>>
##
## Contents: 126 samples, 1 locus, 3 strata
##
## Strata summary:
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 68 0 68 1
## Offshore.North 40 0 40 1
## Offshore.South 18 0 18 1
## heterozygosity
## Coastal 1
## Offshore.North 1
## Offshore.South 1
Note that stratification is generally provided for individuals, thus if you want to stratify the resulting gtypes object from sequence2gtypes, one sequence for each individual should be provided, rather than just a set of unique haplotypes.
This converts from a genind object from the adegenet package.
library(adegenet)
# from example(df2genind)
df <- data.frame(locusA=c("11","11","12","32"),
locusB=c(NA,"34","55","15"),
locusC=c("22","22","21","22"))
row.names(df) <- .genlab("genotype",4)
obj <- df2genind(df, ploidy=2, ncode=1)
obj
## /// GENIND OBJECT /////////
##
## // 4 individuals; 3 loci; 9 alleles; size: 6.1 Kb
##
## // Basic content
## @tab: 4 x 9 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 2-4)
## @loc.fac: locus factor for the 9 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: df2genind(X = df, ncode = 1, ploidy = 2)
##
## // Optional content
## - empty -
# convert to gtypes
gi.g <- genind2gtypes(obj)
## Warning in .local(.Object, ...): the length of 'strata' is not the same as
## the number of individuals. strata will be recycled.
gi.g
##
## <<< gtypes created on 2017-04-10 20:52:02 >>>
##
## Contents: 4 samples, 3 loci, 1 stratum
##
## Strata summary:
## num.samples num.missing num.alleles prop.unique.alleles
## Default 4 0.3333333 3 0.5277778
## heterozygosity
## Default 0.4722222
There are several functions for getting basic information from a gtypes object:
A gtypes object can be subset using the standard R ‘[’ indexing operation, with three slots: [i, j, k]. The first (i) specifies the desired individuals, the second (j) is the loci to return, and the third (k) is the strata. All standard R indexing operations involving numerical, character, or logical vectors work for each argument. For example, to return 10 random individuals:
sub.msats <- msats.fine[sample(nInd(msats.fine), 10), , ]
sub.msats
##
## <<< gtypes created on 2017-04-10 20:52:02 >>>
##
## Contents: 10 samples, 5 loci, 2 strata
##
## Strata summary:
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 5 0.2 3.0 0.1333333
## Offshore 5 0.0 6.2 0.6609524
## heterozygosity
## Coastal 0.71
## Offshore 0.76
…or to return specific loci:
sub.msats <- sub.msats[, c("D11t", "EV37", "TV7"), ]
sub.msats
##
## <<< gtypes created on 2017-04-10 20:52:02 >>>
##
## Contents: 10 samples, 2 loci, 2 strata
##
## Strata summary:
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 5 0.5 2.5 0.1666667
## Offshore 5 0.0 8.0 0.8571429
## heterozygosity
## Coastal 0.575
## Offshore 0.900
…or some loci in a specific stratum:
sub.msats <- msats.fine[, c("Ttr11", "D11t"), "Coastal"]
sub.msats
##
## <<< gtypes created on 2017-04-10 20:52:02 >>>
##
## Contents: 68 samples, 2 loci, 1 stratum
##
## Strata summary:
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 68 0.5 3.5 0
## heterozygosity
## Coastal 0.5773705
A summary function has been defined for gtypes, and the return value provides the number of individuals, loci, strata, allele frequencies, a by-strata summary, and a by-locus summary:
msats.smry <- summary(msats.fine)
str(msats.smry)
## List of 8
## $ num.ind : int 126
## $ num.loc : num 5
## $ num.strata : int 2
## $ unstratified: int 0
## $ strata.smry : num [1:2, 1:5] 68 58 1.2 0.8 4.8 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Coastal" "Offshore"
## .. ..$ : chr [1:5] "num.samples" "num.missing" "num.alleles" "prop.unique.alleles" ...
## $ allele.freqs:List of 5
## ..$ D11t : num [1:12, 1:2, 1:2] 0 0 0 0 0 0 48 83 3 0 ...
## .. ..- attr(*, "dimnames")=List of 3
## .. .. ..$ : chr [1:12] "117" "119" "121" "127" ...
## .. .. ..$ : chr [1:2] "freq" "prop"
## .. .. ..$ : chr [1:2] "Coastal" "Offshore"
## ..$ EV37 : num [1:22, 1:2, 1:2] 0 0 0 0 0 0 0 1 71 33 ...
## .. ..- attr(*, "dimnames")=List of 3
## .. .. ..$ : chr [1:22] "190" "200" "202" "204" ...
## .. .. ..$ : chr [1:2] "freq" "prop"
## .. .. ..$ : chr [1:2] "Coastal" "Offshore"
## ..$ EV94 : num [1:15, 1:2, 1:2] 0 0 0 12 0 47 30 0 0 25 ...
## .. ..- attr(*, "dimnames")=List of 3
## .. .. ..$ : chr [1:15] "229" "239" "243" "245" ...
## .. .. ..$ : chr [1:2] "freq" "prop"
## .. .. ..$ : chr [1:2] "Coastal" "Offshore"
## ..$ Ttr11: num [1:9, 1:2, 1:2] 0 0 42 0 0 59 33 2 0 0 ...
## .. ..- attr(*, "dimnames")=List of 3
## .. .. ..$ : chr [1:9] "193" "197" "207" "209" ...
## .. .. ..$ : chr [1:2] "freq" "prop"
## .. .. ..$ : chr [1:2] "Coastal" "Offshore"
## ..$ Ttr34: num [1:10, 1:2, 1:2] 0 0 33 19 0 65 8 11 0 0 ...
## .. ..- attr(*, "dimnames")=List of 3
## .. .. ..$ : chr [1:10] "179" "183" "185" "187" ...
## .. .. ..$ : chr [1:2] "freq" "prop"
## .. .. ..$ : chr [1:2] "Coastal" "Offshore"
## $ sample.smry :'data.frame': 126 obs. of 5 variables:
## ..$ id : chr [1:126] "18650" "18651" "18652" "18653" ...
## ..$ strata : Factor w/ 2 levels "Offshore","Coastal": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ num.loci.missing.genotypes: int [1:126] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ pct.loci.missing.genotypes: num [1:126] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ pct.loci.homozygous : num [1:126] 0 0.2 0 0 0.2 0 0 0.2 0.2 0 ...
## $ locus.smry : num [1:5, 1:7] 125 119 125 125 126 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "D11t" "EV37" "EV94" "Ttr11" ...
## .. ..$ : chr [1:7] "num.genotyped" "prop.genotyped" "num.alleles" "allelic.richness" ...
## - attr(*, "description")= chr "gtypes created on 2017-04-10 20:52:02"
## - attr(*, "class")= chr [1:2] "gtypeSummary" "list"
For just the by-locus summary statistics:
summarizeLoci(msats.fine)
## num.genotyped prop.genotyped num.alleles allelic.richness
## D11t 125 0.9920635 12 0.09600000
## EV37 119 0.9444444 22 0.18487395
## EV94 125 0.9920635 15 0.12000000
## Ttr11 125 0.9920635 9 0.07200000
## Ttr34 126 1.0000000 10 0.07936508
## prop.unique.alleles exptd.heterozygosity obsvd.heterozygosity
## D11t 0.25000000 0.7468273 0.7040000
## EV37 0.13636364 0.8270751 0.6974790
## EV94 0.06666667 0.8304900 0.7760000
## Ttr11 0.22222222 0.7953414 0.7040000
## Ttr34 0.20000000 0.8144248 0.6984127
You can specify the stratification scheme when creating a gtypes object as in the examples above. Once a gtypes object has been created, you can also change the stratification scheme by either supplying a new vector for the @strata slot:
# randomly stratify individuals to two populations
strata(msats) <- sample(c("Pop1", "Pop2"), nInd(msats), rep = TRUE)
summary(msats)$strata.smry
## num.samples num.missing num.alleles prop.unique.alleles
## Pop1 90 0.7 12.2 0.20645743
## Pop2 100 1.3 11.0 0.07973856
## heterozygosity
## Pop1 0.2185181
## Pop2 0.2484917
or, if there is a stratification scheme data.frame in the @schemes slot, you can use the stratify function to choose a stratification scheme:
# choose "broad" stratification scheme
msats <- stratify(msats, "broad")
summary(msats)$strata.smry
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 68 1.2 4.8 0.08571429
## Offshore 58 0.8 13.6 0.17505051
## heterozygosity
## Coastal 0.6312283
## Offshore 0.8144326
If the @schemes slot is empty (NULL), you can fill it with a proper data.frame like this:
schemes(dloop) <- strata.schemes
NOTE: Filling or changing the @schemes slot does not affect the current stratification of the samples. You must then select a new stratification scheme or fill the @strata slot as above.
stratify(dloop, "fine")
##
## <<< gtypes created on 2017-04-10 20:52:02 >>>
##
## Contents: 126 samples, 1 locus, 3 strata
## Stratification schemes: broad, fine
##
## Strata summary:
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 68 0 68 1
## Offshore.North 40 0 40 1
## Offshore.South 18 0 18 1
## heterozygosity
## Coastal 1
## Offshore.North 1
## Offshore.South 1
If some samples should be unstratified (excluded from any stratified analyses), they should have NAs in the appropriate position in the @strata slot. For example:
# unstratify a random 10 samples
x <- strata(msats)
x[sample(indNames(msats), 10)] <- NA
strata(msats) <- x
summary(msats)$strata.smry
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 64 1.2 4.8 0.08571429
## Offshore 52 0.8 13.4 0.16682540
## heterozygosity
## Coastal 0.6333106
## Offshore 0.8122594
You can also randomly permute the current stratification scheme using the permuteStrata function like this:
msats <- stratify(msats, "fine")
# original
summary(msats)$strata.smry
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 68 1.2 4.8 0.08571429
## Offshore.North 40 0.8 12.6 0.22398990
## Offshore.South 18 0.0 11.0 0.25098291
## heterozygosity
## Coastal 0.6312283
## Offshore.North 0.7901822
## Offshore.South 0.8666667
# permuted
ran.msats <- permuteStrata(msats)
summary(ran.msats)$strata.smry
## num.samples num.missing num.alleles prop.unique.alleles
## Coastal 68 0.6 12.0 0.1495948
## Offshore.North 40 1.0 10.4 0.2654762
## Offshore.South 18 0.4 7.0 0.2369048
## heterozygosity
## Coastal 0.7420318
## Offshore.North 0.6977408
## Offshore.South 0.6541667
NOTE: Only samples assigned to strata are permuted with permuteStrata. Those not assigned (NAs) remain unassigned.
The allelic data in a gtypes object can be converted back to a matrix with as.matrix:
gen.mat <- as.matrix(msats)
head(gen.mat)
## ids strata D11t.1 D11t.2 EV37.1 EV37.2 EV94.1 EV94.2
## 18650 "18650" "Coastal" "131" "133" "212" "222" "263" "265"
## 18651 "18651" "Coastal" "137" "143" "200" "228" "249" "251"
## 18652 "18652" "Offshore.North" "131" "133" "212" "218" "249" "251"
## 18653 "18653" "Coastal" "133" "139" "202" "220" "229" "245"
## 18654 "18654" "Coastal" "131" "135" "214" "216" "249" "249"
## 18655 "18655" "Coastal" "131" "137" "222" "254" "243" "255"
## Ttr11.1 Ttr11.2 Ttr34.1 Ttr34.2
## 18650 "197" "209" "185" "187"
## 18651 "197" "197" "185" "187"
## 18652 "211" "213" "185" "191"
## 18653 "197" "215" "189" "195"
## 18654 "197" "211" "185" "187"
## 18655 "207" "211" "183" "193"
By default, this function splits each allele into its own column. One can make a matrix with one locus per column and alleles separated by a specified character by setting the one.col argument to TRUE:
gen.mat <- as.matrix(msats, one.col = TRUE)
head(gen.mat)
## ids strata D11t EV37 EV94 Ttr11
## 18650 "18650" "Coastal" "131/133" "212/222" "263/265" "197/209"
## 18651 "18651" "Coastal" "137/143" "200/228" "249/251" "197/197"
## 18652 "18652" "Offshore.North" "131/133" "212/218" "249/251" "211/213"
## 18653 "18653" "Coastal" "133/139" "202/220" "229/245" "197/215"
## 18654 "18654" "Coastal" "131/135" "214/216" "249/249" "197/211"
## 18655 "18655" "Coastal" "131/137" "222/254" "243/255" "207/211"
## Ttr34
## 18650 "185/187"
## 18651 "185/187"
## 18652 "185/191"
## 18653 "189/195"
## 18654 "185/187"
## 18655 "183/193"
The contents of a gtypes object can be written to a file with the write.gtypes function. This will write a .csv file with the allelic information and a .fasta file for any sequence data if it exists.