Creating and Manipulating gtypes Objects

Eric Archer

2017-04-10

Raw data

It is easiest to load raw genotype data from the disk if it is available as a text file, usually in comma-delimited (.csv) format. The standard R functions read.table or read.csv can be used to accomplish this. However, in strataG, the readGenData function has been provided for .csv files, which is a wrapper for read.csv that sets commonly used values for missing data and removes blank lines.

gen.data <- readGenData("msats.csv")
str(gen.data)
## 'data.frame':    126 obs. of  12 variables:
##  $ ids    : chr  "4495" "4496" "4498" "5814" ...
##  $ strata : chr  "Offshore.North" "Offshore.North" "Offshore.North" "Offshore.North" ...
##  $ D11t.1 : chr  "133" "137" "133" "135" ...
##  $ D11t.2 : chr  "133" "141" "133" "139" ...
##  $ EV37.1 : chr  NA NA "220" "210" ...
##  $ EV37.2 : chr  NA NA "220" "226" ...
##  $ EV94.1 : chr  NA "243" "245" "251" ...
##  $ EV94.2 : chr  NA "251" "265" "265" ...
##  $ Ttr11.1: chr  "211" NA "209" "209" ...
##  $ Ttr11.2: chr  "211" NA "211" "209" ...
##  $ Ttr34.1: chr  "185" "189" "189" "185" ...
##  $ Ttr34.2: chr  "191" "189" "189" "187" ...

For sequence data stored in FASTA format, the read.fasta function is available, which is a wrapper for the read.dna function in the ape package with standard FASTA arguments set. This will create a DNAbin object in the workspace:

fname <- system.file("extdata/dolph.seqs.fasta", package = "strataG")
x <- read.fasta(fname) 
x
## 126 DNA sequences in binary format stored in a list.
## 
## All sequences of same length: 402 
## 
## Labels:
## 4495
## 4496
## 4498
## 5814
## 5815
## 5816
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.301 0.229 0.129 0.341

For sequences stored in other formats, read.dna should be used directly.

Construction

For most functions in strataG, you will need to load your data into a gtypes object. A gtypes object is an R S4 class with the following slots:

new constructor

A gtypes object can be created with the standard S4 new constructor function, like this:

#--- create a diploid (microsatellite) gtypes object
data(dolph.msats)
data(dolph.strata)
strata.schemes <- dolph.strata[, c("broad", "fine")]
rownames(strata.schemes) <- dolph.strata$id
msats <- new("gtypes", gen.data = dolph.msats[, -1], ploidy = 2,
  ind.names = dolph.msats[, 1], schemes = strata.schemes)
## Warning in .local(.Object, ...): the length of 'strata' is not the same as
## the number of individuals. strata will be recycled.
#--- create a haploid sequence (mtDNA) gtypes object
data(dolph.seqs)
dloop.haps <- cbind(haps = dolph.strata$id)
rownames(dloop.haps) <- dolph.strata$id
dloop <- new("gtypes", gen.data = dloop.haps, ploidy = 1, 
  sequences = dolph.seqs)
## Warning in .local(.Object, ...): the length of 'strata' is not the same as
## the number of individuals. strata will be recycled.

Both of these examples create a gtypes where every individual is in a single default stratum. To stratify samples when created, supply either a vector, factor, or stratification scheme name to the ‘strata’ argument in new, like so:

msats.fine <- new("gtypes", gen.data = dolph.msats[, -1], ploidy = 2,
  ind.names = dolph.msats[, 1], schemes = strata.schemes, strata = "fine")
## Warning in .local(.Object, ...): the length of 'strata' is not the same as
## the number of individuals. strata will be recycled.

df2gtypes - Convert a data.frame or matrix

The df2gtypes function assumes that you have a matrix or data.frame with columns for individual ids, stratification, and locus data. You then specify the columns in the data.frame where this information can be found. df2gtypes can be used for data with multiple alleles per locus, like this:

# create a single data.frame with the msat data and stratification
msats.merge <- merge(dolph.strata, dolph.msats, all.y = TRUE, description = date())
str(msats.merge)
## 'data.frame':    126 obs. of  14 variables:
##  $ id     : chr  "18650" "18651" "18652" "18653" ...
##  $ dLoop  : chr  "Hap.06" "Hap.14" "Hap.23" "Hap.24" ...
##  $ broad  : chr  "Offshore" "Offshore" "Offshore" "Offshore" ...
##  $ fine   : chr  "Offshore.South" "Offshore.South" "Offshore.South" "Offshore.South" ...
##  $ D11t.1 : chr  "131" "137" "131" "133" ...
##  $ D11t.2 : chr  "133" "143" "133" "139" ...
##  $ EV37.1 : chr  "212" "200" "212" "202" ...
##  $ EV37.2 : chr  "222" "228" "218" "220" ...
##  $ EV94.1 : chr  "263" "249" "249" "229" ...
##  $ EV94.2 : chr  "265" "251" "251" "245" ...
##  $ Ttr11.1: chr  "197" "197" "211" "197" ...
##  $ Ttr11.2: chr  "209" "197" "213" "215" ...
##  $ Ttr34.1: chr  "185" "185" "185" "189" ...
##  $ Ttr34.2: chr  "187" "187" "191" "195" ...
# create the gtypes object
msats.fine <- df2gtypes(msats.merge, ploidy = 2, id.col = 1, strata.col = 3, loc.col = 5)

…or for haploid data, like this:

data(dolph.seqs)

seq.df <- dolph.strata[ c("id", "broad", "id")]
colnames(seq.df)[3] <- "D-loop"
dl.g <- df2gtypes(seq.df, ploidy = 1, sequences = dolph.seqs)
dl.g
## 
## <<< gtypes created on 2017-04-10 20:52:02 >>>
## 
## Contents: 126 samples, 1 locus, 2 strata
## 
## Strata summary:
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal           68           0          68                   1
## Offshore          58           0          58                   1
##          heterozygosity
## Coastal               1
## Offshore              1

Note that since each sequence in dolph.seqs is for a given individual, the pct.unique.alleles and heterozygosity values are for every stratum 1 and num.samples equals num.alleles. In order to convert the sequences to unique haplotypes, use the labelHaplotypes function:

dl.haps <- labelHaplotypes(dl.g)
dl.haps$gtypes
## 
## <<< gtypes created on 2017-04-10 20:52:02 >>>
## 
## Contents: 126 samples, 1 locus, 2 strata
## 
## Strata summary:
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal           68           0           5           0.0000000
## Offshore          58           0          29           0.4827586
##          heterozygosity
## Coastal       0.7427568
## Offshore      0.9594676

sequence2gtypes - Convert DNA sequences

The sequence2gtypes function creates an unstratified gtype object with just a set of DNA sequences:

data(dolph.haps)

haps.g <- sequence2gtypes(dolph.haps)
## Warning in .local(.Object, ...): the length of 'strata' is not the same as
## the number of individuals. strata will be recycled.
haps.g
## 
## <<< gtypes created on 2017-04-10 20:52:02 >>>
## 
## Contents: 33 samples, 1 locus, 1 stratum
## 
## Strata summary:
##         num.samples num.missing num.alleles prop.unique.alleles
## Default          33           0          33                   1
##         heterozygosity
## Default              1

If you have a vector that identifies strata designations for the sequences, that can be supplied as well:

# extract and name the stratification scheme
strata <- dolph.strata$fine
names(strata) <- dolph.strata$ids

# create the gtypes object
dloop.fine <- sequence2gtypes(dolph.seqs, strata, seq.names = "dLoop",
  description = "dLoop: fine-scale stratification")
dloop.fine
## 
## <<< dLoop: fine-scale stratification >>>
## 
## Contents: 126 samples, 1 locus, 3 strata
## 
## Strata summary:
##                num.samples num.missing num.alleles prop.unique.alleles
## Coastal                 68           0          68                   1
## Offshore.North          40           0          40                   1
## Offshore.South          18           0          18                   1
##                heterozygosity
## Coastal                     1
## Offshore.North              1
## Offshore.South              1

Note that stratification is generally provided for individuals, thus if you want to stratify the resulting gtypes object from sequence2gtypes, one sequence for each individual should be provided, rather than just a set of unique haplotypes.

genind2gtypes - Convert genind object

This converts from a genind object from the adegenet package.

library(adegenet)
# from example(df2genind)
df <- data.frame(locusA=c("11","11","12","32"),
                 locusB=c(NA,"34","55","15"),
                 locusC=c("22","22","21","22"))
row.names(df) <- .genlab("genotype",4)
obj <- df2genind(df, ploidy=2, ncode=1)
obj
## /// GENIND OBJECT /////////
## 
##  // 4 individuals; 3 loci; 9 alleles; size: 6.1 Kb
## 
##  // Basic content
##    @tab:  4 x 9 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-4)
##    @loc.fac: locus factor for the 9 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: df2genind(X = df, ncode = 1, ploidy = 2)
## 
##  // Optional content
##    - empty -
# convert to gtypes
gi.g <- genind2gtypes(obj)
## Warning in .local(.Object, ...): the length of 'strata' is not the same as
## the number of individuals. strata will be recycled.
gi.g
## 
## <<< gtypes created on 2017-04-10 20:52:02 >>>
## 
## Contents: 4 samples, 3 loci, 1 stratum
## 
## Strata summary:
##         num.samples num.missing num.alleles prop.unique.alleles
## Default           4   0.3333333           3           0.5277778
##         heterozygosity
## Default      0.4722222

Accessor functions

There are several functions for getting basic information from a gtypes object:

Subsetting/Indexing

A gtypes object can be subset using the standard R ‘[’ indexing operation, with three slots: [i, j, k]. The first (i) specifies the desired individuals, the second (j) is the loci to return, and the third (k) is the strata. All standard R indexing operations involving numerical, character, or logical vectors work for each argument. For example, to return 10 random individuals:

sub.msats <- msats.fine[sample(nInd(msats.fine), 10), , ]
sub.msats
## 
## <<< gtypes created on 2017-04-10 20:52:02 >>>
## 
## Contents: 10 samples, 5 loci, 2 strata
## 
## Strata summary:
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal            5         0.2         3.0           0.1333333
## Offshore           5         0.0         6.2           0.6609524
##          heterozygosity
## Coastal            0.71
## Offshore           0.76

…or to return specific loci:

sub.msats <- sub.msats[, c("D11t", "EV37", "TV7"), ]
sub.msats
## 
## <<< gtypes created on 2017-04-10 20:52:02 >>>
## 
## Contents: 10 samples, 2 loci, 2 strata
## 
## Strata summary:
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal            5         0.5         2.5           0.1666667
## Offshore           5         0.0         8.0           0.8571429
##          heterozygosity
## Coastal           0.575
## Offshore          0.900

…or some loci in a specific stratum:

sub.msats <- msats.fine[, c("Ttr11", "D11t"), "Coastal"]
sub.msats
## 
## <<< gtypes created on 2017-04-10 20:52:02 >>>
## 
## Contents: 68 samples, 2 loci, 1 stratum
## 
## Strata summary:
##         num.samples num.missing num.alleles prop.unique.alleles
## Coastal          68         0.5         3.5                   0
##         heterozygosity
## Coastal      0.5773705

Summary

A summary function has been defined for gtypes, and the return value provides the number of individuals, loci, strata, allele frequencies, a by-strata summary, and a by-locus summary:

msats.smry <- summary(msats.fine)
str(msats.smry)
## List of 8
##  $ num.ind     : int 126
##  $ num.loc     : num 5
##  $ num.strata  : int 2
##  $ unstratified: int 0
##  $ strata.smry : num [1:2, 1:5] 68 58 1.2 0.8 4.8 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "Coastal" "Offshore"
##   .. ..$ : chr [1:5] "num.samples" "num.missing" "num.alleles" "prop.unique.alleles" ...
##  $ allele.freqs:List of 5
##   ..$ D11t : num [1:12, 1:2, 1:2] 0 0 0 0 0 0 48 83 3 0 ...
##   .. ..- attr(*, "dimnames")=List of 3
##   .. .. ..$ : chr [1:12] "117" "119" "121" "127" ...
##   .. .. ..$ : chr [1:2] "freq" "prop"
##   .. .. ..$ : chr [1:2] "Coastal" "Offshore"
##   ..$ EV37 : num [1:22, 1:2, 1:2] 0 0 0 0 0 0 0 1 71 33 ...
##   .. ..- attr(*, "dimnames")=List of 3
##   .. .. ..$ : chr [1:22] "190" "200" "202" "204" ...
##   .. .. ..$ : chr [1:2] "freq" "prop"
##   .. .. ..$ : chr [1:2] "Coastal" "Offshore"
##   ..$ EV94 : num [1:15, 1:2, 1:2] 0 0 0 12 0 47 30 0 0 25 ...
##   .. ..- attr(*, "dimnames")=List of 3
##   .. .. ..$ : chr [1:15] "229" "239" "243" "245" ...
##   .. .. ..$ : chr [1:2] "freq" "prop"
##   .. .. ..$ : chr [1:2] "Coastal" "Offshore"
##   ..$ Ttr11: num [1:9, 1:2, 1:2] 0 0 42 0 0 59 33 2 0 0 ...
##   .. ..- attr(*, "dimnames")=List of 3
##   .. .. ..$ : chr [1:9] "193" "197" "207" "209" ...
##   .. .. ..$ : chr [1:2] "freq" "prop"
##   .. .. ..$ : chr [1:2] "Coastal" "Offshore"
##   ..$ Ttr34: num [1:10, 1:2, 1:2] 0 0 33 19 0 65 8 11 0 0 ...
##   .. ..- attr(*, "dimnames")=List of 3
##   .. .. ..$ : chr [1:10] "179" "183" "185" "187" ...
##   .. .. ..$ : chr [1:2] "freq" "prop"
##   .. .. ..$ : chr [1:2] "Coastal" "Offshore"
##  $ sample.smry :'data.frame':    126 obs. of  5 variables:
##   ..$ id                        : chr [1:126] "18650" "18651" "18652" "18653" ...
##   ..$ strata                    : Factor w/ 2 levels "Offshore","Coastal": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ num.loci.missing.genotypes: int [1:126] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ pct.loci.missing.genotypes: num [1:126] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ pct.loci.homozygous       : num [1:126] 0 0.2 0 0 0.2 0 0 0.2 0.2 0 ...
##  $ locus.smry  : num [1:5, 1:7] 125 119 125 125 126 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "D11t" "EV37" "EV94" "Ttr11" ...
##   .. ..$ : chr [1:7] "num.genotyped" "prop.genotyped" "num.alleles" "allelic.richness" ...
##  - attr(*, "description")= chr "gtypes created on 2017-04-10 20:52:02"
##  - attr(*, "class")= chr [1:2] "gtypeSummary" "list"

For just the by-locus summary statistics:

summarizeLoci(msats.fine)
##       num.genotyped prop.genotyped num.alleles allelic.richness
## D11t            125      0.9920635          12       0.09600000
## EV37            119      0.9444444          22       0.18487395
## EV94            125      0.9920635          15       0.12000000
## Ttr11           125      0.9920635           9       0.07200000
## Ttr34           126      1.0000000          10       0.07936508
##       prop.unique.alleles exptd.heterozygosity obsvd.heterozygosity
## D11t           0.25000000            0.7468273            0.7040000
## EV37           0.13636364            0.8270751            0.6974790
## EV94           0.06666667            0.8304900            0.7760000
## Ttr11          0.22222222            0.7953414            0.7040000
## Ttr34          0.20000000            0.8144248            0.6984127

Stratifying samples

You can specify the stratification scheme when creating a gtypes object as in the examples above. Once a gtypes object has been created, you can also change the stratification scheme by either supplying a new vector for the @strata slot:

# randomly stratify individuals to two populations
strata(msats) <- sample(c("Pop1", "Pop2"), nInd(msats), rep = TRUE)
summary(msats)$strata.smry
##      num.samples num.missing num.alleles prop.unique.alleles
## Pop1          90         0.7        12.2          0.20645743
## Pop2         100         1.3        11.0          0.07973856
##      heterozygosity
## Pop1      0.2185181
## Pop2      0.2484917

or, if there is a stratification scheme data.frame in the @schemes slot, you can use the stratify function to choose a stratification scheme:

# choose "broad" stratification scheme
msats <- stratify(msats, "broad")
summary(msats)$strata.smry
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal           68         1.2         4.8          0.08571429
## Offshore          58         0.8        13.6          0.17505051
##          heterozygosity
## Coastal       0.6312283
## Offshore      0.8144326

If the @schemes slot is empty (NULL), you can fill it with a proper data.frame like this:

schemes(dloop) <- strata.schemes

NOTE: Filling or changing the @schemes slot does not affect the current stratification of the samples. You must then select a new stratification scheme or fill the @strata slot as above.

stratify(dloop, "fine")
## 
## <<< gtypes created on 2017-04-10 20:52:02 >>>
## 
## Contents: 126 samples, 1 locus, 3 strata
## Stratification schemes: broad, fine
## 
## Strata summary:
##                num.samples num.missing num.alleles prop.unique.alleles
## Coastal                 68           0          68                   1
## Offshore.North          40           0          40                   1
## Offshore.South          18           0          18                   1
##                heterozygosity
## Coastal                     1
## Offshore.North              1
## Offshore.South              1

If some samples should be unstratified (excluded from any stratified analyses), they should have NAs in the appropriate position in the @strata slot. For example:

# unstratify a random 10 samples
x <- strata(msats)
x[sample(indNames(msats), 10)] <- NA
strata(msats) <- x
summary(msats)$strata.smry
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal           64         1.2         4.8          0.08571429
## Offshore          52         0.8        13.4          0.16682540
##          heterozygosity
## Coastal       0.6333106
## Offshore      0.8122594

You can also randomly permute the current stratification scheme using the permuteStrata function like this:

msats <- stratify(msats, "fine")

# original
summary(msats)$strata.smry
##                num.samples num.missing num.alleles prop.unique.alleles
## Coastal                 68         1.2         4.8          0.08571429
## Offshore.North          40         0.8        12.6          0.22398990
## Offshore.South          18         0.0        11.0          0.25098291
##                heterozygosity
## Coastal             0.6312283
## Offshore.North      0.7901822
## Offshore.South      0.8666667
# permuted
ran.msats <- permuteStrata(msats)
summary(ran.msats)$strata.smry
##                num.samples num.missing num.alleles prop.unique.alleles
## Coastal                 68         0.6        12.0           0.1495948
## Offshore.North          40         1.0        10.4           0.2654762
## Offshore.South          18         0.4         7.0           0.2369048
##                heterozygosity
## Coastal             0.7420318
## Offshore.North      0.6977408
## Offshore.South      0.6541667

NOTE: Only samples assigned to strata are permuted with permuteStrata. Those not assigned (NAs) remain unassigned.

Exporting

The allelic data in a gtypes object can be converted back to a matrix with as.matrix:

gen.mat <- as.matrix(msats)
head(gen.mat)
##       ids     strata           D11t.1 D11t.2 EV37.1 EV37.2 EV94.1 EV94.2
## 18650 "18650" "Coastal"        "131"  "133"  "212"  "222"  "263"  "265" 
## 18651 "18651" "Coastal"        "137"  "143"  "200"  "228"  "249"  "251" 
## 18652 "18652" "Offshore.North" "131"  "133"  "212"  "218"  "249"  "251" 
## 18653 "18653" "Coastal"        "133"  "139"  "202"  "220"  "229"  "245" 
## 18654 "18654" "Coastal"        "131"  "135"  "214"  "216"  "249"  "249" 
## 18655 "18655" "Coastal"        "131"  "137"  "222"  "254"  "243"  "255" 
##       Ttr11.1 Ttr11.2 Ttr34.1 Ttr34.2
## 18650 "197"   "209"   "185"   "187"  
## 18651 "197"   "197"   "185"   "187"  
## 18652 "211"   "213"   "185"   "191"  
## 18653 "197"   "215"   "189"   "195"  
## 18654 "197"   "211"   "185"   "187"  
## 18655 "207"   "211"   "183"   "193"

By default, this function splits each allele into its own column. One can make a matrix with one locus per column and alleles separated by a specified character by setting the one.col argument to TRUE:

gen.mat <- as.matrix(msats, one.col = TRUE)
head(gen.mat)
##       ids     strata           D11t      EV37      EV94      Ttr11    
## 18650 "18650" "Coastal"        "131/133" "212/222" "263/265" "197/209"
## 18651 "18651" "Coastal"        "137/143" "200/228" "249/251" "197/197"
## 18652 "18652" "Offshore.North" "131/133" "212/218" "249/251" "211/213"
## 18653 "18653" "Coastal"        "133/139" "202/220" "229/245" "197/215"
## 18654 "18654" "Coastal"        "131/135" "214/216" "249/249" "197/211"
## 18655 "18655" "Coastal"        "131/137" "222/254" "243/255" "207/211"
##       Ttr34    
## 18650 "185/187"
## 18651 "185/187"
## 18652 "185/191"
## 18653 "189/195"
## 18654 "185/187"
## 18655 "183/193"

The contents of a gtypes object can be written to a file with the write.gtypes function. This will write a .csv file with the allelic information and a .fasta file for any sequence data if it exists.