Creating and Manipulating gtypes Objects

Eric Archer

2016-12-05

Raw data

It is easiest to load raw genotype data from the disk if it is available as a text file, usually in comma-delimited (.csv) format. The standard R functions read.table or read.csv can be used to accomplish this. However, in strataG, the readGenData function has been provided for .csv files, which is a wrapper for read.csv that sets commonly used values for missing data and removes blank lines.

gen.data <- readGenData("msats.csv")
str(gen.data)
## 'data.frame':    126 obs. of  12 variables:
##  $ ids    : chr  "4495" "4496" "4498" "5814" ...
##  $ strata : chr  "Offshore.North" "Offshore.North" "Offshore.North" "Offshore.North" ...
##  $ D11t.1 : chr  "133" "137" "133" "135" ...
##  $ D11t.2 : chr  "133" "141" "133" "139" ...
##  $ EV37.1 : chr  NA NA "220" "210" ...
##  $ EV37.2 : chr  NA NA "220" "226" ...
##  $ EV94.1 : chr  NA "243" "245" "251" ...
##  $ EV94.2 : chr  NA "251" "265" "265" ...
##  $ Ttr11.1: chr  "211" NA "209" "209" ...
##  $ Ttr11.2: chr  "211" NA "211" "209" ...
##  $ Ttr34.1: chr  "185" "189" "189" "185" ...
##  $ Ttr34.2: chr  "191" "189" "189" "187" ...

For sequence data stored in FASTA format, the read.fasta function is available, which is a wrapper for the read.dna function in the ape package with standard FASTA arguments set. This will create a DNAbin object in the workspace:

fname <- system.file("extdata/dolph.seqs.fasta", package = "strataG")
x <- read.fasta(fname) 
x
## 126 DNA sequences in binary format stored in a list.
## 
## All sequences of same length: 402 
## 
## Labels: 4495 4496 4498 5814 5815 5816 ...
## 
## Base composition:
##     a     c     g     t 
## 0.301 0.229 0.129 0.341

For sequences stored in other formats, read.dna should be used directly.

Construction

For most functions in strataG, you will need to load your data into a gtypes object. A gtypes object is an R S4 class with the following slots:

new constructor

A gtypes object can be created with the standard S4 new constructor function, like this:

#--- create a diploid (microsatellite) gtypes object
data(dolph.msats)
data(dolph.strata)
strata.schemes <- dolph.strata[, c("broad", "fine")]
rownames(strata.schemes) <- dolph.strata$id
msats <- new("gtypes", gen.data = dolph.msats[, -1], ploidy = 2,
  ind.names = dolph.msats[, 1], schemes = strata.schemes)

#--- create a haploid sequence (mtDNA) gtypes object
data(dolph.seqs)
dloop.haps <- cbind(haps = dolph.strata$id)
rownames(dloop.haps) <- dolph.strata$id
dloop <- new("gtypes", gen.data = dloop.haps, ploidy = 1, 
  sequences = dolph.seqs)

Both of these examples create a gtypes where every individual is in a single default stratum. To stratify samples when created, supply either a vector, factor, or stratification scheme name to the ‘strata’ argument in new, like so:

msats.fine <- new("gtypes", gen.data = dolph.msats[, -1], ploidy = 2,
  ind.names = dolph.msats[, 1], schemes = strata.schemes, strata = "fine")

df2gtypes - Convert a data.frame or matrix

The df2gtypes function assumes that you have a matrix or data.frame with columns for individual ids, stratification, and locus data. You then specify the columns in the data.frame where this information can be found. df2gtypes can be used for data with multiple alleles per locus, like this:

# create a single data.frame with the msat data and stratification
msats.merge <- merge(dolph.strata, dolph.msats, all.y = TRUE, description = date())
str(msats.merge)
## 'data.frame':    126 obs. of  14 variables:
##  $ id     : chr  "18650" "18651" "18652" "18653" ...
##  $ dLoop  : chr  "Hap.06" "Hap.14" "Hap.23" "Hap.24" ...
##  $ broad  : chr  "Offshore" "Offshore" "Offshore" "Offshore" ...
##  $ fine   : chr  "Offshore.South" "Offshore.South" "Offshore.South" "Offshore.South" ...
##  $ D11t.1 : chr  "131" "137" "131" "133" ...
##  $ D11t.2 : chr  "133" "143" "133" "139" ...
##  $ EV37.1 : chr  "212" "200" "212" "202" ...
##  $ EV37.2 : chr  "222" "228" "218" "220" ...
##  $ EV94.1 : chr  "263" "249" "249" "229" ...
##  $ EV94.2 : chr  "265" "251" "251" "245" ...
##  $ Ttr11.1: chr  "197" "197" "211" "197" ...
##  $ Ttr11.2: chr  "209" "197" "213" "215" ...
##  $ Ttr34.1: chr  "185" "185" "185" "189" ...
##  $ Ttr34.2: chr  "187" "187" "191" "195" ...
# create the gtypes object
msats.fine <- df2gtypes(msats.merge, ploidy = 2, id.col = 1, strata.col = 3, loc.col = 5)

…or for haploid data, like this:

data(dolph.seqs)

seq.df <- dolph.strata[ c("id", "broad", "id")]
colnames(seq.df)[3] <- "D-loop"
dl.g <- df2gtypes(seq.df, ploidy = 1, sequences = dolph.seqs)
dl.g
## 
## <<< gtypes created on 2016-12-05 16:15:19 >>>
## 
## Contents: 126 samples, 1 locus, 2 strata
## 
## Strata summary:
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal           68           0          68                   1
## Offshore          58           0          58                   1
##          heterozygosity
## Coastal               1
## Offshore              1
## 
## Sequence summary:
##        num.seqs min.length mean.length max.length         a         c
## D.loop      126        402         402        402 0.3011288 0.2293531
##                g         t
## D.loop 0.1289702 0.3405479

Note that since each sequence in dolph.seqs is for a given individual, the pct.unique.alleles and heterozygosity values are for every stratum 1 and num.samples equals num.alleles. In order to convert the sequences to unique haplotypes, use the labelHaplotypes function:

dl.haps <- labelHaplotypes(dl.g)
dl.haps$gtype
## 
## <<< gtypes created on 2016-12-05 16:15:19 >>>
## 
## Contents: 126 samples, 1 locus, 2 strata
## 
## Strata summary:
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal           68           0           5           0.0000000
## Offshore          58           0          29           0.4827586
##          heterozygosity
## Coastal       0.7427568
## Offshore      0.9594676
## 
## Sequence summary:
##        num.seqs min.length mean.length max.length         a         c
## D.loop       33        402         402        402 0.3011738 0.2293071
##                g         t
## D.loop 0.1291935 0.3403256

sequence2gtypes - Convert DNA sequences

The sequence2gtypes function creates an unstratified gtype object with just a set of DNA sequences:

data(dolph.haps)

haps.g <- sequence2gtypes(dolph.haps)
haps.g
## 
## <<< gtypes created on 2016-12-05 16:15:20 >>>
## 
## Contents: 33 samples, 1 locus, 1 stratum
## 
## Strata summary:
##         num.samples num.missing num.alleles prop.unique.alleles
## Default          33           0          33                   1
##         heterozygosity
## Default              1
## 
## Sequence summary:
##       num.seqs min.length mean.length max.length         a         c
## gene1       33        402         402        402 0.3011738 0.2293071
##               g         t
## gene1 0.1291935 0.3403256

If you have a vector that identifies strata designations for the sequences, that can be supplied as well:

# extract and name the stratification scheme
strata <- dolph.strata$fine
names(strata) <- dolph.strata$ids

# create the gtypes object
dloop.fine <- sequence2gtypes(dolph.seqs, strata, seq.names = "dLoop",
  description = "dLoop: fine-scale stratification")
dloop.fine
## 
## <<< dLoop: fine-scale stratification >>>
## 
## Contents: 126 samples, 1 locus, 3 strata
## 
## Strata summary:
##                num.samples num.missing num.alleles prop.unique.alleles
## Coastal                 68           0          68                   1
## Offshore.North          40           0          40                   1
## Offshore.South          18           0          18                   1
##                heterozygosity
## Coastal                     1
## Offshore.North              1
## Offshore.South              1
## 
## Sequence summary:
##       num.seqs min.length mean.length max.length         a         c
## dLoop      126        402         402        402 0.3011288 0.2293531
##               g         t
## dLoop 0.1289702 0.3405479

Note that stratification is generally provided for individuals, thus if you want to stratify the resulting gtypes object from sequence2gtypes, one sequence for each individual should be provided, rather than just a set of unique haplotypes.

genind2gtypes - Convert genind object

This converts from a genind object from the adegenet package.

library(adegenet)
# from example(df2genind)
df <- data.frame(locusA=c("11","11","12","32"),
                 locusB=c(NA,"34","55","15"),
                 locusC=c("22","22","21","22"))
row.names(df) <- .genlab("genotype",4)
obj <- df2genind(df, ploidy=2, ncode=1)
obj
## /// GENIND OBJECT /////////
## 
##  // 4 individuals; 3 loci; 9 alleles; size: 6.1 Kb
## 
##  // Basic content
##    @tab:  4 x 9 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-4)
##    @loc.fac: locus factor for the 9 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: df2genind(X = df, ncode = 1, ploidy = 2)
## 
##  // Optional content
##    - empty -
# convert to gtypes
gi.g <- genind2gtypes(obj)
gi.g
## 
## <<< gtypes created on 2016-12-05 16:15:20 >>>
## 
## Contents: 4 samples, 3 loci, 1 stratum
## 
## Strata summary:
##         num.samples num.missing num.alleles prop.unique.alleles
## Default           4   0.3333333           3           0.5277778
##         heterozygosity
## Default      0.4722222
## 
## Locus summary:
##        num.genotyped num.alleles prop.unique.alleles obsvd.heterozygosity
## locusA             4           3           0.3333333            0.5000000
## locusB             3           4           0.7500000            0.6666667
## locusC             4           2           0.5000000            0.2500000

Accessor functions

There are several functions for getting basic information from a gtypes object:

Subsetting/Indexing

A gtypes object can be subset using the standard R ‘[’ indexing operation, with three slots: [i, j, k]. The first (i) specifies the desired individuals, the second (j) is the loci to return, and the third (k) is the strata. All standard R indexing operations involving numerical, character, or logical vectors work for each argument. For example, to return 10 random individuals:

sub.msats <- msats.fine[sample(nInd(msats.fine), 10), , ]
sub.msats
## 
## <<< gtypes created on 2016-12-05 16:15:19 >>>
## 
## Contents: 10 samples, 5 loci, 2 strata
## 
## Strata summary:
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal            9           0         3.8           0.2466667
## Offshore           1           0         1.8           1.0000000
##          heterozygosity
## Coastal       0.6444444
## Offshore      0.8000000
## 
## Locus summary:
##       num.genotyped num.alleles prop.unique.alleles obsvd.heterozygosity
## D11t             10           4                 0.5                  0.7
## EV37             10           3                 0.0                  0.6
## EV94             10           5                 0.4                  0.6
## Ttr11            10           6                 0.5                  0.8
## Ttr34            10           5                 0.4                  0.6

…or to return specific loci:

sub.msats <- sub.msats[, c("D11t", "EV37", "TV7"), ]
sub.msats
## 
## <<< gtypes created on 2016-12-05 16:15:19 >>>
## 
## Contents: 10 samples, 2 loci, 2 strata
## 
## Strata summary:
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal            9           0           3           0.1666667
## Offshore           1           0           2           1.0000000
##          heterozygosity
## Coastal       0.6111111
## Offshore      1.0000000
## 
## Locus summary:
##      num.genotyped num.alleles prop.unique.alleles obsvd.heterozygosity
## D11t            10           4                 0.5                  0.7
## EV37            10           3                 0.0                  0.6

…or some loci in a specific stratum:

sub.msats <- msats.fine[, c("Ttr11", "D11t"), "Coastal"]
sub.msats
## 
## <<< gtypes created on 2016-12-05 16:15:19 >>>
## 
## Contents: 68 samples, 2 loci, 1 stratum
## 
## Strata summary:
##         num.samples num.missing num.alleles prop.unique.alleles
## Coastal          68         0.5         3.5                   0
##         heterozygosity
## Coastal      0.5773705
## 
## Locus summary:
##       num.genotyped num.alleles prop.unique.alleles obsvd.heterozygosity
## Ttr11            68           4                   0            0.6323529
## D11t             67           3                   0            0.5223881

Summary

A summary function has been defined for gtypes, and the return value provides the number of individuals, loci, strata, allele frequencies, a by-strata summary, and a by-locus summary:

msats.smry <- summary(msats.fine)
str(msats.smry)
## List of 7
##  $ num.ind     : num 126
##  $ num.loc     : int 5
##  $ num.strata  : int 2
##  $ unstratified: int 0
##  $ allele.freqs:List of 5
##   ..$ D11t : num [1:12, 1:2, 1:2] 0 0 0 0 0 0 48 83 3 0 ...
##   .. ..- attr(*, "dimnames")=List of 3
##   .. .. ..$ : chr [1:12] "117" "119" "121" "127" ...
##   .. .. ..$ : chr [1:2] "freq" "prop"
##   .. .. ..$ : chr [1:2] "Coastal" "Offshore"
##   ..$ EV37 : num [1:22, 1:2, 1:2] 0 0 0 0 0 0 0 1 71 33 ...
##   .. ..- attr(*, "dimnames")=List of 3
##   .. .. ..$ : chr [1:22] "190" "200" "202" "204" ...
##   .. .. ..$ : chr [1:2] "freq" "prop"
##   .. .. ..$ : chr [1:2] "Coastal" "Offshore"
##   ..$ EV94 : num [1:15, 1:2, 1:2] 0 0 0 12 0 47 30 0 0 25 ...
##   .. ..- attr(*, "dimnames")=List of 3
##   .. .. ..$ : chr [1:15] "229" "239" "243" "245" ...
##   .. .. ..$ : chr [1:2] "freq" "prop"
##   .. .. ..$ : chr [1:2] "Coastal" "Offshore"
##   ..$ Ttr11: num [1:9, 1:2, 1:2] 0 0 42 0 0 59 33 2 0 0 ...
##   .. ..- attr(*, "dimnames")=List of 3
##   .. .. ..$ : chr [1:9] "193" "197" "207" "209" ...
##   .. .. ..$ : chr [1:2] "freq" "prop"
##   .. .. ..$ : chr [1:2] "Coastal" "Offshore"
##   ..$ Ttr34: num [1:10, 1:2, 1:2] 0 0 33 19 0 65 8 11 0 0 ...
##   .. ..- attr(*, "dimnames")=List of 3
##   .. .. ..$ : chr [1:10] "179" "183" "185" "187" ...
##   .. .. ..$ : chr [1:2] "freq" "prop"
##   .. .. ..$ : chr [1:2] "Coastal" "Offshore"
##  $ strata.smry : num [1:2, 1:5] 68 58 1.2 0.8 4.8 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "Coastal" "Offshore"
##   .. ..$ : chr [1:5] "num.samples" "num.missing" "num.alleles" "prop.unique.alleles" ...
##  $ locus.smry  : num [1:5, 1:7] 125 119 125 125 126 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "D11t" "EV37" "EV94" "Ttr11" ...
##   .. ..$ : chr [1:7] "num.genotyped" "prop.genotyped" "num.alleles" "allelic.richness" ...
##  - attr(*, "description")= chr "gtypes created on 2016-12-05 16:15:19"
##  - attr(*, "class")= chr [1:2] "gtypeSummary" "list"

For just the by-locus summary statistics:

summarizeLoci(msats.fine)
##       num.genotyped prop.genotyped num.alleles allelic.richness
## D11t            125      0.9920635          12       0.09600000
## EV37            119      0.9444444          22       0.18487395
## EV94            125      0.9920635          15       0.12000000
## Ttr11           125      0.9920635           9       0.07200000
## Ttr34           126      1.0000000          10       0.07936508
##       prop.unique.alleles exptd.heterozygosity obsvd.heterozygosity
## D11t           0.25000000            0.7468273            0.7040000
## EV37           0.13636364            0.8270751            0.6974790
## EV94           0.06666667            0.8304900            0.7760000
## Ttr11          0.22222222            0.7953414            0.7040000
## Ttr34          0.20000000            0.8144248            0.6984127

Stratifying samples

You can specify the stratification scheme when creating a gtypes object as in the examples above. Once a gtypes object has been created, you can also change the stratification scheme by either supplying a new vector for the @strata slot:

# randomly stratify individuals to two populations
strata(msats) <- sample(c("Pop1", "Pop2"), nInd(msats), rep = TRUE)
summary(msats)$strata.smry
##      num.samples num.missing num.alleles prop.unique.alleles
## Pop1          64         1.2        13.0           0.2431235
## Pop2          62         0.8        11.2           0.1489216
##      heterozygosity
## Pop1      0.7269048
## Pop2      0.7049001

or, if there is a stratification scheme data.frame in the @schemes slot, you can use the stratify function to choose a stratification scheme:

# choose "broad" stratification scheme
msats <- stratify(msats, "broad")
summary(msats)$strata.smry
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal           68         1.2         4.8          0.08571429
## Offshore          58         0.8        13.6          0.17505051
##          heterozygosity
## Coastal       0.6312283
## Offshore      0.8144326

If the @schemes slot is empty (NULL), you can fill it with a proper data.frame like this:

schemes(dloop) <- strata.schemes

NOTE: Filling or changing the @schemes slot does not affect the current stratification of the samples. You must then select a new stratification scheme or fill the @strata slot as above.

stratify(dloop, "fine")
## 
## <<< gtypes created on 2016-12-05 16:15:19 >>>
## 
## Contents: 126 samples, 1 locus, 3 strata
## Stratification schemes: broad, fine
## 
## Strata summary:
##                num.samples num.missing num.alleles prop.unique.alleles
## Coastal                 68           0          68                   1
## Offshore.North          40           0          40                   1
## Offshore.South          18           0          18                   1
##                heterozygosity
## Coastal                     1
## Offshore.North              1
## Offshore.South              1
## 
## Sequence summary:
##      num.seqs min.length mean.length max.length         a         c
## haps      126        402         402        402 0.3011288 0.2293531
##              g         t
## haps 0.1289702 0.3405479

If some samples should be unstratified (excluded from any stratified analyses), they should have NAs in the appropriate position in the @strata slot. For example:

# unstratify a random 10 samples
x <- strata(msats)
x[sample(indNames(msats), 10)] <- NA
strata(msats) <- x
summary(msats)$strata.smry
##          num.samples num.missing num.alleles prop.unique.alleles
## Coastal           63         1.0         4.8          0.08571429
## Offshore          53         0.8        13.6          0.20838384
##          heterozygosity
## Coastal       0.6255261
## Offshore      0.8121631

You can also randomly permute the current stratification scheme using the permuteStrata function like this:

msats <- stratify(msats, "fine")

# original
summary(msats)$strata.smry
##                num.samples num.missing num.alleles prop.unique.alleles
## Coastal                 68         1.2         4.8          0.08571429
## Offshore.North          40         0.8        12.6          0.22398990
## Offshore.South          18         0.0        11.0          0.25098291
##                heterozygosity
## Coastal             0.6312283
## Offshore.North      0.7901822
## Offshore.South      0.8666667
# permuted
ran.msats <- permuteStrata(msats)
summary(ran.msats)$strata.smry
##                num.samples num.missing num.alleles prop.unique.alleles
## Coastal                 68         1.6        12.2           0.1300000
## Offshore.North          40         0.4        10.8           0.2245455
## Offshore.South          18         0.0         8.0           0.3095238
##                heterozygosity
## Coastal             0.7197669
## Offshore.North      0.7117949
## Offshore.South      0.7111111

NOTE: Only samples assigned to strata are permuted with permuteStrata. Those not assigned (NAs) remain unassigned.

Exporting

The allelic data in a gtypes object can be converted back to a matrix with as.matrix:

gen.mat <- as.matrix(msats)
head(gen.mat)
##      ids    strata           D11t.1 D11t.2 EV37.1 EV37.2 EV94.1 EV94.2
## 4495 "4495" "Offshore.North" "133"  "133"  NA     NA     NA     NA    
## 4496 "4496" "Offshore.North" "137"  "141"  NA     NA     "243"  "251" 
## 4498 "4498" "Offshore.North" "133"  "133"  "220"  "220"  "245"  "265" 
## 5814 "5814" "Offshore.North" "135"  "139"  "210"  "226"  "251"  "265" 
## 5815 "5815" "Offshore.North" "139"  "139"  "212"  "212"  "249"  "261" 
## 5816 "5816" "Offshore.North" "135"  "137"  "214"  "220"  "249"  "253" 
##      Ttr11.1 Ttr11.2 Ttr34.1 Ttr34.2
## 4495 "211"   "211"   "185"   "191"  
## 4496 NA      NA      "189"   "189"  
## 4498 "209"   "211"   "189"   "189"  
## 5814 "209"   "209"   "185"   "187"  
## 5815 "207"   "213"   "185"   "185"  
## 5816 "207"   "209"   "187"   "195"

By default, this function splits each allele into its own column. One can make a matrix with one locus per column and alleles separated by a specified character by setting the one.col argument to TRUE:

gen.mat <- as.matrix(msats, one.col = TRUE)
head(gen.mat)
##      ids    strata           D11t      EV37      EV94      Ttr11    
## 4495 "4495" "Offshore.North" "133/133" NA        NA        "211/211"
## 4496 "4496" "Offshore.North" "137/141" NA        "243/251" NA       
## 4498 "4498" "Offshore.North" "133/133" "220/220" "245/265" "209/211"
## 5814 "5814" "Offshore.North" "135/139" "210/226" "251/265" "209/209"
## 5815 "5815" "Offshore.North" "139/139" "212/212" "249/261" "207/213"
## 5816 "5816" "Offshore.North" "135/137" "214/220" "249/253" "207/209"
##      Ttr34    
## 4495 "185/191"
## 4496 "189/189"
## 4498 "189/189"
## 5814 "185/187"
## 5815 "185/185"
## 5816 "187/195"

The contents of a gtypes object can be written to a file with the write.gtypes function. This will write a .csv file with the allelic information and a .fasta file for any sequence data if it exists.