AGHmatrix: An R package to compute relationship matrices for diploid and autopolyploid species

Rodrigo Amadeu - rramadeu at ufl dot edu
Patricio Munoz
Horticultural Sciences Department - University of Florida

2019-03-26

Overview

AGHmatrix software is an R-package to build relationship matrices using pedigree (A matrix) and/or molecular markers (G matrix) with the possibility to build a combined matrix of Pedigree corrected by Molecular (H matrix). The package works with diploid and autopolyploid Data.

If you are not familiar with R, we recommend the reading of vignette Introduction to R.

Pedigree-based relationship matrix (A matrix)

        Additive         Non-Additive
Diploid         Henderson (1976)         Cockerham (1954)
Polyploid         Kerr (2012), Slater (2013)        

Molecular-based relationship matrix (G matrix)

        Additive         Non-Additive
Diploid         Yang (2010), VanRaden (2012)         Su (2012), Vitezica (2013)
Polyploid         Slater (2016), VanRaden (2012)         Slater (2016), Endelman (2018)

Citation

To cite this R package:

Amadeu, R. R., C. Cellon, J. W. Olmstead, A. A. F. Garcia, M. F. R. Resende, and P. R. Muñoz. 2016. AGHmatrix: R Package to Construct Relationship Matrices for Autotetraploid and Diploid Species: A Blueberry Example. The Plant Genome 9(30). doi: 10.3835/plantgenome2016.01.0009

Installing the Package

Within R:

## Installing and loading devtools
install.packages("devtools")
library(devtools)

## Installing from github the fullsibQTL
install_github("prmunoz/AGHmatrix")

To load the package

library(AGHmatrix)

Relationship matrices using the pedigree data - A matrix

In this section we presented how to load the data into the software and how to construct the pedigree-based relationship matrix (A-matrix) for diploid and autotetraploid species. In the package, the function Amatrix handles the pedigree and build the A-matrix related to that given pedigree. The matrix is build according to the recursive method presented in Mrode (2014) and described by Henderson (1976). This method is expanded for higher ploidy (n-ploidy) according with Kerr et al. (2012). After loading the package you have to load your data file into the software. To do this, you can use the function read.data() or read.csv() (If specifically the format .csv file is used) for example. Your data should be available in R as a dataframe structure in the following order: column 1 must be the individual/genotype names (id), columns 2 and 3 must be the parent names. In the package there is a pedigree data example (ped.mrode) that can be used to look at the structure and order the data.

To load ped.mrode:

data(ped.mrode)
ped.mrode
#>    Ind Par1 Par2
#> 1 Anc1    0    0
#> 2 Anc2    0    0
#> 3 Var1 Anc1 Anc2
#> 4 Var2 Anc1    0
#> 5 Var3 Var2 Var1
#> 6 Var4 Var3 Anc2

The example ped.mrode has 3 columns, column 1 contains the names of the individual/genotypes, column 2 contains the names of the first parent, column 3 contains the names of the second parental. There is no header and the unknown value must be equal 0. Your data has to be in the same format of ped.mrode. In the algorithm, the first step is the pre- processing of the pedigree: the individuals are numerated \(1\) to \(n\). Then, it is verified whether the genotypes in the pedigree are in chronological order (i.e. if the parents of a given individual are located prior to this individual in the pedigree dataset). If this order is not followed, the algorithm performs the necessary permutations to correct them. After this pre-processing, the matrices computation proceeds as in Mrode (2014) for diploid - for additive or dominance relationship - and as in Kerr et al. (2012) for autotetraploids - for additive relationship. For autotetraploids there is the option to include double-reduction fraction (as Slater et al., 2014). For diploids there is the option to compute the non-additive relationship matrix (Cockerham, 1954).

Computing additive relationship matrix for diploids:

Amatrix(ped.mrode, ploidy=2)
#> Verifying conflicting data... 
#> Organizing data... 
#> Your data was chronologically organized with success. 
#> Constructing matrix A using ploidy = 2 
#> Completed! Time = 1.666667e-05  minutes
#>      Anc1  Anc2   Var1   Var2   Var3   Var4
#> Anc1 1.00 0.000 0.5000 0.5000 0.5000 0.2500
#> Anc2 0.00 1.000 0.5000 0.0000 0.2500 0.6250
#> Var1 0.50 0.500 1.0000 0.2500 0.6250 0.5625
#> Var2 0.50 0.000 0.2500 1.0000 0.6250 0.3125
#> Var3 0.50 0.250 0.6250 0.6250 1.1250 0.6875
#> Var4 0.25 0.625 0.5625 0.3125 0.6875 1.1250

It follows some usage examples with the ped.mrode.

#Computing non-additive relationship matrix considering diploidy:
Amatrix(ped.mrode, ploidy=2, dominance=TRUE)

#Computing additive relationship matrix considering autotetraploidy:
Amatrix(ped.mrode, ploidy=4)

#Computing additive relationship matrix considering autooctaploidy:
Amatrix(ped.mrode, ploidy=8)

#Computing additive relationship matrix considering autotetraploidy and double-reduction of 10%:
Amatrix(ped.mrode, ploidy=4, w=0.1)

#Computing additive relationship matrix considering autohexaploidy and double-reduction of 10%:
Amatrix(ped.mrode, ploidy=6, w=0.1)

More information about Amatrix can be found with:

?Amatrix

Relationship matrices using the molecular data - G matrix for diploids

This section presents how to load the data and how to construct the genomic-based relationship matrix for diploid species. In the package, the function Gmatrix is the one that handles the molecular-markers matrix and builds the relationship matrix. Molecular markers data should be organized in a matrix format (individual in rows and markers in columns) coded as 0,1,2 and missing data value (numeric or NA). Import your molecular marker data into R with the function read.table() and convert to a matrix format with the function as.matrix(). The function Gmatrix can be used then to construct the additive relationship either as proposed by Yang et al. (2010) or the proposed by VanRaden (2008). The function can also construct the dominance relationship matrix either as proposed by Su et al. (2012) or as proposed by Vitezica et al. (2013). As an example, here we build the four matrices using real data from Resende et al. (2012).

To load snp.pine and to check its structure:

data(snp.pine)
str(snp.pine)
#>  int [1:926, 1:4853] -9 2 2 2 2 2 2 2 2 2 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : chr [1:926] "1087120" "1085618" "1091040" "1091686" ...
#>   ..$ : chr [1:4853] "0-10024-01-114" "0-10037-01-257" "0-10040-02-394" "0-10040-02-41" ...

In this dataset we have 926 individuals with 4853 markers and the missing data value is -9.

It follows some usage examples with the snp.pine data where the unknown value (missingValue) is -9. Here we set minimum allele frequency to 0.05.

#Computing the additive relationship matrix based on VanRaden 2008
G_VanRaden <- Gmatrix(SNPmatrix=snp.pine, missingValue=-9, maf=0.05, method="VanRaden")

#Computing the additive relationship matrix based on Yang 2010
G_Yang <- Gmatrix(SNPmatrix=snp.pine, missingValue=-9, maf=0.05, method="Yang")

#Computing the dominance relationship matrix based on Su 2012
G_Su <- Gmatrix(SNPmatrix=snp.pine, missingValue=-9, maf=0.05, method="Su")

#Computing the dominance relationship matrix based on Vitezica 2013
G_Vitezica <- Gmatrix(SNPmatrix=snp.pine, missingValue=-9, maf=0.05, method="Vitezica")

More information about Gmatrix can be found with:

?Gmatrix

Relationship matrices using the molecular data - G matrix for autopolyploids

This section presents how to load the data and how to construct the genomic-based relationship matrix for autopolyploid species. In the package, the function Gmatrix is the one that handles the molecular-markers matrix and builds the relationship matrix. Molecular markers data should be organized in a matrix format (individual in rows and markers in columns) coded according to the dosage level: 0,1,2,…,ploidy level, and missing data value (numeric or NA). As an example, an autohexaploid should be coded as 0,1,2,3,4,5,6, and missing data value. Import your molecular marker data into R with the function read.table() and convert to a matrix format with the function as.matrix(). In autopolyploids, the function Gmatrix can be used then to construct: i) the additive relationship based on VanRaden (2008) and extended by Ashraf (2016) and also implemented in the sommer package (Covarrubias-Pazaran 2016); ii) the full-autopolyploid including additive and non-additive model as equations 8 and 9 in Slater et al. (2016); iii) the pseudo-diploid model as equations 5, 6, and 7 Slater et al. (2016). As an example, here we build the three matrices using simulated data.

Simulating data with 10 individuals and 100 markers.

inds <- 10
markers <- 100
markersdata <- matrix(sample(x=0:4, size=inds*markers, replace=TRUE), nrow=inds, ncol=markers)

Computing the matrices:

#Computing the additive relationship matrix based on VanRaden 2008 and adapted by Ashraf 2016
G_VanRaden <- Gmatrix(markersdata, method="VanRaden", ploidy=4)

#Computing the dominance (digenic) matrix based on Endelman 2018 (Eq. 19)
G_Digenic <- Gmatrix(markersdata, method="Endelman", ploidy=4)

#Computing the full-autopolyploid matrix based on Slater 2016 (Eq. 8 and 9)
G_FullAutopolyploid <- Gmatrix(markersdata, method="Slater", ploidy=4)

#Computing the pseudodiploid matrix based on Slater 2016 (Eq. 5, 6, and 7)
G_Pseudodiploid <- Gmatrix(markersdata, method="VanRaden", ploidy=4, pseudo.diploid=TRUE) 

More information about Gmatrix can be found with:

?Gmatrix

Combined relationship matrix - H matrix

Here we present how to compute the H matrix using the package. This matrix is computed based on the matrix A using matrix G as a correction and it is presented in Munoz et al. 2014. In this example we consider the A matrix as an identity (e.g we do not have pedigree information and assume the individuals in Hardy-Weinberg equilibrium). It is important to pass to Hmatrix the same arguments of missingValue and maf of Gmatrix. If the Hmat gives singularity problem, you can try to add a small constant to its calculus changing argument c to a small value (e.g, 0.001).

#Computing the additive relationship matrix based on Yang 2010
Gmat <- Gmatrix(SNPmatrix=snp.pine, missingValue=-9, maf=0.05, method="VanRaden")

#Setting an A matrix structure
Amat <- diag(926) #we have 926 individuals

#A matrix has to have the same row and col names of G
rownames(Amat) <- colnames(Amat) <- rownames(Gmat)

#Computing H matrix
Hmat <- Hmatrix(A=Amat, G=Gmat, markers=snp.pine, missingValue=-9, maf=0.05)

Covariance matrices due to epistatic terms

Here we present how to easily compute the epistasis relationship matrices using Hadamard products (i.e. cell-by-cell product), denoted by *. For more information please see Muñoz et al. (2014). In this example we are using the molecular-based relationship matrix. First, build the additive and dominance matrices:

A <- Gmatrix(SNPmatrix=snp.pine, method="VanRaden", missingValue=-9, maf=0.05)
D <- Gmatrix(SNPmatrix=snp.pine, method="Vitezica", missingValue=-9,maf=0.05)

For the first degree epistatic terms:

#Additive-by-Additive Interactions
A_A <- A*A
#Dominance-by-Additive Interactions
D_A <- D*A
#Dominance-by-Dominance Interactions
D_D <- D*D

For the second degree epistatic terms:

#Additive-by-Additive-by-Additive Interactions
A_A_A <- A*A*A
#Additive-by-Additive-by-Dominance Interactions
A_A_D <- A*A*D
#Additive-by-Dominance-by-Dominance Interactions
A_D_D <- A*D*D
#Dominance-by-Dominance-by-Dominance Interactions
D_D_D <- D*D*D

And so on…

Inverting your matrix to be used in mixed models equations:

To invert your matrix you can use the R base function solve if your matrix has an unique inverse:

#Computing the matrix
A <- Amatrix(data=ped.mrode, ploidy=4, w=0.1)

#Building its unique inverse
Ainv <- solve(A)

However, it is common a genomic-based relationship matrix not be positive-definite and, therefore, has non-unique inverse. In such case, you may use one generalized inverse computation or cholesky decomposition. Here we show how to compute the Moore-Penrose inverse implemented in the MASS package:

#Computing the matrix
Gmat <- Gmatrix(SNPmatrix=snp.pine, missingValue=-9, maf=0.05, method="VanRaden")

#Building its Moore-Penrose generalized inverse
Ginv <- MASS::ginv(Gmat)

If you are still having trouble to invert the polyploid A matrix we recommend you to take a look at Hamilton & Kerr (2017).

Exporting your data to be used in ASReml - csv format

In this section, we present how to use the function formatmatrix to export a recently build matrix in the format compatible with ASReml standalone version. That is the lower diagonal matrix formatted in three columns in .csv format (other ASCII extension could be used as well). In order to do this, we need to build a matrix, its inverse, and export it using formatmatrix function. ASReml can invert the relationship matrix as well, probably more efficiently than R for large matrices (i.e. solve() function), so no need to invert the matrix in R if matrix is large. This function has as options: round.by, which let you decide the number of decimals you want; exclude.0, if TRUE, remove all the zeros from your data; and, name that defines the name to be used in the exported file. Use the default if not sure what parameter use in these function. Here an example using ped.mrode data:

#Loading the data example
data(ped.mrode)

#Computing the matrix
A <- Amatrix(data=ped.mrode, ploidy=4, w=0.1)

#Building its inverse
Ainv <- solve(A)

#Exporting it. The function "formatmatrix" will convert it and save in your working directory
formatmatrix(Ainv, round.by=12, exclude.0=TRUE, name="Ainv")

Making a loop in order to get several matrices

In this section, we present a simple for function for the user to be able to obtain in a practical way several matrices for different double reduction values (if polyploidy) to later be used in ASReml (for example). Similar to one made in Amadeu et al. (2016).

#Loading the data example
data(ped.mrode)

#Determining a double reduction range
double.red<-seq(0,0.2,0.05)

#Extracting the length of double.red
n<-length(double.red)

#Making the loop
for(i in 1:n){
  A<-Amatrix(data=ped.mrode,
             ploidy=4,
             w=double.red[i])
  #Computing the inverse
  A.inv<-solve(A)
  
  #Exporting as csv
  formatmatrix(data=A.inv,
               name=paste("Ainv_",double.red[i],sep=""),
               round.by=12,
               exclude.0=TRUE)
}

Bibliography

Amadeu, R. R., C. Cellon, J. W. Olmstead, A. A. Garcia, M. F. Resende, and P. R. Muñoz, 2016 AGHmatrix: R package to construct relationship matrices for autotetraploid and diploid species: a blueberry example. The Plant Genome 9.

Ashraf, B. H., S. Byrne, D. Fé, A. Czaban, T. Asp, M. G. Pedersen, I. Lenk, N. Roulund, T. Didion, C. S. Jensen, et al., 2016 Estimating genomic heritabilities at the level of family-pool samples of perennial ryegrass using genotyping-by-sequencing. Theoretical and Applied Genetics 129: 45-52.

Covarrubias-Pazaran, G. 2016 Genome-assisted prediction of quantitative traits using the R package sommer. PloS one 11.6: e0156744.

Endelman, J. B., et al., 2018. Genetic variance partitioning and genome-wide prediction with allele dosage information in autotetraploid potato. Genetics, 209(1) pp. 77-87.

Hamilton, M. G., R. J. Kerr, 2017 Computation of the inverse additive relationship matrix for autopolyploid and multiple-ploidy populations. Theoretical and Applied Genetics. https://doi.org/10.1007/s00122-017-3041-y

Henderson, C., 1976 A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics pp. 69–83.

Kerr, R. J., L. Li, B. Tier, G. W. Dutkowski, and T. A. McRae, 2012 Use of the numerator relation ship matrix in genetic analysis of autopolyploid species. Theoretical and Applied Genetics 124: 1271–1282.

Mrode, R. A., 2014 Linear models for the prediction of animal breeding values. Cabi. 3rd ed.

Muñoz, P. R., M. F. Resende, S. A. Gezan, M. D. V. Resende, G. de los Campos, M. Kirst, D. Huber,and G. F. Peter, 2014 Unraveling additive from nonadditive effects using genomic relationship matrices. Genetics 198: 1759–1768.

R Core Team, 2016 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Resende, M. F., P. Muñoz, M. D. Resende, D. J. Garrick, R. L. Fernando, J. M. Davis, E. J. Jokela ,T. A. Martin, G. F. Peter, and M. Kirst, 2012 Accuracy of genomic selection methods in a standard data set of loblolly pine (pinus taeda l.). Genetics 190: 1503–1510.

Slater, A. T., G. M. Wilson, N. O. Cogan, J. W. Forster, and B. J. Hayes, 2014 Improving the analysis of low heritability complex traits for enhanced genetic gain in potato. Theoretical and applied genetics 127: 809–820.

Su, G., O. F. Christensen, T. Ostersen, M. Henryon, and M. S. Lund, 2012 Estimating additive and non-additive genetic variances and predicting genetic merits using genome-wide dense single nucleotide polymorphism markers. PloS one 7: e45293.

VanRaden, P., 2008 Efficient methods to compute genomic predictions. Journal of dairy science 91: 4414–4423.

Vitezica, Z. G., L. Varona, and A. Legarra, 2013 On the additive and dominant variance and covariance of individuals within the genomic selection scope. Genetics 195: 1223–1230.

Yang, J., B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt, P. A. Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, et al., 2010 Common snps explain a large proportion of the heritability for human height. Nature genetics 42: 565–569.

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.10
#> 
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] AGHmatrix_0.0.5 rmarkdown_1.10  knitr_1.20     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.0       codetools_0.2-15 lattice_0.20-35  digest_0.6.18   
#>  [5] rprojroot_1.3-2  grid_3.5.1       backports_1.1.2  magrittr_1.5    
#>  [9] evaluate_0.12    stringi_1.3.1    Matrix_1.2-14    tools_3.5.1     
#> [13] stringr_1.4.0    yaml_2.2.0       compiler_3.5.1   htmltools_0.3.6