Using MAGNAMWAR: a case study with Drosophila melanogaster

Corinne Sexton

2017-08-25

Introduction

MAGNAMWAR is a software package for bacterial genome wide association (GWA). Relative to standard approaches for GWA, e.g. in humans, bacterial genomes and phenotyping experiments have unique characteristics that suggest the use of different variant calling and statistical approaches may improve the association analysis (reviewed in Power, et al, 2017; Pubmed ID 27840430). MAGNAMWAR enables GWA based on bacterial gene presence or gene variant, permits the use of different statistical modelling approaches, and incorporates population structure into models based on user-defined parameters. Genes are clustered into orthologous groups (OGs) using the OrthoMCL gene clustering software, and can be statistically associated with raw or aggregate (e.g. mean) phenotype data. MAGNAMWAR is especially useful for performing associations when the phenotypes of phylogenetically disparate taxa are analyzed, and the calling of fine-scale variants (e.g. SNPs) is challenging or inappropriate. On the other hand, OrthoMCL gene clustering software may lack resolution when comparing phenotypes between strains from the same bacterial species, and OrthoMCL becomes computationally prohibitive as the number of sampled taxa increases above several hundred isolates. For such implementations, we recommend users consider other existing software reviewed in Power, et al. 2017 that are designed for processing hundreds or thousands of samples; or for performing SNP-based comparisons where it may be more appropriate to consider factors such as linkage.

This vignette describes a recommended workflow to take full advantage of MAGNAMWAR. The data are from a study on bacterial determinants of Drosophila melanogaster triglyceride content, and are representative of any number of datasets that associate one phenotype with one bacterial species. The data included in the package and used in documentation examples are highly subsetted (2 orders of magnitude smaller) from the original dataset to increase speed of the example analyses. Most analyses can be run on a standard laptop computer, although we recommend a desktop computer with at least 16GB RAM for large (>500 phenotype measures) datasets. The functions are presented in their recommended order of operations using the case study fruit fly triglyceride data included in the package.

1. A Brief Outline

The following table outlines the specific input and outputs per function in the order each function should be used. Only the essential functions are listed; optional functions are discussed in their relevant sections.

Function Purpose Input Output Case Study Input Case Study Output
format_MCLfastas prep amino acid fasta files for OrthoMCL analysis 4-letter abbreviated amino acid fasta files of every host-mono-associated organism concatenated fasta file of all inputs, removes any duplicate ids fastas in extdata/fasta_dir/ MCLformatted_all.fasta
format_afterOrtho format output of OrthoMCL clusters to be used with analyses groups file from OrthoMCL Parsed data contained in a list of 2 objects (presence absence matrix, protein ids) extdata/groups_ example_r.txt after_ortho_format_grps
AnalyzeOrthoMCL main analysis of data format_afterOrtho output; phenotypic data a matrix with 7 variables (cluster group, p-value, corrected p-value…) after_ortho_format_grps; pheno_data mcl_mtrx_grps
join_repseq appends representative sequences with AnalyzeOrthoMCL matrix format_afterOrtho output; fasta files; output matrix of AnalyzeOrthoMCL a data frame of the joined matrix after_ortho_format_grps; extdata/fasta_dir/; mcl_mtrx_grps joined_mtrx_grps

2. Format For OrthoMCL Gene Clustering

The first step is to assign each gene to a cluster of orthologous groups. This pipeline uses OrthoMCL for clustering, either through a local install (requires extensive RAM; see supplementary data) or web-based executable (http://orthomcl.org/orthomcl/; max 100,000 sequences). OrthoMCL software requires specific fasta sequence header formats and that there are no duplicate protein ids. The fasta header format contains 2 pipe-separated pieces of information: a unique 3-4 alphanumeric taxon designation and a unique protein ID classifier, e.g. >apoc|WP_000129691. Two functions aid in file formatting: FormatMCLFastas, which formats genbank files, and RASTtoGBK, which formats files from RAST.

FormatMCLFastas

FormatMCLFastas converts amino-acid fasta files to an OrthoMCL compliant format and combines them into a concatenated file called “MCLformatted_all.fasta”. All amino acid fasta files must first be placed in a user-specified directory and given a 3-4 letter alphanumeric name, beginning with a non-numeric character (e.g. aac1.fasta). Example fasta files are included in the MAGNAMWAR package. The output “MCLformatted_all.fasta” file is the input for the OrthoMCL clustering software.

RASTtoGBK

For fasta files that are not in the NCBI format, they should be converted in a separate folder to an NCBI-compatible format before running FormatMCLFastas. For example, files in the RAST format have the initial header >fig|unique_identifer; not >ref|xxxx|xxxx|unique_identifier|annotation. To convert these or any other files to NCBI-compatible format, use the RASTtoGBK function to merge the annotation using the unique identifier as a lookup, specifying the name of the fasta file (with a 3-4 letter alphanumeric name beginning with a non-numeric character), path to the reference file containing the annotation, and the output folder. If the reference file bearing the annotation is from RAST the lookup file can be downloaded as the ‘spreadsheet (tab separated text format)’. If the reference file is from a different source, format it so that the unique identifier is in the 2nd column and the reference annotation is in the 8th column.

lfrc_fasta <- system.file('extdata', 'RASTtoGBK//lfrc.fasta', package='MAGNAMWAR')
lfrc_reference <- system.file('extdata', 'RASTtoGBK//lfrc_lookup.csv', package='MAGNAMWAR')
lfrc_path <- system.file('extdata', 'RASTtoGBK//lfrc_out.fasta', package='MAGNAMWAR')

RASTtoGBK(lfrc_fasta,lfrc_reference,lfrc_path)

2.5 Call Gene Clusters with OrthoMCL Software

After formatting fasta files, run OrthoMCL clustering software either locally or online. OrthoMCL documentation describes how local installations can vary the inflation factor parameter to adjust the resolution of gene clustering.

More information about OrthoMCL clustering software can be found at: http://www.orthomcl.org/.

3. Format Clusters for Statistical Analysis

FormatAfterOrtho

The FormatAfterOrtho function reformats the direct output from OrthoMCL for the MAGNAMWAR pipeline. To use the command, call FormatAfterOrtho, specifying the location of the OrthoMCL clusters file (online: OrthologGroups file; local: output of orthomclMclToGroups command) and whether the software was run online (“ortho”; default) or locally (“groups”). The following sample data were derived using local clustering.

# file_groups is the file path to your output file from OrthoMCL

parsed_data <- FormatAfterOrtho(file_groups, "groups")

The new stored variable is a list of matrices. The first matrix is a presence/absence matrix of taxa that bear each cluster of orthologous groups (OGs). The second matrix lists the specific protein ids within each cluster group. The data in each can be accessed by calling parsed_data[[1]] or parsed_data[[2]], respectively, although this is not necessary for subsequent pipeline steps. Because it contains the specific protein ids for each OG, this list is a key input for most subsequent functions.

a subset of resulting variable parsed_data:

  1. OG presence/absence matrix:

parsed_data[[1]][,1:13]

##           aace apap apas atro bsub dmos ecol efao efav ehor geur gfra ghan
## ex_r00000 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00001 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00002 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00003 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "0"  "1" 
## ex_r00004 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00005 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00006 "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00007 "1"  "1"  "1"  "1"  "1"  "0"  "1"  "1"  "1"  "1"  "1"  "1"  "1" 
## ex_r00008 "0"  "1"  "1"  "0"  "1"  "0"  "1"  "1"  "1"  "1"  "1"  "0"  "0" 
## ex_r00009 "1"  "1"  "0"  "0"  "1"  "0"  "1"  "0"  "0"  "1"  "0"  "0"  "1" 
## ex_r00010 "0"  "1"  "0"  "1"  "1"  "0"  "1"  "1"  "1"  "1"  "1"  "1"  "1"
  1. Protein IDs in each OG:

parsed_data[[2]][,1:2]

##     ex_r00000                ex_r00001               
## V2  "aace|ZP_08699022.1"     "bsub|NP_389730.2"      
## V3  "apap|940282.4.peg.2315" "bsub|NP_390261.1"      
## V4  "apas|ZP_16282535.1"     "aace|ZP_08696506.1"    
## V5  "atro|ZP_08644184.1"     "apap|940282.4.peg.1501"
## V6  "bsub|NP_391359.1"       "apas|ZP_16281966.1"    
## V7  "dmos|ZP_08468863.1"     "atro|ZP_08644394.1"    
## V8  "ecol|NP_415408.1"       "dmos|ZP_08469662.1"    
## V9  "efao|YP_005708185.1"    "ecol|NP_414920.1"      
## V10 "efav|NP_815059.1"       "efao|YP_005708912.1"   
## V11 "ehor|ZP_08496675.1"     "efav|NP_816072.1"      
## V12 "geur|ZP_08902778.1"     "ehor|ZP_08497259.1"    
## V13 "gfra|ZP_11376413.1"     "geur|ZP_08902509.1"    
## V14 "ghan|ZP_06835021.1"     "gfra|ZP_11376782.1"    
## V15 "gobo|ZP_08896945.1"     "ghan|ZP_06833288.1"    
## V16 "gxyl|YP_004867578.1"    "gobo|ZP_08898428.1"    
## V17 "lani|ZP_08549469.1"     "gxyl|YP_004868985.1"   
## V18 "lbre|ZP_03940694.1"     "lani|ZP_08548797.1"    
## V19 "lbuc|YP_004398648.1"    "lbuc|YP_004398815.1"   
## V20 "lcas|YP_006751102.1"    "lcas|YP_006752056.1"   
## V21 "lfal|ZP_08312985.1"     "lfal|ZP_08312677.1"    
## V22 "lfer|ZP_03944435.1"     "lfer|ZP_03944015.1"    
## V23 "lfru|ZP_08652182.1"     "lfru|ZP_08652638.1"    
## V24 "llac|1358.26.peg.2055"  "llac|1358.26.peg.706"  
## V25 "lmai|ZP_09447100.1"     "lmai|ZP_09447014.1"    
## V26 "lmal|ZP_09441330.1"     "lmal|ZP_09441246.1"    
## V27 "lpla|YP_004888740.1"    "lpla|YP_004888561.1"   
## V28 "lrha|YP_003170666.1"    "lrha|YP_003171609.1"   
## V29 "lver|ZP_09442552.1"     "lver|ZP_09443795.1"    
## V30 "pbur|ZP_16361594.1"     "pbur|ZP_16363149.1"    
## V31 "pput|YP_001266156.1"    "pput|YP_001270272.1"   
## V32 "smar|AMF-0004114"       "smar|AMF-0001686"      
## V33 "spar|YP_006309044.1"    "spar|YP_006310046.1"   
## V34 "swit|YP_001264509.1"    "swit|YP_001264948.1"   
## V35 "efao|YP_005709159.1"    "efao|YP_005707743.1"   
## V36 "efav|NP_816368.1"       "efav|NP_814698.1"      
## V37 "lfer|ZP_03944497.1"     "llac|1358.26.peg.28"   
## V38 "lver|ZP_09443619.1"     ""

4. Perform Metagenome-wide Association

AnalyzeOrthoMCL

The AnalyzeOrthoMCL function performs the statistical tests to compare the phenotypes of taxa bearing or lacking each OG. It requires the R object output of the FormatAfterOrtho function and a phenotype matrix containing the variables for the statistical tests. Seven different tests are supported, each deriving the significance of OG presence/absence on a response variable, and specified by the following codes:

To run AnalyzeOrthoMCL, the following parameters are required:

  1. the output of FormatAfterOrtho
  2. a data frame (NOT a path) of phenotypic data*
  3. model name (as described above)
  4. name of column containing 4-letter taxa designations
  5. column names of certain variables depending on which model is specified:
    • lm: resp
    • wx: resp
    • lmeR1: resp, rndm1
    • lmeR2ind: resp, rndm1, rndm2
    • lmeR2nest: resp, rndm1, rndm2
    • lmeF2: resp, rndm1, rndm2, fix2
    • survmulti: time, event, rndm1, rndm2, multi
    • survmulticensor: time, time2, event, rndm1, rndm2, fix2, multi, output_dir

In order to create a data frame for your phenotypic data use the following command where file_path is the path to your phenotypic data file: (the file_path used in this case study is not available because pheno_data exists as an .rda file)*

pheno_data <- read.table(file_path, header = TRUE, sep = ',')

A subset of the phenotypic file for TAG content pheno_data is shown below:

##   Treatment  RespVar Vial Experiment
## 1      apoc 2.821429    A          A
## 2      apoc 5.500000    B          A
## 3      apoc 8.464286    C          A
## 4      jacg 5.392857    A          A
## 5      jacg 2.821429    B          A

To run AnalyzeOrthoMCL use the headers within pheno_data to specify variables:

It is not necessary to specify the OG presence/absence variable, which is automatically populated from the FormatAfterOrtho output. All other variables must be specified using column names in the phenotype matrix.

We will thus populate AnalyzeOrthoMCL using the headers within pheno_data to specify variables:

For example, to test the effect of gene presence/absence on fat content using a mixed model with nested random effects, the following command should be used:

mcl_matrix <- AnalyzeOrthoMCL(parsed_data, 
                               pheno_data, 
                               model = 'lmeR2nest', 
                               species_name = 'Treatment',  
                               resp = 'RespVar', 
                               rndm1 = 'Experiment', 
                               rndm2 = 'Vial')

An optional but highly recommended parameter of AnalyzeOrthoMCL is the princ_coord parameter, which allows the user to incorporate population structure. The options for this parameter are 1, 2, 3, or a decimal. Numbers 1-3 specify how many principal coordinates to include as fixed effects in the model and a decimal specifies to use as many principal coordinates as needed for that decimal percentage of the variance to be accounted for. When a user specifies one of these options, the software automatically calculates a principal coordinates matrix from the FormatAfterOrtho presence absence matrix, and incorporated the specified number of principal coordinates into the model See below in the QQ Plot section to see an example of plotted analyses using principal coordinates in the model, and how incorporating population structure into the models can improve the statistical distribution of the results.

The resulting output matrix contains 7 columns:

  1. OG - Clustered orthologous group name from OrthoMCL
  2. pval1 - p-value of test
  3. corrected_pval1 - Bonferroni corrected p-value
  4. mean_OGContain - mean of phenotypic data value of all taxa in OG
  5. mean_OGLack - mean of phenotypic data value of all taxa not in OG
  6. taxa_contain - taxa in OG
  7. taxa_miss - taxa not in OG

mcl_mtrx[,1:3] Showing the first three columns of mcl_matrix.

##       OG          pval1                  corrected_pval1      
##  [1,] "ex_r00000" "0.0680152959317386"   "0.748168255249125"  
##  [2,] "ex_r00001" "0.0680152959317386"   "0.748168255249125"  
##  [3,] "ex_r00002" "0.000169433279973097" "0.00186376607970407"
##  [4,] "ex_r00003" "0.217474911482261"    "1"                  
##  [5,] "ex_r00004" "0.4482634415059"      "1"                  
##  [6,] "ex_r00005" "0.0899794361093273"   "0.989773797202601"  
##  [7,] "ex_r00007" "0.999790593888518"    "1"                  
##  [8,] "ex_r00008" "0.00669725842125168"  "0.0736698426337685" 
##  [9,] "ex_r00009" "0.186956618632854"    "1"                  
## [10,] "ex_r00010" "0.00824936629366202"  "0.0907430292302822"

Survival analysis example

AnalyzeOrthoMCL also provides for analysis using a survival model. A survival analysis is a method for calculating the difference between two treatments on time-series data, which may not always be normally distributed and is most likely to be relevant to host, but perhaps not bacterial, phenotypes. Users should provide a data frame of their phenotypic data, similar to the data frame described above for the other models. For more information see the survival package, which includes several useful guides and vignettes (https://CRAN.R-project.org/package=survival).

Briefly, for each individual in a population that reaches a benchmark event, such as death, the time of death is recorded in the ‘time’ column, and a ‘1’ is recorded in the ‘event’ column, indicating the individual died at time X. If an individual left an experiment prematurely (e.g. moved away from a location where a study was conducted, a fly escaped from a vial before death), the event column is labelled ‘0’ to indicate it survived until that point in the experiment. Other metadata, including the sample name, are included on the same row as these 2 data points under other columns in a matrix.

##   EXP  VIAL BACLO  TRT t1       t2 event
## 1   1 01A-6     1 lbrc  0 26.33333     1
## 2   1 01A-7     1 lbrc  0 19.00000     1
## 3   1 01A-8     1 lbrc  0 26.73077     1

Because each individual is specified individually, survival analyses are often conducted on large datasets with thousands of measurements. To expedite the use of survival analyses in iterative BGWA testing, our survival analysis can be multi-threaded to run on multiple cores, using the multi option. The survmulticensor option is included to break up the tests for further parallelization. This function allows the user to optionally input a startnum and stopnum signifying which and how many tests to run. The output of those certain tests can then be written into the output_dir where the SurvAppendMatrix function can be used to concatenate all small tests together.

mcl_mtrx <- AnalyzeOrthoMCL(after_ortho_format, starv_pheno_data, species_name = 'TRT', model='survmulti', time='t2', event='event', rndm1='EXP', rndm2='VIAL', multi=1)

5. Post-statistical analysis

JoinRepSeq

JoinRepSeq randomly selects a representative protein annotation and amino acid sequence from each OG and appends it to the AnalyzeOrthoMCL output matrix. The purpose is to identify OG function. The result is a data frame bearing 4 additional variables:

  1. rep_taxon - taxon that contains the representative sequence
  2. rep_id - representative protein id
  3. rep_annot - representative protein annotation
  4. rep_seq - representative protein sequence

Four inputs are specified for join\_repseq:

  1. The list produced by FormatAfterOrtho
  2. The fasta directory of the 4-letter abbreviated fasta files
  3. The matrix produced by AnalyzeOrthoMCL
  4. The NCBI sequence format (“old”(default) or “new”)
dir <- system.file('extdata', 'fasta_dir', package='MAGNAMWAR')
dir <- paste(dir,'/',sep='')
joined_matrix <- JoinRepSeq(after_ortho_format_grps, dir, mcl_mtrx_grps, fastaformat = 'old')
## 
## picking and writing representative sequence for PDG:
## 9.09%__18.18%__27.27%__36.36%__45.45%__54.55%__63.64%__72.73%__81.82%__90.91%__

joined_matrix[1,8:10] An example of three of the appended columns are shown below:

##   rep_taxon      rep_id                                        rep_annot
## 1      efav NP_816368.1  thioredoxin reductase/glutathione-like protein

6. Exporting Data

MAGNAMWAR statistical analysis can be exported into either tab-separated matrices or into graphical elements.

6a. Matrices

PrintOGSeqs

Allows the user to output all protein sequences in a specified OG in fasta format.

SurvAppendMatrix

When survival models are used in the MGWA a single .csv file is printed for each test. SurvAppendMatrix joins each individual file into one complete matrix.

WriteMCL

Writes the matrix result from AnalyzeOrthoMCL into a tab-separated file.

6b. Graphics

MAGNAMWAR also offers several options for graphical outputs. The several ways to visualize data are explained in detail below.

PDGPlot

PDGPlot visualizes the phenotypic effect of bearing a OG. The phenotype matrix is presented as a bar chart, and gene presence/absence is represented by different bar shading.

Calling PDGPlot requires the same pheno_data variable as AnalyzeOrthoMCL and similarly takes advantage of the user specified column names from the pheno_data data frame. It also requires the mcl_matrix object and specification of which OG to highlight.

For example, to visualize the means and standard deviations of the taxa which are present in OG “ex_r00002”, the OG with the lowest corrected p-value (“0.00186”). Green and gray bars represent taxa that do or do not contain a gene in the OG, respectively.

PDGPlot(pheno_data, mcl_matrix, OG = 'ex_r00002', species_colname = 'Treatment', data_colname = 'RespVar', ylab = "TAG Content")

The different taxa can be ordered alphabetically, as above, or by specifying the order with either the tree parameter (for phylogenetic sorting) or the order parameter, which takes a vector to determine order. For example, the taxa can be ordered by phylogeny by calling a phylogenetic tree file with the taxon names as leaves (note any taxa in pheno_data that are not in the tree file will be omitted).

tree <- system.file('extdata', 'muscle_tree2.dnd', package='MAGNAMWAR')

PDGPlot(pheno_data, mcl_matrix, 'ex_r00002', 'Treatment', 'RespVar', ylab = "TAG Content", tree = tree)

PhyDataError

PhyDataError is similar to PDGPlot and adds visualization of phylogenetic tree with the phenotypic means and standard deviation.

Calling PhyDataError requires the following parameters:

  • a phylogenetic tree
  • the same pheno_data variable as AnalyzeOrthoMCL (and its column names)
  • the mcl_matrix object
  • a specification of which OG to highlight
PhyDataError(tree, pheno_data, mcl_matrix, species_colname = 'Treatment', data_colname = 'RespVar', OG='ex_r00002', xlabel='TAG Content')

pdg_v_OG

PDGvOG produces a histogram of the number of OGs in each phylogenetic distribution group (PDG). A PDG is the set of OGs present and absent in the exact same set of bacterial taxa. The main purpose of the graph is to determine the fraction of OGs that are present in unique or shared sets of taxa, since the phenotypic effect of any OGs in the same PDG cannot be discriminated from each other.

To run PDGvOG, provide the output of FormatAfterOrtho. The num parameter (default 40) is used to specify the amount of OGs per PDG that should be included on the x axis.

For example, in the graph below there are 11 PDGs only contain one OG, meaning that 11 PDGs have one group that has a unique distribution of taxa present and absent.

PDGvOG(parsed_data,0)

Because this data is an extreme subset, it isn’t very informative. A full data set is shown below. This shows us that in this particular OrthoMCL clustering data, 4822 PDGs exist with only 1 unique OG in the PDG, and around 500 PDGs exist with 2 OGs that share the same presence and absence of taxa and so on.

QQPlotter

A simple quartile-quartile plot function that is generated using the matrix of AnalyzeOrthoMCL. To run QQPlotter, provide the output matrix of AnalyzeOrthoMcl.

QQPlotter(mcl_matrix)

Using this function, we provide a visualization of the use of the princ_coord parameter in AnalyzeOrthoMCL. Notice how each additional principal coordinate reduces statistical inflation. In practice it is usually computationally inexpensive to run several iterations of the analysis with a different number of principal coordinates to test how their incorporation improves the statistical fit.