BIGDAWG

Derek Pappas, Ph.D. (dpappas@chori.org)

2016-08-29

Overview

‘Bridging ImmunoGenomic Data-Analysis Workflow Gaps’ (‘BIGDAWG’) is an integrated analysis system that automates the manual data-manipulation and trafficking steps (the gaps in an analysis workflow) normally required for analyses of highly polymorphic genetic systems (e.g., the immunological human leukocyte antigen (HLA) and killer-cell Immunoglobulin-like receptor (KIR) genes) and their respective genomic data (immunogenomic) (Pappas DJ, Marin W, Hollenbach JA, Mack SJ. 2016. ‘Bridging ImmunoGenomic Data Analysis Workflow Gaps (BIGDAWG): An integrated case-control analysis pipeline.’ Human Immunology. 77:283-287). Starting with unambiguous genotype data for case-control groups, ‘BIGDAWG’ performs tests of Hardy-Weinberg equilibrium, and carries out case-control association analyses for haplotypes, individual loci, and HLA amino acid positions.

Input Data

Data for BIGDAWG should be in a tab delimited text format (UTF-8 encoding). The first row must be a header line and must include column names for genotype data. The first two columns must contain subject IDs and phenotypes (0 = control, 1 = case). While a phenotype maybe include disease status, non-status phenotypes (onset, severity, ancestry, etc.) maybe used. However, phenotype designatons in the dataset are restricted to the use of 0s and 1s. Genotype pairs must be located in adjacent columns. Column names for a given locus must have the same name; do not use ’_1’, ‘.1’, etc.

For HLA alleles, you may choose to format your genotype calls with our without the locus prefix. For example, for HLA-A, a given genotype call maybe 01:01:01:01 or A*01:01:01:01. Allele names can include any level of resolution, from a single field up to the full length name. For HLA-DRB3,-DRB4,-DRB5 genotype calls, you may choose to represent those in a single pair of columns or as separate pairs of columns for each locus. However, when submitted as a single pair of columns, genotypes must be formatted as Locus*Allele (including non-DRB loci). The column names must either be DRB345 or DRB3/4/5. Homozygous or hemizygous status for DRB3, DRB4 and DRB5 genotypes is determined in reference to DRB1 genotypes based on the haplotypes as defined by Andersson, 1998 (Andersson G. 1998. Evolution of the HLA-DR region. Front Biosci. 3:d739-45). If you wish to define your own zygosity, it is suggested you split them into separate pairs of columns for each locus.

Missing Information
When there is missing information, either for lack of genotyping information or absence of genotyped loci, BIGDAWG allows for conventions to differentiate the type of data that is missing.

Data missing due to lack of a molecular genotyping result is considered not available (NA). Acceptable NA strings include: NA, ****, -, na and Na. Empty data cells will be considered NA.

Data missing due to genomic structural variation (i.e., no locus present) is considered absence. Acceptable absence strings include: Absent, absent, Abs, ABS, ab, Ab, AB, ^. The last symbol is the unicode caret symbol. BIGDAWG allows for a special allele name that indicates absence of an HLA locus: 00, 00:00, 00:00:00 and 00:00:00:00 are all acceptable indicators of locus absence. When choosing to use 00’s (zeros) to populate allele name fields, use similar or higher levels of resolution http://hla.alleles.org and follow the same naming convention as with other genotype calls (either with or without locus prefix). If using a single column pair for DRB3/4/5 and the “00” absence indicator, then do NOT affix a locus prefix for the absent calls. In this case, only include the locus prefix for known DRB345 genotypes (i.e., DRB3/4/5*00:00 is NOT an acceptable name). For HLA data, the 00:00 naming convention is preferred and all absent designations will be converted to allow the amino acid analysis to test phenotypic associations with locus absence (see below).

Novel Alleles
BIGDAWG will accept novel allele names. However, it is suggested you follow the same naming convention for novel alleles as with other genotypes calls in your data, either with or without the locus prefix. For example, new alleles could be submitted as follows: New, 01:New, or A*01:New. Unfortunately, the BIGDAWG amino acids analysis cannot accept new allele designations and will display an error. If you would like to run the amino acid analysis, it is suggested you replace the new allele with NA or omit the subject entirely.

Example of data architecture and acceptable values:

SubjectID Disease A A B B DRB1 DRB1 DRB3 DRB3
subject1 0 01:01 02:01 08:01 44:02 01:01 03:01 Abs Abs
subject2 1 02:01 24:02 51:01 51:01 11:01 14:01 02:02 02:11
subject3 0 03:01 26:02 NA NA 10:01 08:01 00:00 00:00

Data Output

After the package is run, BIGDAWG will create a new folder entitled ‘output hhmmss ddmmyy’ in the working directory (unless otherwise specified by Results.Dir parameter, see below). Within the output folder will be a precheck file (‘Data_Summary.txt’) detailing the summary statistics of the dataset and the results of the Hardy-Weinberg equilibrium test (‘HWE.txt’). If any errors are present, a log file (‘Error_Log.txt’) will be written. If no locus subsets are specified (see parameters section), a single subfolder entitled ‘set1’ will contain the outputs of each association analysis run. If multiple locus subsets are optioned, subfolders for each locus set will be created containing the respective analytic results for that subset. Within each set subfolder, a parameter file will detail the parameters that are relevant to that subset.

Data output to both the console and text files can be supressed with Verbose=F and Output=F respectively. This is particularly useful when the user prefers to send the results to an object (Return=T) rather than to text files for further analysis. The result is returned as a list with indices for the Hardy-Weinberg equilibrium test and indices for each result within the Loci.Set (Set1, Set2, etc). If no Loci.Sets were specified, only one list element will exist for Set1.

For an example. If BIGDAWG outputs to an object called ‘BD’ in R (see examples below).

BD
..$HWE - Results of the the Hardy-Weinberg equilibrium test
..$Set1$H - Results of the ‘Haplotype’ analysis for Set1
..$Set1$L - Results of the ‘Locus’ analysis for Set1
..$Set1$A - Results of the ‘Amino Acid’ analysis for Set1
..$Set2...
..$Set3...

names(BD) - Display complete list of available sets to index

If the user chooses to run all pairwise combinations of loci (All.Pairwise=T) for the haplotype analysis, BD$H mentioned above will be a list corresponding to each analysis run for (1) all loci available (1st index in the list) and (2) for each pairwise locus combination (2nd index and greater). For a complete list of available pairwise results, use names(BD$H).

Error Messages and Codes

BIGDAWG has a few built-in checks to ensure data format consistency and compatibility, especially for HLA data. BIGDAWG also does a parameter review before performing chi-squared tests and returns ‘NCalc’ (not calculated) when all genotypes have expected counts < 5 or the degrees of freedom do not allow for a test (e.g., dof < 1).

Parameters

BIGDAWG(Data, HLA=T, Run.Tests, Loci.Set, All.Pairwise=F, Trim=F, Res=2, Missing=0, Cores.Lim=1L, Results.Dir, Return=F, Output=T, Verbose=T)

Data

Class: String. Required. No Default.

e.g., Data="HLA_data" -or- Data="foo.txt"

Specifies genotype data file name. May use file name within working directory or full file name path to specify file location. See Data Input section for details about file formatting. Use ‘Data=HLA_data’ to analyze the bundled synthetic dataset.

HLA

Class: logical. Optional. Default = T.

Indicates whether or not your data is specific for HLA loci. If your data is not HLA, is a mix of HLA and data for other loci, or includes non-standard HLA allele names, you should set HLA = F. This will override the Trim and EVS.rm arguments, and will skip various tests and checks. Set HLA = T if and only if the dataset HLA alleles name are consistent with the most recent IPD-IMGT/HLA database release.

Run.Tests

Class: String or Character vector. Optional. Default = Run all tests.

e.g., Run.Tests = c("L","A") -or- Run.Tests = "HWE"

Specifies which tests to run in analysis. “HWE” will call the Hardy Weinberg Equilibrium test, “H” will call the haplotype association test, “L” will call the locus association test, and “A” will call the amino acid test. Combinations of the test are permitted.

Currently, the amino acid analysis is limited to the HLA-A, -B, -C, -DRB1, -DRB3, -DRB4, -DRB5, -DQA1, -DQB1, -DPA1 and -DPB1 loci.

Loci.Set

Class: List. Optional. Default = Use all loci.

e.g., Loci.Set=list(c("DRB1","DQB1"),c("A","DRB1","DPB1"), c("DRB1","DRB3"))

Input list defining which loci to use for analyses. Combinations are permitted. If you included HLA-DRB3,-DRB4,-DRB5 as a collapsed column pair (‘DRB3/4/5’), you must specifiy the single locus in the Loci.set if you wish them to be included in an analysis set (i.e., ‘DRB3’ NOT ‘DRB3/4/5’).

The pair of alleles for a locus must be in adjacent columns in the analyzed data set. Running multiple sets is ONLY relevant for the haplotype analysis. For all other analyses, loci are treated independently. Consider running haplotype analysis independently when optioning multi-locus sets to avoid redundancy of the other analyses. Each locus set’s output will be contained within a separate set folder numbered numerical (see Data Output section).

All.Pairwise

Class: Logical. Optional. Default = F.

Should pairwise combinations of loci be run in the haplotype analysis? Only relevant to haplotype analysis.

EVS.rm

Class: Logical. Optional. Default = F. (HLA=T specific).

Flags whether or not to strip expression variant suffixes from HLA alleles. Example: A*01:11N will convert to A*01:11. Should not be optioned for data that does not conform to HLA naming conventions.

Trim

Class: Logical. Optional. Default = F. (HLA=T specific).

Flags whether or not to Trim HLA alleles to a specified resolution. Should not be optioned for data that does not conform to HLA naming conventions.

Resolution

Class: Numeric. Optional. Default = 2. (HLA=T specific).

Sets the desired resolution when trimming HLA alleles. Used only when Trim = T. Fields for HLA formatting must follow current colon-delimited nomenclature conventions. Currently, the amino acid analysis will automatically truncate to 2-field resolution. Trimming is automatic and need not be optioned for amino acid analysis to run. This test will not run for data that does not conform to HLA naming conventions.

Missing

Class: String/Numeric. Optional. Default = 0.

Sets the allowable per subject threshold for missing alleles. Relevant to running the haplotype analysis. Effects can be disastrous on processing time for large values of missing. Missing may be set as a number or as “ignore” to skip removal and retain all subjects.

Cores.Lim

Class: Integer. Optional. Default = 1 Core.

Specifies the number of cores accessible by BIGDAWG in amino acid analysis. Not relevant to Windows operating systems which will use only a single core. More than 1 core is best when optioned in command line R. Not recommend for GUIs, e.g. RStudio.

Results.Dir

Class: String. Optional. Default = see Data Output section.

String name of a folder for BIGDAWG output. Subfolder for each locus set will be generated within any output folder specified.

Return

Class: Logical. Optional. Default = F.

Specifies if BIGDAWG should output analysis results to a specified object.

Output

Class: Logical. Optional. Default = T.

Turns on or off the writing of results to files. The default will write all results to text files in the output directory.

Verbose

Class: Logical. Optional. Default = T.

Sets the levels of detail that should be displayed on the console. The default will display summaries of the analysis from each specified test. When turned off, only the completion status of each test is displayed.

Examples

These are examples only and need not be run as defined below.


# Install the BIGDAWG package
install.packages("BIGDAWG")

# Run the full analysis using the example set bundled with BIGDAWG
BIGDAWG(Data="HLA_data")

# Run the haplotype analysis with all pairwise combinations on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="H", All.Pairwise=T)

# Run the Hardy-Weinberg and Locus analysis with non-HLA data while ignoring any missing data on a file called 'data.txt'
BIGDAWG(Data="data.txt", HLA=F, Run.Tests=c("HWE","L"), Missing="ignore")

# Run the amino acid analysis trimming data to 2-Field resolution on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="A", Trim=T, Res=2)

# Run the haplotype analysis with subsets of loci on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="H", Loci.Set=list(c("DRB1","DQB1","DPB1"),c("DRB1","DQB1")))

# Run the full analysis, minimize console output, disable write to file, output to object 'BD'
BD <- BIGDAWG(Data="data.txt", Output=F, Return=T, Verbose=F)

Updating the bundled IMGT/HLA protein alignment

For the amino acid analysis, BIGDAWG is bundled with HLA protein alignment data using theabove indicated database release. These bundled alignments can be updated to the most recent release IMGT/HLA. Future database updates do not guarantee compatability with the updating tool.

# Identify the installed and current release of the bundled IMGT/HLA database release
# Requires active internet connection.
CheckRelease()
CheckRelease(Package=F) # restricts to IMGT/HLA database versions only

# Update to the most recent IMGT/HLA database release
UpdateRelease()

# Force update
UpdateRelease(Force=T)

# Restore to original bundled update.
UpdateRelease(Restore=T)

Updating BIGDAWG to latest developmental versions

Developmental versions of BIGDAWG can be downloaded through GitHub or using the following code (requires R package ‘devtools’). These versions will include the most up-to-date bug fixes as well as access to new features that are still under development. GitHub versions are constantly under development and if you prefer a more stable fixed release, install BIGDAWG from the CRAN respository. You may check BIGDAWG versions using CheckRelease().

# Identify the installed (local), release (CRAN), and developmental (GitHub) versions of BIGDAWG.
# Requires active internet connection.
CheckRelease()
CheckRelease(Alignment=F) # restricts to BIGDAWG versions only

# If 'devtools' package installation is required for installation via GitHub.
install.packages("devtools")

# Load latest BIGDAWG version from GitHub
# May require closing and reopening of R Studio after install
library("devtools")
install_github("pappasd/BIGDAWG", build_vignettes = TRUE)

# For a temporary install of the developmental version
library("devtools")
dev_mode(on=T)
install_github("pappasd/BIGDAWG")
# .... run BIGDAWG analysis
dev_mode(on=F)

End of vignette.