GROAN package allows to assess the performances of one or more genomic regressors when artificial noise is injected in an existing dataset. It is possible, however, to use GROAN to simply compare the prediction accuracy of different regression models without the injection of extra noise. All tests are perfomed through crossvalidations, which is managed transparently by GROAN, with the user required to input the number of folds.
The general workflow is the following:
GROAN package does not directly contain regressors, but leverages those implemented in the following packages:
If you want to test your own regressor or one taken from another package you need to write a wrapper function, as detailed here.
If you want to customize the type of injected noise you need to write a noise injector function, as detailed here.
In the rest of this document we refer to the PEA dataset to illustrate GROAN functionalities. PEA is a dataset of 103 pea lines, genotyped for 647 SNPs coming from genotyping-by-sequencing (GBS). The dataset contains also a single phenotypic trait (grain yield, measured in t/ha). The dataset is split in three variables:
library(GROAN)
#array of phenotypes
GROAN.pea.yield
#dataframe of SNP genotypes
GROAN.pea.SNPs
#dataframe of realized genotypic kinship
GROAN.pea.kinship
For more details on the data format please refer to the package help. For more details on the dataset in general please refer to:
Annicchiarico et al., GBS-Based Genomic Selection for Pea Grain Yield under Severe Terminal Drought, The Plant Genome, Volume 10.
A NoisyDataset is an object containing an instance of the original (noiseless) dataset and a reference to a noise injector function. It is created using the createNoisyDataset function. If no noise injector function is specified, no noise will be added.
#creating a NoisyDataset without any extra noise injected
nds.no_noise = createNoisyDataset(
name = 'PEA, no noise',
genotypes = GROAN.pea.SNPs,
phenotypes = GROAN.pea.yield
)
#creating a NoisyDataset adding noise sampled from a normal distribution
nds.normal_noise = createNoisyDataset(
name = 'PEA, normal noise',
genotypes = GROAN.pea.SNPs,
phenotypes = GROAN.pea.yield,
noiseInjector = noiseInjector.norm,
mean = 0,
sd = sd(GROAN.pea.yield) * 0.5
)
The noise is injected on demand each time the function getNoisyPhenotype is invoked:
#plotting the original phenotypes
plot(GROAN.pea.yield, pch=20, main = 'True (black) vs. Noisy (red)', xlab = 'Samples', ylab = 'Phenotypes')
#plotting an instance of the phenotypes with noise injected
points(getNoisyPhenotype(nds.normal_noise), col='red')
Noise injectors usually leverage a stochastic mechanism to add noise. Therefore, successive calls to getNoisyPhenotype produce (slighly) different results.
#average correlation oscillates around 0.89
cor(GROAN.pea.yield, getNoisyPhenotype(nds.normal_noise))
cor(GROAN.pea.yield, getNoisyPhenotype(nds.normal_noise))
cor(GROAN.pea.yield, getNoisyPhenotype(nds.normal_noise))
## [1] 0.8973586
## [1] 0.9076192
## [1] 0.9131571
Obviously in absence of noise no variability is present.
#no noise injector ==> the original phenotypes are returned
all(GROAN.pea.yield == getNoisyPhenotype(nds.no_noise))
## [1] TRUE
The following noise injectors are implemented in GROAN:
Function name | Effect |
---|---|
noiseInjector.dummy | No effect (used for test purposes) |
noiseInjector.norm | Injects normal (gaussian) noise |
noiseInjector.swapper | Swaps phenotypes between samples |
noiseInjector.unif | Injects uniform noise |
It is possible to extend GROAN functionalities by writing your own noise injector, as detailed here.
A Workbench is an object describing both the regressor(s) that will be tested and some configuration regarding the test itself. A Workbench object is created using the createWorkbench function. The main parameters of interest are:
In its simplest form, a Workbench can be created using all defaults. So this call:
#creating a Workbench with default values
wb = createWorkbench()
is equivalent to the following one, where all default parameters are explicitly assigned:
#creating a Workbench with default values explicitly assigned
wb = createWorkbench(
#parameters defining crossvalidation
folds = 5, reps = 10, stratified = FALSE,
#parameters defining save-on-hard-disk policy
outfolder = NULL, saveHyperParms = FALSE, saveExtraData = FALSE,
#a regressor
regressor = phenoRegressor.rrBLUP, regressor.name = 'rrBLUP'
)
Actually, the default regressor.name is “default regressor”, but it was changed in the example for clarity.
It is possible to update the Workbench object by adding other regressors using the addRegressor function:
#adding a regressor to an existing Workbench
wb = addRegressor(
#the Workbench to be updater
wb,
#the new regressor
regressor = phenoRegressor.BGLR, regressor.name = 'Bayesian Lasso',
#regressor-specific parameters
type = 'BL'
)
The previous operation can be repeated as many times as necessary. Just remember to actually update the Workbench instance, and to assign sensible names to the regressors (they will be used, later, to investigate the results).
The following regressors are implemented in GROAN:
Function name | Effect |
---|---|
phenoRegressor.BGLR | Regressors from BGLR package (Bayes A, B, C, Bayesian Lasso, G-BLUP, RKHS, …) |
phenoRegressor.dummy | A dummy function returning random values, for test/development purposes. |
phenoRegressor.RFR | Random forest regression from package randomForest (caution: very slow) |
phenoRegressor.rrBLUP | Regressor from rrBLUP package, implements SNP blup with ridge-regression-like solution |
phenoRegressor.SVR | Support Vector Regression from package e1071. It can operate in three modes: a) no training (an existing SVR object is needed), b) training (when hyperparameters are known) and c) tuning (when hyperpareters need to be optimized in a grid search, very slow) |
It is possible to extend GROAN functionalities by writing your own regressor, as detailed here
To run GROAN you need a NoisyDataset object and a Workbench object. Once you have them the actual test is executed using the GROAN.run function.
The result is a data frame containing one row per regressor times the number of repetitions. Each row represents a full crossvalidation of one regressor on the dataset. Columns contain measures of correlation between real and predicted phenotypes, measures of error, and execution time. The same data frame can be saved on hard disk by setting parameter outfolder in the Workbench. Details are reported in GROAN.run documentation.
For reporting purposes it is possible to collate together (and filter) several result data frames, as shown here:
#executing two GROAN test, same workbench, different data set
res.no_noise = GROAN.run(nds.no_noise, wb)
res.normal_noise = GROAN.run(nds.normal_noise, wb)
#putting the results together for further analysis
res.total = rbind(res.no_noise, res.normal_noise)
The main tool used for inspecting regression results is the plotResult function, which produces plots. The function leverages the ggplot2 package and returns a ggplot object. It is possible to select the performance metrix to be plotted, together with the type of plot.
#defaults is a boxplot of Pearson's correlations
p = plotResult(res.total)
print(p)
#a barplot with 95% confidence interval of Pearson's correlations
p = plotResult(res.total, plot.type = 'bar_conf95')
print(p)
#a barplot of execution times per fold, in seconds
p = plotResult(res.total, plot.type = 'bar', variable = 'time')
print(p)
The data frame returned by GROAN.run contains the basic performance evaluation operated by GROAN. Several extra information can be stored on hard disk for later examination. This is done setting the appropriate policy on data saving when creating the Workbench object. The following options are available:
Mode | How | What |
---|---|---|
Don’t save anything | outfolder=NULL | Nothing will be saved on hard disk. |
Save summary | outfolder=‘some/path’ | A subfolder will be created in the passed path, named after the run.id, containing an “accuracy.csv” file. |
…and hyperparameters | outfolder=‘some/path’ saveHyperParms=TRUE |
Same as above, but save also hyperpameters from regressor tuning in csv format. |
…and extra data | outfolder=‘some/path’ saveExtraData=TRUE |
Same as above, but save also extra data from regressor tuning in R format. |
Hyperparameters are (usually numeric) values tuned during training phase. Not all regressors require hyperparameter tuning, and the exact nature of hyperparameters depends on the regressor.
Extra data are saved using R save function, and can be accessed using the load function. One extra data bundle is save for each regressor times crossvalidation folds times experiment repetitions. Each bundle will usually contain at least the tuned model, but details depend on regressor function.
If none of the available noise injectors satisfies your needs it is possible to write and use a custom injector in your experiments. This is done in two steps: 1) write the injector 2) include it in a NoisyDataSet object.
Custom injectors need to accept at least one argument called “phenotypes”. When GROAN calls the injector this argument will be a numeric array containing the phenotypes that will be used to train the regressor(s). The function is required to return a numeric array of the same size as “phenotypes”, with its values added of noise. For the sake of arguments let’s suppose we want to investigate the effects of a bias noise acting on a half of the data, as illustrated by the following code:
#A noise injector adding a fixed bias to a random subset of about 50% of the data
my.noiseInjector.bias = function(phenotypes){
#an array of random booleans
labels = runif(length(phenotypes)) > 0.5
#adding the bias (a fixed value of 5)
phenotypes[labels] = phenotypes[labels] + 5
#returning the original phenotypes with added noise
return(phenotypes)
}
To use the injector we just defined we need to include it in the definition of a NoisyDataSet:
#A NoisyDataSet that embeds the bias noise
nds.bias_noise = createNoisyDataset(
name = 'PEA, bias noise',
genotypes = GROAN.pea.SNPs,
phenotypes = GROAN.pea.yield,
noiseInjector = my.noiseInjector.bias #the function we defined above
)
It is possible for the noise injector function to have extra arguments other than phenotypes. In the following version of the bias noise injector the amount of bias is an input from the user:
#An improved version of the above function, this time the bias is not fixed
my.noiseInjector.bias2 = function(phenotypes, bias = 5){
#an array of random booleans
labels = runif(length(phenotypes)) > 0.5
#adding the bias (from the function argument)
phenotypes[labels] = phenotypes[labels] + bias
#returning the original phenotypes with added noise
return(phenotypes)
}
The call to NoisyDataSet can now assign values to the noise injector arguments:
#A NoisyDataSet with bias noise function, using the second version of the function
nds.bias_noise2 = createNoisyDataset(
name = 'PEA, bias noise, second function',
genotypes = GROAN.pea.SNPs,
phenotypes = GROAN.pea.yield,
noiseInjector = my.noiseInjector.bias2, #the new version
bias = 20 #if omitted the default would be used
)
Please pay attention to avoid naming conflicts between the new regressor arguments and the ones from createNoisyDataset. Once the function is defined and embedded in a noisyDataset it is possible to select a regressor and test the effect of the newly defined noise.
If none of the available regressors satisfies your needs it is possible to write a custom regressor for your experiments. This is done in two steps: 1) write the regressor function 2) include it in a WorkBench object.
Please understand that the regressor will be called in a crossvalidation procedures. It will thus have access to the complete genotypes (SNPs and/or covariances) and covariates, but the phenotype array will contain some missing values (NAs). This are the values to be predicted.
Each custom regressor function needs to accept the following input arguments:
Checks on input (type, size) are not required, since these functions are supposed to be used in a GROAN experiment, where checks are done when NoisyDataset and Workbench are created. Please note that the variables will reflect what is stored in the NoisyDataSet object. The function should return a list with the following fields:
The following snippet of code shows the implementation of phenoRegressor.dummy function. The “predictions” are actually random values, with genotypes, covariances and extra covariates being ignored. It is useful to show the required function signature.
phenoRegressor.dummy = function (phenotypes, genotypes, covariances, extraCovariates){
#no operation is actually required
return(list(
predictions = runif(length(phenotypes)), #predictions
hyperparams = c(some = 1, params='a'),
extradata = 'filler extra data for phenoRegressor.dummy'
))
}
The function above can be used as regressor argument for both createWorkbench and addRegressor functions.