FIT

Koji Iwayama

2016-11-24

Introduction

FIT is a software package for integration of transcriptome data of samples in the field and meteorological data by modeling their relation. This software defines a statistical model of transcriptomes and provides an efficient method for training the model and for transcriptome prediction of unsequenced samples. Given the attributes of samples and meteorological data, this software predicts expression of a gene as \[ \hat{\boldsymbol s}=\beta_0 + \boldsymbol X \boldsymbol \beta, \] where \(\hat{\boldsymbol s}\) is the predictions of \(\log_2\)-transformed values of the normalized expression levels, and \(\boldsymbol \beta_0\) is a constant. Design matrix \(\boldsymbol X\) consists of the plant’s age and the genotype of the samples, the circadian clock, the response to environmental stimuli, and the interactions between the age and the clock and the age and the environmental response.

The plant’s age is the vector of the number of days after transplanting; it is scaled to have the mean of \(0\) and standard deviation of \(1\). The circadian clock is represented by the linear combination of the cosine and sine curves with a \(24~\mathrm{hr}\) period. The response to environmental stimuli is the cumulative sum of nonlinearly transformed environmental stimuli during a given time period.

The model is specified by a set of regression coefficients and other parameters that are used for transformation of meteorological data into the input variables for regression of the expectation values for the gene expression. Optimization of regression coefficients \(\beta_0\) and \(\boldsymbol \beta\) and variable selection are simultaneously performed using an adaptive group lasso (Wang and Leng (2008)). Thus, this software explores the regression coefficients minimizing following cost function: \[ \left(\hat{\boldsymbol s} - \boldsymbol s\right)^T \left(\hat{\boldsymbol s} - \boldsymbol s\right) +\lambda\left(\sum_{k\in \{d, r, dr, n\}}\zeta_k|\beta_k|+\zeta_{c}\sqrt{\beta_{cos}^2+\beta_{sin}^2}+\zeta_{dc}\sqrt{\beta_{dcos}^2+\beta_{dsin}^2}\right), \] where \(\boldsymbol s\) is the observed \(\log_2\)-transformed values of the normalized expressions, \(\lambda\) is the regularization parameter, and \(\zeta_j\) is the adaptive weight for penalizing each covariate. The values of parameters \(\lambda\) and \(\zeta_j\) are automatically selected in the software. Regression coefficients, \(\beta_d\), \(\beta_r\), \(\beta_{dr}\), \(\beta_{n}\), \(\beta_{cos}\), \(\beta_{sin}\), \(\beta_{dcos}\), and \(\beta_{dsin}\) correspond to the plant’s age, the response to environmental stimuli, their interaction, the genotype, the cosine and sine components of the circadian clock, and those of the interaction between the age and the circadian clock. Parameters related to the transformation of meteorological data are optimized by means of the Nelder-Mead algorithm (Nelder and Mead (1965)).

More details of the model is given in later sections or see the article by Iwayama et al. (2016).

Installation

FIT can be easily installed from CRAN by typing the following command in an R session:

install.packages('FIT')

To install on Windows, the INSTALL_opts option is required as follows:

install.packages("FIT", INSTALL_opts = "--no-multiarch")

Getting Started

To load the FIT package, enter the following command in an R session:

requireNamespace('FIT')

Here, using requireNamespace() to load the package and calling its API function with namespace qualifier FIT:: rather than loading via library() are recommended to avoid namespace contamination because the FIT package exports fairly ubiquitous names such as optim and predict as its API.

First, typical flow of the training of the model is shown below. Before starting, we need to load the attributes of samples, the meteorological data, and the expression data.

train.attribute.file <- system.file('extdata', 'train.attribute', package='FIT')
train.weather.file <- system.file('extdata', 'train.weather', package='FIT')
train.expression.file <- system.file('extdata', 'train.expression', package='FIT')

training.attribute  <- FIT::load.attribute(train.attribute.file);
## # Reading attribute data..done.
training.weather    <- FIT::load.weather(train.weather.file, 'weather',
                                        c('temperature', 'radiation'))
## # Reading weather data..done.
training.expression <- FIT::load.expression(train.expression.file, 'ex', 
                                           c('Os12g0189300', 'Os02g0724000', 'Os02g0139700', 'Os06g0133200'))
## # Reading expression data..done.

The first argument of these functions is the path of a file. If the file is a loadable .Rdata, then the name of a dataframe object in an .Rdata is specified by the second argument. Otherwise, data are loaded by dget() in the function. The third arguments of FIT::load.weather() and FIT::load.expression designate an array of weather factors to be taken into account during the construction of models and genes to be loaded, respectively. When we want to load all items or genes from the data, these arguments can be skipped.

Because the optimization by the Nelder-Mead algorithm depends on its initial values, it is desirable to select better initial model parameters. The FIT package offers a way to select the initial model parameters by means of a grid search. A grid of a parameter is specified by a list, where each element is a candidate value of the corresponding parameter variable. The following is an example of specification of a grid.

grid.coords <- list(
  env.temperature.threshold = c(10, 15, 20, 25, 30),
  env.temperature.amplitude = c(-100/30, -1/30, 1/30, 100/30),
  env.radiation.threshold = c(1, 10, 20, 30, 40),
  env.radiation.amplitude = c(-100/80, -1/80, 1/80, 100/80),  
  env.temperature.period = c(10, 30, 90, 270, 720, 1440, 1440*3),
  env.radiation.period = c(10, 30, 90, 270, 720, 1440, 1440*3),
  gate.temperature.phase = seq(0, 23*60, 1*60),
  gate.radiation.phase = seq(0, 23*60, 1*60),
  gate.temperature.threshold = cos(pi*seq(4,24,4)/24),
  gate.radiation.threshold = cos(pi*seq(4,24,4)/24),
  gate.temperature.amplitude = c(-5, 5),
  gate.radiation.amplitude = c(-5, 5)
)

The training of the model parameters consists of three stages: initialization of the model parameters, optimization of the parameters other than the regression coefficients, and fixation of the regression coefficients. Users can configure each stage of the training via a custom data structure recipe. A recipe can be constructed by the function FIT::make.recipe().

recipe <- FIT::make.recipe(c('temperature', 'radiation'), 
                           init = 'gridsearch',
                           optim = c('lm'),
                           fit = 'fit.lasso',
                           init.data = grid.coords,
                           time.step = 10)

The first argument specifies weather factors to be taken into account, i.e., information on temperature and radiation are used in this sample. This recipe configures the following procedure. At the first stage, the initial value of the model parameters is selected from grid points grid.coords via a grid search. At the second stage, the parameters are optimized by the Nelder-Mead algorithm. The regression coefficients are optimized by linear regression rather than the adaptive group lasso at this stage. After the optimization of the model parameters other than the regression coefficients at the second stage, the regression coefficients are fixed by the adaptive group lasso.

Using the recipe, we can train the model by means of the following code:

models <- FIT::train(training.expression,
                     training.attribute,
                     training.weather,
                     recipe)
## # * Training..
## # ** Prep+Init:
## # Prep (grids)
## # - D, N8, C
## # - E(temperature)
## # - E(radiation)
## # Init (grid search)
## # - init params for temperature
## # - init params for radiation
## # Prep (grids)
## # - D, N8, C
## # - E(temperature)
## # - E(radiation)
## # Init (grid search)
## # - init params for temperature
## # - init params for radiation
## # ** Optim (lm):
## # *** Lm:
## # optimizing Os12g0189300 
## # | temperature o | radiation o | => ( temperature , 740.1881 )
## # optimizing Os02g0724000 
## # | temperature o | radiation o | => ( temperature , 107.7171 )
## # optimizing Os02g0139700 
## # | temperature o | radiation o | => ( radiation , 65.74697 )
## # optimizing Os06g0133200 
## # | temperature o | radiation o | => ( temperature , 309.2262 )
## # ** Creating optimized models
## # Done (training)

Because function FIT::train() returns a list of lists of the trained models, it is convenient to simplify it to the list of the models by means of unlist().

models <- unlist(models)

Using the trained models, we can predict gene expression in unsequenced samples on the basis of the attributes of samples and the meteorological data.

prediction.attribute.file <- system.file('extdata', 'prediction.attribute', package = 'FIT')
prediction.weather.file <- system.file('extdata', 'prediction.weather', package = 'FIT')

prediction.attribute  <- FIT::load.attribute(prediction.attribute.file);
## # Reading attribute data..done.
prediction.weather    <- FIT::load.weather(prediction.weather.file, 'weather',
                                            c('temperature', 'radiation'))
## # Reading weather data..done.
prediction <- FIT::predict(models, prediction.attribute, prediction.weather)

To evaluate prediction accuracy, the software contains function FIT::prediction.errors(), which returns a list of the sum of squared errors.

prediction.expression.file <- system.file('extdata', 'prediction.expression', package = 'FIT')

prediction.expression <- FIT::load.expression(prediction.expression.file, 'ex', 
                                            c('Os12g0189300', 'Os02g0724000', 'Os02g0139700', 'Os06g0133200'))
## # Reading expression data..done.
prediction.errors <- FIT::prediction.errors(models,
                                           prediction.expression,
                                           prediction.attribute,
                                           prediction.weather)

FIT::predict() returns the list of predicted expression levels. An object representing the expression data holds the data as rawdata. The code for plotting the predicted and observed expression is shown below.

for(i in 1:length(prediction)){
  plot(prediction[[i]], prediction.expression$rawdata[,i], 
      xlab='prediction', ylab='observation')
  title(models[[i]]$gene)
}

Details of the model

As mentioned above, package FIT predicts gene expression levels using the following equation: \[ \hat{\boldsymbol s}=\beta_0 + \boldsymbol X \boldsymbol \beta. \] Regression coefficients \(\beta_0\) and \(\boldsymbol\beta\) are present as variable coef of the S4 object representing the model whose list is returned by function FIT::train().

models[[1]]$coefs
##               intercept                coef.age           coef.genotype 
##              13.2440143              -0.3018304              -0.8119858 
##          coef.clock.cos          coef.clock.sin       coef.ageClock.cos 
##              -0.6453803               2.2307285               0.0000000 
##       coef.ageClock.sin    coef.env.temperature coef.ageEnv.temperature 
##               0.0000000              -3.6344330              -0.9595616

Here, intercept is \(\beta_0\) and the remaining elements are those of \(\boldsymbol\beta\). Design matrix \(\boldsymbol X\) is constructed as \[ \boldsymbol X=\left( \boldsymbol d, \boldsymbol n, \boldsymbol c^{cos}, \boldsymbol c^{sin}, \boldsymbol r, \boldsymbol d\circ\boldsymbol c^{cos}, \boldsymbol d\circ\boldsymbol c^{sin}, \boldsymbol d\circ\boldsymbol r \right). \] Here, \(\boldsymbol a\circ\boldsymbol b\) means an element-wise product of two vectors \(\boldsymbol a\) and \(\boldsymbol b\).

The plant’s age \(\boldsymbol d\) is the vector of the numbers of days after transplanting scaled to have the mean of \(0\) and standard deviation of \(1\). Each element of vector \(\boldsymbol n\) indicates a genotype of a smaple. Elements coef.age and coef.genotype in coefs represent regression coefficients of the plant’s age and genotype, respectively.

The circadian clock in sample \(j\) is represented by the cosine and sine curves with a \(24~\mathrm{hr}\) period as \[ c^{cos}_j=\frac{\cos\left(2\pi\left(t_j\right)/24\right)}{2},\\ c^{sin}_j=\frac{\sin\left(2\pi\left(t_j-\varphi\right)/24\right)}{2}, \] where \(t_j\) is the time when the sample \(j\) was obtained. The regression coefficients of these two curves are coef.clock.cos and coef.clock.sin, respectively. The linear combination of these two curves is equal to the cosine curve, that is, \[ \beta_{cos}c^{cos}_j+\beta_{sin}c^{sin}_j=\sqrt{\beta_{cos}^2+\beta_{sin}^2} \frac{\cos\left(2\pi t_j-\arg\left(\beta_{cos}+i\beta_{sin}\right)\right)}{2}. \] Here, \(\arg\left(\beta_{cos}+i\beta_{sin}\right)\) is the gene specific phase of the circadian clock.

Through training, FIT selects the best environmental factor to explain the variation of gene expression. The selected environmental factor is represented by the variable env of the model object. The response to environmental stimuli is the cumulative sum of an environmental stimulus during a given period \(p\), that is, \[ r=\sum^t_{T=t-p}g(T)f(w_T-\theta). \] Here \(g(T)\) is a gate function that represents a diurnal change in a sensitivity to environmental stimuli. \(f(\cdot)\) is a response function that characterizes the type of response to stimuli. Parameters \(w_T\) and \(\theta\) represent the value of a meteorological parameter at time \(T\) and the response threshold, respectively. The parameters related to the response are contained in the model object as the variable params.

models[[1]]$params
##     env.temperature.period  env.temperature.amplitude 
##                242.6962611                105.3081528 
##  env.temperature.threshold     gate.temperature.phase 
##                 -0.9378069               1084.0998195 
## gate.temperature.amplitude gate.temperature.threshold 
##                  7.4123195                  0.1681979

Here, env.temperature.period and env.temperature.threshold are period \(p\) and threshold \(\theta\), respectively. The term between two “.” in the names represents env, that is, which environmental factor the model responds to. For instance, the model in question responds to temperature.

The gate function is defined as \[ g(T)= \frac{ \tanh\left( \exp\left(\gamma_g\right) \left( \cos\left( 2\pi\left(T-\psi\right)/24 \right) -\theta_g \right) \right) -\tanh\left( \exp\left(\gamma_g\right)\left(-1-\theta_g\right) \right) } { \tanh\left( \exp\left(\gamma_g\right)\left(1-\theta_g\right) \right) -\tanh\left( \exp\left(\gamma_g\right)\left(-1-\theta_g\right) \right) }, \] where \(\psi_g\) determines at what time of day the gene is most sensitive to environmental stimuli, and \(\gamma_g\) and \(\theta_g\) control the shape and the opening length of the gate, respectively. A smaller value of \(\theta_g\) results in longer time of opening of the gate. The shape of this function becomes approximately rectangular with a smaller value of \(\gamma_g\) and becomes a cosine curve with a larger value of \(\gamma_g\). In params, \(\psi_g\), \(\gamma_g\), and \(\theta_g\) are present as gate.*.phase, gate.*.amplitude, and gate.*.threshold, respectively (“*" is an environmental factor).

We can consider two types of the response functions. One type responds to environmental stimuli if and only if it is greater than the threshold. On the other hand, the other type responds to stimuli smaller than the threshold. These two types of the response functions are defined as \[ f_{p}(x)=\max\left(0, \tanh\left(\exp\left(\gamma_f\right)x\right)\right)\sqrt{\exp\left(-2\gamma_f\right)+1},\\ f_{n}(x)=\max\left(0, \tanh\left(-\exp\left(\gamma_f\right)x\right)\right)\sqrt{\exp\left(-2\gamma_f\right)+1}. \] Here, \(f_{p}(x)\) is the former type, and \(f_{n}(x)\) is the latter type of the response function. The better type of the response function is chosen at the stage of the optimization of the parameters. It \(f_{p}(x)\) is chosen, the value of response.type of the model object is \(1\). Otherwise, it is \(-1\). As \(\gamma_f\) approaches minus infinity, the response approaches a dose-dependent response. Conversely, the response approaches a dose-independent response in the limit \(\gamma_f\rightarrow\infty\). Element params$env.*.amplitude represents \(\gamma_f\).

During training, FIT normalizes the values of meteorological data of each environmental factor to have the mean of \(0\) and standard deviation of \(1\) as the plant’s age. The mean values and standard deviations of raw data are held in input.mean and input.sd of the model object.

Application to RNA-Seq data

FIT assumes that the observed expression conforms to a log-normal distribution to which microarray data can be fitted well. RNA-Seq, which is also a widely-used technology for quantification of the transcriptome, is discrete in nature and modeled by the negative binomial distribution. To apply FIT to RNA-Seq data, we can use a precision weight method as in voom (Law et al. (2014)). FIT contains function FIT::load.weight() to load the weight matrix.

rna.seq.file <- system.file('extdata', 'rna-seq', package='FIT')
weight <- FIT::load.weight(rna.seq.file, 'weights', genes)
## # Reading weight data..done.

For the detailed procedure of construction of precision weights, see the articles by Law et al. (2014) and Iwayama et al. (2016).

FIT uses the log-counts per million (log-cpm) values as gene expression data.

training.expression.rnaseq <- FIT::load.expression(rna.seq.file, 'log.cpm', genes)
## # Reading expression data..done.

The recipe can be configured as in the cases above.

recipe.rnaseq <- FIT::make.recipe(c('temperature', 'radiation'), 
                           init = 'gridsearch',
                           optim = c('lm'),
                           fit = 'fit.lasso',
                           init.data = grid.coords,
                           time.step = 10)

To associate precision weight with expression data, we need to specify the weight object for argument weight of function FIT::train().

models.rnaseq <- unlist(FIT::train(training.expression.rnaseq,
                             training.attribute,
                             training.weather,
                             recipe.rnaseq,
                             weight))
## # * Training..
## # ** Prep+Init:
## # Prep (grids)
## # - D, N8, C
## # - E(temperature)
## # - E(radiation)
## # Init (grid search)
## # - init params for temperature
## # - init params for radiation
## # Prep (grids)
## # - D, N8, C
## # - E(temperature)
## # - E(radiation)
## # Init (grid search)
## # - init params for temperature
## # - init params for radiation
## # ** Optim (lm):
## # *** Lm:
## # optimizing Os03g0197000 
## # | temperature o | radiation o | => ( temperature , 986.6684 )
## # optimizing Os01g0892600 
## # | temperature o | radiation o | => ( temperature , 975.3573 )
## # optimizing Os07g0630800 
## # | temperature o | radiation o | => ( radiation , 2502.26 )
## # optimizing Os01g0700100 
## # | temperature o | radiation o | => ( temperature , 10670.51 )
## # ** Creating optimized models
## # Done (training)
prediction.rnaseq <- FIT::predict(models.rnaseq, prediction.attribute, prediction.weather)

for(i in 1:length(prediction.rnaseq)){
  plot(prediction.rnaseq[[i]], prediction.expression2$rawdata[,i], 
      xlab='prediction', ylab='observation')
  title(models.rnaseq[[i]]$gene)
}

References

Iwayama, Koji, Yuri Aisaka, Natsumaro Kutsuna, and Atsushi J Nagano. 2016. “FIT:statistical modeling tool for transcriptome dynamics under fluctuating field conditions.”

Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15 (2): R29.

Nelder, John A, and Roger Mead. 1965. “A Simplex Method for Function Minimization.” The Computer Journal 7 (4): 308–13.

Wang, Hansheng, and Chenlei Leng. 2008. “A note on adaptive group lasso.” Computational Statistics and Data Analysis 52 (12): 5277–86.