To install the stable version of mcglm, use devtools::install_git(). For more information, visit mcglm/README.

library(devtools)
install_git("wbonat/mcglm")

library(mcglm)
packageVersion("mcglm")

## [1] '0.4.0'

Abstract

The mcglm package implements the multivariate covariance generalized linear models (McGLMs) proposed by Bonat and J$\o$rgensen (2016). The core fit function mcglm is employed for fitting a set of models. In this introductory vignette we restrict ourselves to model independent data, although a simple model for longitudinal data analysis in the Gaussian case is also presented. We present models to deal with continuous, binomial/bounded and count univariate response variables. We explore the specification of different link, variance and covariance functions.

Regression models for continuous data

Consider a simple regression model, for univariate and independent Gaussian data: $Y \sim N(X \beta, \tau_0 Z_0).$

# Loading extra packages
require(mcglm)
require(Matrix)
require(mvtnorm)
require(tweedie)

# Setting the seed
set.seed(2503)

# Fixed component
covariate <- seq(-1,1, l = 100)
X <- model.matrix(~ covariate)
mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL,
# Random component
y1 <- rnorm(100, mu1$mu, sd = 0.5) # Matrix linear predictor Z0 <- Diagonal(100, 1) # Data set data <- data.frame("y1" = y1, "covariate" = covariate) # Fit fit1.id <- mcglm(linear_pred = c(y1 ~ covariate), matrix_pred = list(list(Z0)), data = data)  ## Automatic initial values selected.  The mcglm package offers the following set of S3-methods for model summarize. print(methods(class = "mcglm"))  ## [1] anova coef confint fitted plot print ## [7] residuals summary vcov ## see '?methods' for accessing help and source code  The traditional summary for a fitted model can be obtained by summary(fit1.id)  ## Call: y1 ~ covariate ## ## Link function: identity ## Variance function: constant ## Covariance function: identity ## Regression: ## Estimates Std.error Z value ## (Intercept) 1.0029066 0.05277132 19.00477 ## covariate 0.9488496 0.09049311 10.48532 ## ## Dispersion: ## Estimates Std.error Z value ## 1 0.2784812 0.04214603 6.607532 ## ## Algorithm: chaser ## Correction: TRUE ## Number iterations: 4  The function summary.mcglm was designed to be similar the native summary functions for the classes lm and glm. Extra features were included to describe the dispersion structure. The mean formula call, along with the the link, variance and covariance functions are presented. The parameter estimates are presented in two blocks, the first presents the regression estimates while the second presents the dispersion estimates. In both cases the parameter estimates are summarize by point estimates, standard errors and Z-values associated with the Wald test whose null hypothesis is defined as $$\beta = 0$$ and $$\tau = 0$$ for the mean and dispersion structures, respectively. Finally, the selected algorithm, if the correction term is employed or not and the number of iterations is printed. The same linear regresion model can be fitted by using a different covariance link function, for example the inverse covariance link function. # Fit using inverse covariance link function fit1.inv <- mcglm(linear_pred = c(y1 ~ covariate), matrix_pred = list(list(Z0)), covariance = "inverse", data = data)  ## Automatic initial values selected.  Furthermore, we can use a less conventional covariance link function, as the exponential-matrix. # Fit using expm covariance link function fit1.expm <- mcglm(linear_pred = c(y1 ~ covariate), matrix_pred = list(list(Z0)), covariance = "expm", data = data)  ## Automatic initial values selected.  The function mcglm returns an object of mcglm class, for which we can use the method coef to extract the parameters estimates. # Comparing estimates using different covariance link functions cbind(coef(fit1.id)$Estimates,
coef(fit1.inv)$Estimates, coef(fit1.expm)$Estimates)

##           [,1]      [,2]       [,3]
## [1,] 1.0029066 1.0029066  1.0029066
## [2,] 0.9488496 0.9488496  0.9488496
## [3,] 0.2784812 3.5909038 -1.2784043

# Applying the inverse transformation
c(coef(fit1.id)$Estimates[3], 1/coef(fit1.inv)$Estimates[3],
exp(coef(fit1.expm)$Estimates[3]))  ## [1] 0.2784812 0.2784814 0.2784813  Consider an extension of the linear regression models to deal with heteroscedasticity: $Y \sim N(X \beta, \tau_0 Z_0 + \tau_1 Z_1),$ where $$Z_0$$ is a identity matrix and $$Z_1$$ is a diagonal matrix whose elements are given by the values of a known covariate. Such a model, can be fitted easily using the mcglm package. # Mean model covariate <- seq(-1,1, l = 100) X <- model.matrix(~ covariate) mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL, link = "identity") # Matrix linear predictor Z0 <- Diagonal(100, 1) Z1 <- Diagonal(100, c(rep(0,50),rep(1,50))) # Covariance model Sigma <- mcglm::mc_matrix_linear_predictor(tau = c(0.2, 0.15), Z = list(Z0,Z1)) y1 <- rnorm(100, mu1$mu, sd = sqrt(diag(Sigma)))
data <- data.frame("y1" = y1, "covariate" = covariate)

fit2.id <- mcglm(linear_pred = c(y1 ~ covariate),
matrix_pred = list(list(Z0,Z1)),
data = data)

## Automatic initial values selected.


We can also extend the linear regression model to deal with longitudinal data analysis. The code below presents an example of such a model.

# Mean model
covariate <- seq(-1,1, l = 200)
X <- model.matrix(~ covariate)
mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL,
# Covariance model
Z0 <- Diagonal(200, 1)
Z1.temp <- Matrix(rep(1,10)%*%t(rep(1,10)))
Z1.list <- list()
for(i in 1:20){Z1.list[[i]] <- Z1.temp}
Z1 <- bdiag(Z1.list)
Sigma <- mcglm::mc_matrix_linear_predictor(tau = c(0.2, 0.15), Z = list(Z0,Z1))

# Response variable
y1 <- as.numeric(rmvnorm(1, mean = mu1$mu, sigma = as.matrix(Sigma))) data <- data.frame("y1" = y1, "covariate" = covariate) # Fit fit3.id <- mcglm(linear_pred = c(y1 ~ covariate), matrix_pred = list("resp1" = list(Z0,Z1)), data = data)  ## Automatic initial values selected.  The model summary summary(fit3.id)  ## Call: y1 ~ covariate ## ## Link function: identity ## Variance function: constant ## Covariance function: identity ## Regression: ## Estimates Std.error Z value ## (Intercept) 0.9621971 0.07982881 12.053256 ## covariate 0.9589537 0.13667821 7.016142 ## ## Dispersion: ## Estimates Std.error Z value ## 1 0.2011307 0.02408285 8.351618 ## 2 0.1073397 0.04035983 2.659568 ## ## Algorithm: chaser ## Correction: TRUE ## Number iterations: 6  Note that, the dispersion structure now has two parameters. In that case, the parameter $$\tau_1$$ represents the longitudinal structure for which we are assuming a compound symmetry model. This model is an equivalent to a random intercept model in the context of Linear Mixed Models (LMMs). Regression models for binomial and bounded data The mcglm package offers a rich set of models to deal with binomial and bounded response variables. The logit, probit, cauchit, cloglog, and loglog link functions along with the extended binomial variance function combined with the linear covariance structure, provide a flexible class of models for handling binomial and bounded response variables. The extended binomial variance function is given by $$\mu^p (1- \mu)^q$$ where the two extra power parameters offer more flexibility to model the relationship between mean and variance. Consider the following simulated dataset. # Mean model covariate <- seq(-1,1, l = 250) X <- model.matrix(~ covariate) mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL, link = "logit") # Covariance model Z0 <- Diagonal(500, 1) # Response variable set.seed(123) y1 <- rbinom(500, prob = mu1$mu, size = 10)/10

# Data set
data <- data.frame("y1" = y1, "covariate" = covariate)


The most traditional regression model to deal with binomial data is the logistic regression model that can be fitted using the mcglm package using the following code:

# Fit
fit4.logit <- mcglm(linear_pred = c(y1 ~ covariate),
matrix_pred = list(list(Z0)),
link = "logit", variance = "binomialP",
power_fixed = TRUE,
Ntrial = list(rep(10,250)), data = data)

## Automatic initial values selected.


It is important to highlight that the response variable collumns should be between $$0$$ and $$1$$ and in the case of more than one trial the argument Ntrial should be used for fitting the model. The argument link specifies the link function whereas the argument variance specifies the variance function in that case binomialP. The variance function binomialP represents a simplification of the extended binomial variance function given by $$\mu^p (1- \mu)^p$$. Note that, in this example the argument 'power_fixed = TRUE' specifies that the power parameter $$p$$ will not be estimated, but fixed at the initial value $$p = 1$$ corresponding to the orthodox binomial variance function. We can easily fit the model using a different link function, for example the cauchit.

fit4.cauchit <- mcglm(linear_pred = c(y1 ~ covariate),
matrix_pred = list(list(Z0)),
link = "cauchit", variance = "binomialP",
Ntrial = list(rep(10,250)), data = data)

## Automatic initial values selected.


We can also estimate the extra power parameter $$p$$.

fit4.logitP <- mcglm(linear_pred = c(y1 ~ covariate),
matrix_pred = list(list(Z0)),
link = "logit", variance = "binomialP",
power_fixed = FALSE,
Ntrial = list(rep(10,250)), data = data)

## Automatic initial values selected.


Furthermore, we can estimate the two extra power parameters involved in the extended binomial variance function.

fit4.logitPQ <- mcglm(linear_pred = c(y1 ~ covariate),
matrix_pred = list(list(Z0)),
link = "logit", variance = "binomialPQ",
power_fixed = FALSE,
Ntrial = list(rep(10,250)),
control_algorithm = list(tuning = 0.5),
data = data)

## Automatic initial values selected.


The estimation of the extra power parameters involved in the extended binomial variance function is challenge mainly for small data sets. Note that, in this simulated example, we have to control the step-length of the chaser algorithm to avoid unrealistic values for the parameters involved in the covariance structure. To do that, we used the extra argument control_algorithm that should be a named list. For a detailed description of the arguments that can be passed to the control_algorithm function see ?fit_mcglm.

Regression models for count data

The analysis of count data in the mcglm package relies on the structure of the Poisson-Tweedie distribution. Such a distribution is characterized by the following dispersion functions:

$\nu(\mu, p) = \mu + \tau_0 \mu^p.$ The power parameter is an index that identify different distributions, examples include the Hermite ($$p = 0$$), Neyman-Type A ($$p = 1$$) and the negative binomial ($$p = 2$$).

The orthodox Poisson model can be fitted using the mcglm package using the Tweedie variance function $$\nu(\mu, p ) = \mu^p$$ where the power parameter $$p$$ is fixed at $$1$$. For example,

# Mean model
covariate <- seq(-2,2, l = 200)
X <- model.matrix(~ covariate)
mu <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL,

# Covariance model
Z0 <- Diagonal(200, 1)

# Data
y1 <- rpois(200, lambda = mu$mu) data <- data.frame("y1" = y1, "covariate" = covariate) # Fit fit.poisson <- mcglm(linear_pred = c(y1 ~ covariate), matrix_pred = list(list(Z0)), link = "log", variance = "tweedie", power_fixed = TRUE, data = data)  ## Automatic initial values selected.  Another very useful model for count data is the negative binomial. # Simulating negative binomial models require(tweedie) set.seed(1811) y1 <- rtweedie(200, mu = mu$mu, power = 2, phi = 0.5)
y <- rpois(200, lambda = y1)
data <- data.frame("y1" = y, "covariate" = covariate)

fit.pt <- mcglm(linear_pred = c(y ~ covariate),
matrix_pred = list(list(Z0)),
link = "log", variance = "poisson_tweedie",
power_fixed = FALSE, data = data)

## Automatic initial values selected.

summary(fit.pt)

## Call: y ~ covariate
##
## Variance function: poisson_tweedie
## Covariance function: identity
## Regression:
##             Estimates  Std.error  Z value
## (Intercept) 0.8552182 0.07044865 12.13960
## covariate   0.9285782 0.06284414 14.77589
##
## Power:
##   Estimates Std.error  Z value
## 1  2.237079 0.3461908 6.461982
##
## Dispersion:
##   Estimates Std.error  Z value
## 1 0.2711807 0.1923947 1.409502
##
## Algorithm: chaser
## Correction: TRUE
## Number iterations: 11


Note that, we estimate the power parameter rather than fix it at $$p = 2$$.