quickReg

Xikun Han, hanxikun2014@163.com

2016-08-27

A manual introducing the R package quickReg.

Introduction

The quickReg package provides a set of functions for describing and exploring a dataset. Specifically, it can produce descriptive statistics for each variable and build univariate regression models (lm, glm and Cox regression) for specified variables. It also provides functions to summarize and plot these regressions. The examples below illustrate the workflow.

Getting started

The example data is a hypothetical dataset extracted as a subset from the PredictABEL package. It has no practical implications and is used only to demonstrate the main ideas of the package.

# If you haven't installed the package, you can install it from CRAN

# install.packages("quickReg")

library(quickReg)

# Load the dataset

data(diabetes)

# Show the first 6 rows of the data

head(diabetes)
##   sex age smoking education diabetes BMI systolic diastolic CFHrs1061170 LOCrs10490924 CFHrs1410996 C2rs9332739 CFBrs641153 CFHrs2230199
## 1   1  44       1         0        1  40      129        91            1             2            2           1           1            0
## 2   0  53       0         0        0  29      137        98            2             1            1           1           0            0
## 3   1  46       1         0        0  29      136        93            1             1            2           1           1            1
## 4   1  63       0         0        0  29      176       119            1             0            1           1           0            0
## 5   0  60      NA         0        1  30      148       107            1             2            1           1           0            2
## 6   0  52       0         1        1  29      133        91            1             1            1           1           1            0

We can use the function display to show descriptive statistics for the data.

show_data<-display(diabetes)

# We can access the results by index or by variable name

show_data[1:2]
## $sex
## $sex$split_line
## [1] "================================================================================"
## 
## $sex$table
##                 0       1
## count     572.000 428.000
## propotion   0.572   0.428
## 
## 
## $age
## $age$split_line
## [1] "================================================================================"
## 
## $age$summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.00   48.75   58.00   58.98   67.00   94.00 
## 
## $age$describe
##     n  mean    sd median trimmed   mad min max range skew kurtosis   se
##  1000 58.98 13.27     58   58.32 13.34  33  94    61 0.38    -0.38 0.42
## 
## $age$normality
## [1] "Shapiro-Wilk normality test, statistic = 0.97864, p-value = 5.976e-11"
show_data$BMI
## $split_line
## [1] "================================================================================"
## 
## $summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    22.0    28.0    31.0    31.1    34.0    43.0       6 
## 
## $describe
##    n mean   sd median trimmed  mad min max range skew kurtosis   se
##  994 31.1 3.79     31   30.98 4.45  22  43    21 0.32    -0.18 0.12
## 
## $normality
## [1] "Shapiro-Wilk normality test, statistic = 0.9852, p-value = 1.724e-08"
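
The per-variable output above can be approximated with base R alone. The following is a minimal sketch, assuming a small simulated data frame (`toy`, hypothetical) rather than the `diabetes` data. `describe_var` is an illustrative helper, not a quickReg function, and the package's own implementation may differ.

```r
# Hypothetical toy data; not the diabetes dataset shipped with quickReg
set.seed(1)
toy <- data.frame(
  sex = factor(sample(0:1, 200, replace = TRUE)),
  age = round(rnorm(200, mean = 59, sd = 13))
)

# Illustrative helper: counts and proportions for factors,
# summary statistics and a Shapiro-Wilk test for numeric variables
describe_var <- function(x) {
  if (is.factor(x)) {
    counts <- table(x)
    out <- rbind(count = as.numeric(counts),
                 proportion = as.numeric(counts) / sum(counts))
    colnames(out) <- names(counts)
    out
  } else {
    list(summary = summary(x), normality = shapiro.test(x))
  }
}

toy_description <- lapply(toy, describe_var)
toy_description$sex
toy_description$age$normality$p.value
```

As with `show_data`, the result is a named list, so individual variables can be retrieved by index or by name.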

Build regression models

# Apply univariate regression models

reg_glm<-reg(data = diabetes, y = 5, factor = c(1, 3, 4), model = 'glm')

# reg_glm has two components: the detailed regression models and a condensed data frame

# We can show the detailed information with reg_glm$detail or detail(reg_glm)

reg_glm$detail$BMI
## $split_line
## [1] "================================================================================"
## 
## $summary
## 
## Call:
## glm(formula = y ~ x_one, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6991  -0.6617  -0.6435  -0.6142   1.9037  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.81164    0.67066  -1.210    0.226
## x_one       -0.02055    0.02153  -0.955    0.340
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 967.02  on 993  degrees of freedom
## Residual deviance: 966.10  on 992  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 970.1
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## $`OR(95%CI)`
##                           2.5 %   97.5 %
## (Intercept) 0.4441276 0.1193405 1.658213
## x_one       0.9796557 0.9388467 1.021610
# To show the condensed data frame: reg_glm$dataframe or dataframe(reg_glm)

dataframe(reg_glm)
##             term      estimate   std.error   statistic      p.value        OR    OR.low  OR.high
## 2           sex1 -0.0995619364 0.163419266 -0.60924234 5.423638e-01 0.9052339 0.6555804 1.244991
## 4            age -0.0016515166 0.006083056 -0.27149453 7.860107e-01 0.9983498 0.9864257 1.010256
## 6       smoking1  0.2203884367 0.171356638  1.28613889 1.983946e-01 1.2465608 0.8917694 1.747266
## 8     education1  0.0072440035 0.169823173  0.04265615 9.659756e-01 1.0072703 0.7191236 1.400591
## 10           BMI -0.0205541093 0.021530295 -0.95465990 3.397497e-01 0.9796557 0.9388467 1.021610
## 12      systolic -0.0001758354 0.004399858 -0.03996388 9.681219e-01 0.9998242 0.9911130 1.008379
## 14     diastolic -0.0010196342 0.007323325 -0.13923104 8.892676e-01 0.9989809 0.9845762 1.013284
## 16  CFHrs1061170  0.1648181445 0.108731134  1.51583211 1.295618e-01 1.1791787 0.9534430 1.460814
## 18 LOCrs10490924  0.6243454613 0.112922906  5.52895320 3.221473e-08 1.8670235 1.4977946 2.332986
## 20  CFHrs1410996  0.3154310240 0.128347280  2.45763699 1.398545e-02 1.3708501 1.0705744 1.771825
## 22   C2rs9332739  1.0717936770 0.433256076  2.47381107 1.336804e-02 2.9206134 1.3543217 7.626019
## 24   CFBrs641153  0.1993582016 0.253688461  0.78583866 4.319620e-01 1.2206191 0.7567549 2.055336
## 26  CFHrs2230199  0.3402726917 0.125293121  2.71581303 6.611324e-03 1.4053308 1.0974578 1.794700
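
In the table above, each OR is the exponential of the estimate on the same row; for BMI, `exp(-0.0205541093)` is approximately 0.9796557. The idea of fitting one model per predictor can be sketched with base R as follows. This is only an illustration under stated assumptions: the data (`toy`) are simulated, `fit_one` is a hypothetical helper, and the Wald interval used here is an assumption, so quickReg may compute its limits differently.

```r
# Hypothetical toy data with a binary outcome; not the diabetes dataset
set.seed(42)
n <- 500
toy <- data.frame(
  bmi = rnorm(n, mean = 31, sd = 4),
  age = rnorm(n, mean = 59, sd = 13),
  smoking = factor(rbinom(n, size = 1, prob = 0.3))
)
toy$outcome <- rbinom(n, size = 1, prob = plogis(-1 + 0.03 * (toy$bmi - 31)))

# Fit one logistic model per predictor and keep the non-intercept terms
fit_one <- function(predictor, data, outcome = "outcome") {
  fit <- glm(reformulate(predictor, response = outcome),
             data = data, family = binomial(link = "logit"))
  coefs <- summary(fit)$coefficients[-1, , drop = FALSE]
  data.frame(
    term = rownames(coefs),
    estimate = coefs[, "Estimate"],
    std.error = coefs[, "Std. Error"],
    p.value = coefs[, "Pr(>|z|)"],
    OR = exp(coefs[, "Estimate"]),
    # Wald interval (an assumption; other methods give slightly different limits)
    OR.low = exp(coefs[, "Estimate"] - qnorm(0.975) * coefs[, "Std. Error"]),
    OR.high = exp(coefs[, "Estimate"] + qnorm(0.975) * coefs[, "Std. Error"]),
    row.names = NULL
  )
}

predictors <- c("bmi", "age", "smoking")
univariate <- do.call(rbind, lapply(predictors, fit_one, data = toy))
univariate
```

Each predictor contributes one row per non-intercept term, so a two-level factor such as `smoking` appears as the single term `smoking1`.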
# Linear models and Cox regression models are also available

reg_lm<-reg(data = diabetes, x = c(1:6,8:12), y = 7, factor = c(1, 3, 4), model = 'lm')

# Use variable names
reg_coxph<-reg(data = diabetes, y = "diabetes", time = "age", factor = c("sex", "smoking", "education"), model = 'coxph')


# display can also be applied to a reg object to summarize the univariate models

display(reg_glm)
## 
## Call:
## reg(data = diabetes, y = 5, factor = c(1, 3, 4), model = "glm")
## 
## Number of variables: 13  
## Number of terms: 13  
## Number of significant terms(alpha=0.05): 4   
## 
## Cumulative number of terms:
## 
##           number of terms
## p < 0.001               1
## p < 0.01                2
## p < 0.05                4
## p < 0.1                 4
## p < 1                  13
## 
## 
## p < 0.001:  LOCrs10490924 
## 
## 
## p < 0.01:  LOCrs10490924, CFHrs2230199 
## 
## 
## p < 0.05:  LOCrs10490924, CFHrs1410996, C2rs9332739, CFHrs2230199 
## 
## 
## p < 0.1:  LOCrs10490924, CFHrs1410996, C2rs9332739, CFHrs2230199 
## 
## 
## p < 1:  sex1, age, smoking1, education1, BMI, systolic, diastolic, CFHrs1061170, LOCrs10490924, CFHrs1410996, C2rs9332739, CFBrs641153, CFHrs2230199
display(reg_lm)
## 
## Call:
## reg(data = diabetes, x = c(1:6, 8:12), y = 7, factor = c(1, 3,     4), model = "lm")
## 
## Number of variables: 11  
## Number of terms: 11  
## Number of significant terms(alpha=0.05): 4   
## 
## Cumulative number of terms:
## 
##           number of terms
## p < 0.001               3
## p < 0.01                3
## p < 0.05                4
## p < 0.1                 4
## p < 1                  11
## 
## 
## p < 0.001:  age, BMI, diastolic 
## 
## 
## p < 0.01:  age, BMI, diastolic 
## 
## 
## p < 0.05:  sex1, age, BMI, diastolic 
## 
## 
## p < 0.1:  sex1, age, BMI, diastolic 
## 
## 
## p < 1:  sex1, age, smoking1, education1, diabetes, BMI, diastolic, CFHrs1061170, LOCrs10490924, CFHrs1410996, C2rs9332739
display(reg_coxph)
## 
## Call:
## reg(data = diabetes, y = "diabetes", factor = c("sex", "smoking",     "education"), model = "coxph", time = "age")
## 
## Number of variables: 12  
## Number of terms: 12  
## Number of significant terms(alpha=0.05): 6   
## 
## Cumulative number of terms:
## 
##           number of terms
## p < 0.001               2
## p < 0.01                4
## p < 0.05                6
## p < 0.1                 6
## p < 1                  12
## 
## 
## p < 0.001:  systolic, LOCrs10490924 
## 
## 
## p < 0.01:  BMI, systolic, LOCrs10490924, CFHrs2230199 
## 
## 
## p < 0.05:  BMI, systolic, LOCrs10490924, CFHrs1410996, C2rs9332739, CFHrs2230199 
## 
## 
## p < 0.1:  BMI, systolic, LOCrs10490924, CFHrs1410996, C2rs9332739, CFHrs2230199 
## 
## 
## p < 1:  sex1, smoking1, education1, BMI, systolic, diastolic, CFHrs1061170, LOCrs10490924, CFHrs1410996, C2rs9332739, CFBrs641153, CFHrs2230199
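
The cumulative counts reported by `display(reg_glm)` can be reproduced from the p-values printed by `dataframe(reg_glm)`. The sketch below copies those p-values by hand; the object names (`p_values`, `thresholds`, `cumulative`) are illustrative and not part of quickReg.

```r
# p-values copied from the dataframe(reg_glm) output shown earlier
p_values <- c(
  sex1 = 5.423638e-01, age = 7.860107e-01, smoking1 = 1.983946e-01,
  education1 = 9.659756e-01, BMI = 3.397497e-01, systolic = 9.681219e-01,
  diastolic = 8.892676e-01, CFHrs1061170 = 1.295618e-01,
  LOCrs10490924 = 3.221473e-08, CFHrs1410996 = 1.398545e-02,
  C2rs9332739 = 1.336804e-02, CFBrs641153 = 4.319620e-01,
  CFHrs2230199 = 6.611324e-03
)

thresholds <- c(0.001, 0.01, 0.05, 0.1, 1)

# Number of terms whose p-value falls below each threshold
cumulative <- sapply(thresholds, function(alpha) sum(p_values < alpha))
names(cumulative) <- paste("p <", thresholds)
cumulative

# Names of the terms significant at alpha = 0.05
names(p_values)[p_values < 0.05]
```

The counts are 1, 2, 4, 4 and 13, matching the "Cumulative number of terms" table for `reg_glm`.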

Plot regression models

# The `quickReg` package provides forest plots for univariate regression models

plot(reg_glm)

# One OR value is much larger than the others, so we can set the limits
plot(reg_glm,limits=c(NA,3))

plot(reg_glm,limits=c(1,2))

# Sort the variables alphabetically

plot(reg_glm,limits=c(NA,3), sort ="alphabetical")

# Similarly, we can plot lm and cox regression results

plot(reg_lm,limits=c(-2,5))

plot(reg_coxph,limits=c(0.5,2))

# The result of plot.reg can be modified like a ggplot2 object, e.g. with themes from package `ggthemes`
library(ggplot2);library(ggthemes)
plot(reg_coxph,limits=c(0.5,2))+
  # The plotted object is reg_coxph, so the title names the Cox model
  labs(title = "Cox Regression Model", x = "variables")+
  theme_classic() %+replace% 
  theme(legend.position ="none",axis.text.x=element_text(angle=45,size=rel(1.5)))
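
To make the idea of a forest plot with capped limits concrete, here is a minimal base-graphics sketch using a few rows copied from the `dataframe(reg_glm)` output. It is not how `plot.reg` is implemented; the object names (`forest`, `x_limits`) are illustrative, and the way quickReg handles intervals that extend beyond the limits may differ.

```r
# A few rows copied from the dataframe(reg_glm) output shown earlier
forest <- data.frame(
  term = c("BMI", "LOCrs10490924", "CFHrs1410996", "C2rs9332739"),
  OR = c(0.9796557, 1.8670235, 1.3708501, 2.9206134),
  OR.low = c(0.9388467, 1.4977946, 1.0705744, 1.3543217),
  OR.high = c(1.021610, 2.332986, 1.771825, 7.626019)
)

# Cap the axis at 3 so the wide C2rs9332739 interval does not dominate
x_limits <- c(min(forest$OR.low), 3)
rows <- seq_len(nrow(forest))

plot(forest$OR, rows, xlim = x_limits, pch = 15, yaxt = "n",
     xlab = "OR (95% CI)", ylab = "")
segments(forest$OR.low, rows, forest$OR.high, rows)  # clipped at the axis limit
abline(v = 1, lty = 2)                               # reference line: no association
axis(2, at = rows, labels = forest$term, las = 1, cex.axis = 0.7)
```

Only the C2rs9332739 interval (upper limit 7.626019) extends past the cap of 3, which is why setting an upper limit makes the remaining intervals easier to compare.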

Perspective

The quickReg package provides a flexible and convenient way to display data and the associations between variables. This vignette offers a glimpse of its use and features; the source code and help files give more detail. The package is under ongoing development. Seamless subgroup analysis, more regression types and adjusted models may become available in the future. Please contact me with any comments, questions or bug reports.