MachineShop: Machine Learning Models and Tools

CRAN_Status_Badge

Overview

MachineShop is a meta-package for statistical and machine learning with a common interface for model fitting, prediction, performance assessment, and presentation of results. Support is provided for predictive modeling of numerical, categorical, and censored time-to-event outcomes, including those listed in the table below, and for resample (bootstrap, cross-validation, and split training-test sets) estimation of model performance.

Response Variable Types

Method

Constructor

Categorical1

Continuous2

Survival3

Bagging with Classification Trees

AdaBagModel

f

Boosting with Classification Trees

AdaBoostModel

f

Bayesian Additive Regression Trees

BARTMachineModel

b

n

Gradient Boosting with Regression Trees

BlackBoostModel

b

n

S

C5.0 Classification

C50Model

f

Conditional Random Forests

CForestModel

f

n

S

Cox Regression

CoxModel

S

Cox Regression (Stepwise)

CoxStepAICModel

S

Multivariate Adaptive Regression Splines

EarthModel

f

n

Flexible Discriminant Analysis

FDAModel

f

Gradient Boosting with Additive Models

GAMBoostModel

b

n

S

Generalized Boosted Regression

GBMModel

f

n

S

Gradient Boosting with Linear Models

GLMBoostModel

b

n

S

Generalized Linear Models

GLMModel

b

n

Generalized Linear Models (Stepwise)

GLMStepAICModel

b

n

Lasso and Elastic-Net

GLMNetModel

f

m, n

S

K-Nearest Neighbors Model

KNNModel

f, o

n

Least Angle Regression

LARSModel

n

Linear Discriminant Analysis

LDAModel

f

Linear Model

LMModel

f

m, n

Mixture Discriminant Analysis

MDAModel

f

Naive Bayes Classifier

NaiveBayesModel

f

Feed-Forward Neural Networks

NNetModel

f

n

Penalized Discriminant Analysis

PDAModel

f

Partial Least Squares

PLSModel

f

n

Ordered Logistic Regression

POLRModel

o

Quadratic Discriminant Analysis

QDAModel

f

Random Forests

RandomForestModel

f

n

Fast Random Forests

RangerModel

f

n

S

Recursive Partitioning and Regression Trees

RPartModel

f

n

S

Stacked Regression

StackedModel

f, o

m, n

S

Super Learner

SuperModel

f, o

m, n

S

Parametric Survival

SurvRegModel

S

Parametric Survival (Stepwise)

SurvRegStepAICModel

S

Support Vector Machines

SVMModel

f

n

Support Vector Machines (ANOVA)

SVMANOVAModel

f

n

Support Vector Machines (Bessel)

SVMBesselModel

f

n

Support Vector Machines (Laplace)

SVMLaplaceModel

f

n

Support Vector Machines (Linear)

SVMLinearModel

f

n

Support Vector Machines (Poly)

SVMPolyModel

f

n

Support Vector Machines (Radial)

SVMRadialModel

f

n

Support Vector Machines (Spline)

SVMSplineModel

f

n

Support Vector Machines (Tanh)

SVMTanhModel

f

n

Regression and Classification Trees

TreeModel

f

n

Extreme Gradient Boosting

XGBModel

f

n

Extreme Gradient Boosting (DART)

XGBDARTModel

f

n

Extreme Gradient Boosting (Linear)

XGBLinearModel

f

n

Extreme Gradient Boosting (Tree)

XGBTreeModel

f

n

1 b = binary, f = factor, o = ordered

2 m = matrix, n = numeric

3 S = Surv

Installation

# Current release from CRAN
install.packages("MachineShop")

# Development version from GitHub
# install.packages("devtools")
devtools::install_github("brian-j-smith/MachineShop", ref = "develop")

# Development version with vignettes
devtools::install_github("brian-j-smith/MachineShop", ref = "develop", build_vignettes = TRUE)

Documentation

Once the package is installed, general documentation on its usage can be viewed with the following console commands.

library(MachineShop)

# Package help summary
?MachineShop

# Vignette
RShowDoc("Introduction", package = "MachineShop")

Parallel Computing

Resampling algorithms will be executed in parallel automatically if a parallel backend for the foreach package, such as doParallel, is loaded.

library(doParallel)
registerDoParallel(cores = 4)

Example

The following is a brief example illustrating use of the package to predict the species of flowers in Edgar Anderson’s iris data set.

Training and Test Set Analysis

## Load the package
library(MachineShop)
library(magrittr)

## Iris flower species (3 level response) data set
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

## Training and test sets
set.seed(123)
trainindices <- sample(nrow(iris), nrow(iris) * 2 / 3)
train <- iris[trainindices, ]
test <- iris[-trainindices, ]

## Model formula
fo <- Species ~ .

## Models by response type
modelinfo(factor(0)) %>% names
#>  [1] "AdaBagModel"       "AdaBoostModel"     "C50Model"         
#>  [4] "CForestModel"      "EarthModel"        "FDAModel"         
#>  [7] "GBMModel"          "GLMNetModel"       "KNNModel"         
#> [10] "LDAModel"          "LMModel"           "MDAModel"         
#> [13] "NaiveBayesModel"   "NNetModel"         "PDAModel"         
#> [16] "PLSModel"          "QDAModel"          "RandomForestModel"
#> [19] "RangerModel"       "RPartModel"        "StackedModel"     
#> [22] "SuperModel"        "SVMModel"          "SVMANOVAModel"    
#> [25] "SVMBesselModel"    "SVMLaplaceModel"   "SVMLinearModel"   
#> [28] "SVMPolyModel"      "SVMRadialModel"    "SVMSplineModel"   
#> [31] "SVMTanhModel"      "TreeModel"         "XGBModel"         
#> [34] "XGBDARTModel"      "XGBLinearModel"    "XGBTreeModel"

## Model-specific information
modelinfo(GBMModel)
#> $GBMModel
#> $GBMModel$label
#> [1] "Generalized Boosted Regression"
#> 
#> $GBMModel$packages
#> [1] "gbm"
#> 
#> $GBMModel$types
#> [1] "factor"  "numeric" "Surv"   
#> 
#> $GBMModel$arguments
#> function (distribution = NULL, n.trees = 100, interaction.depth = 1, 
#>     n.minobsinnode = 10, shrinkage = 0.1, bag.fraction = 0.5) 
#> NULL
#> 
#> $GBMModel$varimp
#> [1] TRUE

## Generalized boosted model fit to training set
gbmfit <- fit(fo, data = train, model = GBMModel)

## Variable importance
(vi <- varimp(gbmfit))
#>                 Overall
#> Petal.Length 100.000000
#> Petal.Width   14.601856
#> Sepal.Width    1.438558
#> Sepal.Length   0.000000

plot(vi)

## Test set predicted probabilities
predict(gbmfit, newdata = test, type = "prob") %>% head
#>         setosa   versicolor    virginica
#> [1,] 0.9999737 2.627994e-05 4.493784e-08
#> [2,] 0.9999154 8.456383e-05 4.205211e-09
#> [3,] 0.9999154 8.456383e-05 4.205211e-09
#> [4,] 0.9999737 2.627994e-05 4.493784e-08
#> [5,] 0.9998807 1.192834e-04 2.987679e-09
#> [6,] 0.9999024 9.764241e-05 2.445639e-09

## Test set predicted classifications
predict(gbmfit, newdata = test) %>% head
#> [1] setosa setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica

## Test set performance
obs <- response(fo, data = test)
pred <- predict(gbmfit, newdata = test, type = "prob")
performance(obs, pred)
#>  Accuracy     Kappa    ROCAUC     Brier 
#> 0.9200000 0.8793727 0.9837585 0.1502442

Resampling

## Resample estimation of model performance
(res <- resample(fo, data = iris, model = GBMModel, control = CVControl))
#> An object of class "Resamples"
#> 
#> Models: GBMModel
#> Stratification variable: (strata) 
#> 
#> An object from class "MLControl"
#> 
#> Name: CVControl
#> Label: K-Fold Cross-Validation
#> Folds: 10
#> Repeats: 1
#> Seed: 253452646

summary(res)
#>                Mean    Median         SD          Min       Max NA
#> Accuracy 0.95333333 0.9333333 0.03220306 9.333333e-01 1.0000000  0
#> Kappa    0.93000000 0.9000000 0.04830459 9.000000e-01 1.0000000  0
#> ROCAUC   0.98800000 1.0000000 0.02305120 9.333333e-01 1.0000000  0
#> Brier    0.08476969 0.1133233 0.05621648 5.140496e-05 0.1372037  0

plot(res)

Performance Metrics

## Default performance metrics
performance(res) %>% summary
#>                Mean    Median         SD          Min       Max NA
#> Accuracy 0.95333333 0.9333333 0.03220306 9.333333e-01 1.0000000  0
#> Kappa    0.93000000 0.9000000 0.04830459 9.000000e-01 1.0000000  0
#> ROCAUC   0.98800000 1.0000000 0.02305120 9.333333e-01 1.0000000  0
#> Brier    0.08476969 0.1133233 0.05621648 5.140496e-05 0.1372037  0

## All available metric functions
metricinfo() %>% names
#>  [1] "accuracy"        "brier"           "cindex"         
#>  [4] "cross_entropy"   "f_score"         "gini"           
#>  [7] "kappa2"          "mae"             "mse"            
#> [10] "msle"            "npv"             "ppv"            
#> [13] "pr_auc"          "precision"       "r2"             
#> [16] "recall"          "rmse"            "rmsle"          
#> [19] "roc_auc"         "roc_index"       "sensitivity"    
#> [22] "specificity"     "weighted_kappa2"

## Metrics available for resample output
metricinfo(res) %>% names
#> [1] "accuracy"      "brier"         "cross_entropy" "kappa2"       
#> [5] "pr_auc"        "roc_auc"

## User-specified metrics
performance(res, c(accuracy, kappa2)) %>% summary
#>               Mean    Median         SD       Min Max NA
#> accuracy 0.9533333 0.9333333 0.03220306 0.9333333   1  0
#> kappa2   0.9300000 0.9000000 0.04830459 0.9000000   1  0

Model Tuning

## Tune over a grid of model parameters
gbmtune <- tune(fo, data = iris, model = GBMModel,
                grid = expand.grid(n.trees = c(25, 50, 100),
                                   interaction.depth = 1:3,
                                   n.minobsinnode = c(5, 10)))

plot(gbmtune, type = "line")

## Fit the selected model
gbmtunefit <- fit(fo, data = iris, model = gbmtune)
varimp(gbmtunefit)
#>                Overall
#> Petal.Length 100.00000
#> Petal.Width   28.00921
#> Sepal.Width    2.30804
#> Sepal.Length   0.00000

Model Comparisons

## Model comparisons
control <- CVControl(folds = 10, repeats = 5)

gbmres <- resample(fo, data = iris, model = GBMModel(n.tree = 50), control = control)
rfres <- resample(fo, data = iris, model = RandomForestModel(ntree = 50), control = control)
nnetres <- resample(fo, data = iris, model = NNetModel(size = 5), control = control)

res <- Resamples(GBM = gbmres, RF = rfres, NNet = nnetres)
summary(res)
#> , , Accuracy
#> 
#>           Mean    Median         SD Min Max NA
#> GBM  0.9466667 0.9333333 0.05387480 0.8   1  0
#> NNet 0.9373333 1.0000000 0.11773370 0.4   1  0
#> RF   0.9560000 0.9666667 0.05321415 0.8   1  0
#> 
#> , , Kappa
#> 
#>       Mean Median         SD  Min Max NA
#> GBM  0.920   0.90 0.08081220  0.7   1  0
#> NNet 0.902   1.00 0.21236280 -0.3   1  0
#> RF   0.934   0.95 0.07982123  0.7   1  0
#> 
#> , , ROCAUC
#> 
#>           Mean Median         SD       Min Max NA
#> GBM  0.9897333      1 0.01962043 0.9200000   1  0
#> NNet 0.9661333      1 0.10392602 0.3000000   1  0
#> RF   0.9939333      1 0.01405674 0.9266667   1  0
#> 
#> , , Brier
#> 
#>            Mean      Median         SD          Min       Max NA
#> GBM  0.08282749 0.063761882 0.08832055 2.740079e-05 0.3850368  0
#> NNet 0.09151662 0.005049339 0.13523087 1.367731e-51 0.6666668  0
#> RF   0.06990080 0.048453333 0.07184654 1.600000e-04 0.3176000  0

plot(res)

## Pairwise model differences and t-tests
perfdiff <- diff(res)
summary(perfdiff)
#> , , Accuracy
#> 
#>              Mean Median         SD         Min        Max NA
#> GBM - NNet  0.008      0 0.09955912 -0.06666667 0.53333333  0
#> GBM - RF   -0.008      0 0.03198639 -0.06666667 0.06666667  0
#> NNet - RF  -0.016      0 0.11070584 -0.60000000 0.06666667  0
#> 
#> , , Kappa
#> 
#>              Mean Median         SD  Min Max NA
#> GBM - NNet  0.014      0 0.16538144 -0.1 0.9  0
#> GBM - RF   -0.012      0 0.04797959 -0.1 0.1  0
#> NNet - RF  -0.026      0 0.18273433 -1.0 0.1  0
#> 
#> , , ROCAUC
#> 
#>               Mean Median         SD         Min        Max NA
#> GBM - NNet  0.0236      0 0.10167049 -0.06666667 0.67333333  0
#> GBM - RF   -0.0042      0 0.01243231 -0.05333333 0.01666667  0
#> NNet - RF  -0.0278      0 0.10178366 -0.68000000 0.05333333  0
#> 
#> , , Brier
#> 
#>                    Mean        Median        SD         Min       Max NA
#> GBM - NNet -0.008689127  0.0007342389 0.1032869 -0.49047781 0.1318379  0
#> GBM - RF    0.012926694  0.0037022372 0.0339893 -0.04799376 0.1196612  0
#> NNet - RF   0.021615821 -0.0011220887 0.1108912 -0.13237306 0.5181868  0

t.test(perfdiff)
#> An object of class "HTestPerformanceDiff"
#> 
#> Upper diagonal: mean differences (row - column)
#> Lower diagonal: p-values
#> P-value adjustment method: holm
#> 
#> , , Accuracy
#> 
#>            GBM      NNet     RF
#> GBM         NA 0.0080000 -0.008
#> NNet 0.6236368        NA -0.016
#> RF   0.2495969 0.6236368     NA
#> 
#> , , Kappa
#> 
#>            GBM      NNet     RF
#> GBM         NA 0.0140000 -0.012
#> NNet 0.6386268        NA -0.026
#> RF   0.2495969 0.6386268     NA
#> 
#> , , ROCAUC
#> 
#>             GBM      NNet      RF
#> GBM          NA 0.0236000 -0.0042
#> NNet 0.11847985        NA -0.0278
#> RF   0.06239545 0.1184798      NA
#> 
#> , , Brier
#> 
#>             GBM         NNet         RF
#> GBM          NA -0.008689127 0.01292669
#> NNet 0.55467315           NA 0.02161582
#> RF   0.02928367  0.348712945         NA

plot(perfdiff)

Ensemble Models

## Stacked regression
stackedres <- resample(fo, data = iris, model = StackedModel(GBMModel, RandomForestModel, NNetModel))
summary(stackedres)
#>                Mean     Median         SD          Min       Max NA
#> Accuracy 0.95333333 0.96666667 0.05488484 8.666667e-01 1.0000000  0
#> Kappa    0.93000000 0.95000000 0.08232726 8.000000e-01 1.0000000  0
#> ROCAUC   0.99866667 1.00000000 0.00421637 9.866667e-01 1.0000000  0
#> Brier    0.07104383 0.04671106 0.07601605 4.280279e-05 0.1982373  0

## Super learner
superres <- resample(fo, data = iris, model = SuperModel(GBMModel, RandomForestModel, NNetModel))
summary(superres)
#>                Mean    Median         SD          Min       Max NA
#> Accuracy 0.95333333 0.9666667 0.06324555 8.000000e-01 1.0000000  0
#> Kappa    0.93000000 0.9500000 0.09486833 7.000000e-01 1.0000000  0
#> ROCAUC   0.98133333 1.0000000 0.03155243 9.200000e-01 1.0000000  0
#> Brier    0.06817479 0.0322495 0.09179246 4.466662e-06 0.2782804  0

Calibration Curves

cal <- calibration(res)
plot(cal, se = TRUE)

Confusion Matrices

(conf <- confusion(gbmres, cutoff = NULL))
#> GBMModel :
#>             Observed
#> Predicted          setosa   versicolor    virginica
#>   setosa     249.24113696   0.25037765   0.09683350
#>   versicolor   0.74235768 228.85744589  23.18256928
#>   virginica    0.01650536  20.89217646 226.72059721

summary(conf)
#> GBMModel :
#> Number of responses: 750
#> Accuracy (SE): 0.9397589 (0.008688084)
#> Majority class: 0.3333333
#> Kappa: 0.9096384
#> 
#>                setosa versicolor virginica
#> Observed    0.3333333  0.3333333 0.3333333
#> Predicted   0.3327845  0.3370432 0.3301724
#> Agreement   0.3323215  0.3051433 0.3022941
#> Sensitivity 0.9969645  0.9154298 0.9068824
#> Specificity 0.9993056  0.9521501 0.9581826
#> PPV         0.9986089  0.9053537 0.9155646
#> NPV         0.9984835  0.9574783 0.9536609
plot(conf)

Partial Dependence Plots

pd <- dependence(gbmfit, select = c(Petal.Length, Petal.Width))
plot(pd)

Lift Curves

## Requires a binary outcome
fo_versicolor <- factor(Species == "versicolor") ~ .
control = CVControl()

gbmres_versicolor <- resample(fo_versicolor, data = iris,  model = GBMModel, control = control)
lf <- lift(gbmres_versicolor)
plot(lf)

rfres_versicolor <- resample(fo_versicolor, data = iris,  model = RandomForestModel, control = control)
nnetres_versicolor <- resample(fo_versicolor, data = iris,  model = NNetModel, control = control)

res_versicolor <- Resamples(gbmres_versicolor, rfres_versicolor, nnetres_versicolor)
lf <- lift(res_versicolor)
plot(lf, find = 75)

Preprocessing Recipes

library(recipes)

rec <- recipe(fo, data = iris) %>%
  add_role(Species, new_role = "case_strata") %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors())

fit_rec <- fit(rec, model = GBMModel)
varimp(fit_rec)
#>        Overall
#> PC1 100.000000
#> PC3   4.946313
#> PC2   1.089649
#> PC4   0.000000

res_rec <- resample(rec, model = GBMModel, control = CVControl)
summary(res_rec)
#>                Mean     Median         SD         Min       Max NA
#> Accuracy 0.96000000 0.96666667 0.04661373 0.866666667 1.0000000  0
#> Kappa    0.94000000 0.95000000 0.06992059 0.800000000 1.0000000  0
#> ROCAUC   0.99866667 1.00000000 0.00421637 0.986666667 1.0000000  0
#> Brier    0.06446083 0.04011043 0.05647578 0.003933401 0.1399515  0