Once the data set is ready for model development, the model is fitted, used for prediction, and evaluated in the following ways. The alookr package makes these steps fast and easy:
BreastCancer of the mlbench package is a breast cancer data set. The objective is to identify each of a number of benign or malignant classes.

A data frame with 699 observations on 11 variables: one character variable, 9 ordered or nominal factors, and 1 target class:

- Id : character. Sample code number
- Cl.thickness : ordered factor. Clump Thickness
- Cell.size : ordered factor. Uniformity of Cell Size
- Cell.shape : ordered factor. Uniformity of Cell Shape
- Marg.adhesion : ordered factor. Marginal Adhesion
- Epith.c.size : ordered factor. Single Epithelial Cell Size
- Bare.nuclei : factor. Bare Nuclei
- Bl.cromatin : factor. Bland Chromatin
- Normal.nucleoli : factor. Normal Nucleoli
- Mitoses : factor. Mitoses
- Class : factor. Class. Levels are benign and malignant.

library(mlbench)
data(BreastCancer)
# class of each variable
sapply(BreastCancer, function(x) class(x)[1])
          Id  Cl.thickness     Cell.size    Cell.shape Marg.adhesion 
 "character"     "ordered"     "ordered"     "ordered"     "ordered" 
Epith.c.size   Bare.nuclei   Bl.cromatin Normal.nucleoli     Mitoses 
   "ordered"      "factor"      "factor"        "factor"    "factor" 
       Class 
    "factor" 
Perform data preprocessing as follows:

dlookr::imputate_na()

Find the variables that include missing values, and impute them using imputate_na() in the dlookr package.
library(dlookr)
library(dplyr)
# variable that have a missing value
diagnose(BreastCancer) %>%
  filter(missing_count > 0)
# A tibble: 1 x 6
  variables   types  missing_count missing_percent unique_count unique_rate
  <chr>       <chr>          <int>           <dbl>        <int>       <dbl>
1 Bare.nuclei factor            16            2.29           11      0.0157
# imputation of missing values
breastCancer <- BreastCancer %>%
  mutate(Bare.nuclei = imputate_na(BreastCancer, Bare.nuclei, Class,
                                   method = "mice", no_attrs = TRUE,
                                   print_flag = FALSE))
split_by()
split_by() in the alookr package splits the dataset into a train set and a test set. The ratio argument of split_by() specifies the proportion of the train set. split_by() returns an object of class split_df.
library(alookr)
# split the data into a train set and a test set by default arguments
sb <- breastCancer %>%
  split_by(target = Class)

# show the class name
class(sb)
[1] "split_df"   "grouped_df" "tbl_df"     "tbl"        "data.frame"

# split the data into a train set and a test set by ratio = 0.6
tmp <- breastCancer %>%
  split_by(Class, ratio = 0.6)
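The summary output below reports a random seed, so the split is randomized. If you need a reproducible split, split_by() accepts a seed argument (an assumption here; verify with ?split_by):

# reproducible split (seed argument assumed; see ?split_by)
sb_fixed <- breastCancer %>%
  split_by(Class, seed = 123)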
The summary() function displays the following useful information about the split_df object:
# summary() displays some information
summary(sb)
** Split train/test set information **
+ random seed : 32364
+ split data
- train set count : 489
- test set count : 210
+ target variable : Class
- minority class : malignant (0.344778)
- majority class : benign (0.655222)
# summary() displays some information
summary(tmp)
** Split train/test set information **
+ random seed : 27827
+ split data
- train set count : 419
- test set count : 280
+ target variable : Class
- minority class : malignant (0.344778)
- majority class : benign (0.655222)
For categorical variables, when the data are separated into a train set and a test set, a specific level may be missing from the train set. In this case, fitting the model is not a problem, but an error occurs when predicting with the fitted model. Therefore, preprocessing is performed so that no level is missing from the train set.

In the following example, fortunately, there is no categorical variable that contains missing levels in the train set.
# list of categorical variables in the train set that contain missing levels
nolevel_in_train <- sb %>%
  compare_target_category() %>%
  filter(is.na(train)) %>%
  select(variable) %>%
  unique() %>%
  pull

nolevel_in_train
character(0)
# if any of the categorical variables in the train set contain a missing level,
# split them again.
while (length(nolevel_in_train) > 0) {
  sb <- breastCancer %>%
    split_by(Class)

  nolevel_in_train <- sb %>%
    compare_target_category() %>%
    filter(is.na(train)) %>%
    select(variable) %>%
    unique() %>%
    pull
}
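The problem this loop guards against is easy to reproduce with a small, purely hypothetical base-R sketch:

# hypothetical: a factor level that never reaches the train set
x <- factor(c("a", "a", "b", "c"))
train_x <- droplevels(x[1:3])   # the split happened to exclude every "c"
levels(train_x)                 # "a" "b"
# a model fitted on train_x cannot score a test observation with level "c"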
sampling_target()
Imbalanced class (level) data means that the frequency of one level of the target variable is relatively small. In general, the proportion of the positive class is relatively small. For example, in a model that predicts spam, the class of interest, spam, is less frequent than non-spam.

Imbalanced class data is a common problem in machine learning classification.

table() and prop.table() are traditionally useful functions for diagnosing imbalanced class data. However, alookr's summary() is simpler and provides more information.
# train set frequency table - imbalanced classes data
table(sb$Class)
   benign malignant 
      458       241 
# train set relative frequency table - imbalanced classes data
prop.table(table(sb$Class))
   benign malignant 
0.6552217 0.3447783 
# using summary function - imbalanced classes data
summary(sb)
** Split train/test set information **
+ random seed : 32364
+ split data
- train set count : 489
- test set count : 210
+ target variable : Class
- minority class : malignant (0.344778)
- majority class : benign (0.655222)
Most machine learning algorithms work best when the number of samples in each class is about equal, and most algorithms are designed to maximize accuracy and reduce error. So we require a way to handle the imbalanced class problem.
sampling_target() performs sampling to solve the imbalanced class data problem.

Oversampling can be defined as adding more copies of the minority class. Oversampling is performed by specifying "ubOver" in the method argument of the sampling_target() function.
# to balance by oversampling
train_over <- sb %>%
  sampling_target(method = "ubOver")

# frequency table
table(train_over$Class)

   benign malignant 
      312       312 
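In base-R terms, oversampling just replicates minority-class rows with replacement until the two classes match. An illustrative sketch (not what ubOver does internally, but the same idea):

# illustrative only: duplicate minority indices until balanced
idx_min <- which(breastCancer$Class == "malignant")
n_extra <- sum(breastCancer$Class == "benign") - length(idx_min)
balanced <- breastCancer[c(seq_len(nrow(breastCancer)),
                           sample(idx_min, n_extra, replace = TRUE)), ]
table(balanced$Class)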
Undersampling can be defined as removing some observations of the majority class. Undersampling is performed by specifying "ubUnder" in the method argument of the sampling_target() function.
# to balance by undersampling
train_under <- sb %>%
  sampling_target(method = "ubUnder")

# frequency table
table(train_under$Class)

   benign malignant 
      177       177 
SMOTE (Synthetic Minority Oversampling Technique) uses a nearest neighbors algorithm to generate new, synthetic data. SMOTE is performed by specifying "ubSMOTE" in the method argument of the sampling_target() function.
# to balance by SMOTE
train_smote <- sb %>%
  sampling_target(seed = 1234L, method = "ubSMOTE")

# frequency table
table(train_smote$Class)

   benign malignant 
      708       531 
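Conceptually, each synthetic SMOTE point lies on the line segment between a minority observation and one of its nearest minority neighbors. A minimal numeric sketch of that interpolation (illustrative only, not the ubSMOTE implementation):

# synthetic point between a minority sample and a nearest neighbor
x_i   <- c(2, 3)      # a minority-class observation
x_nn  <- c(3, 5)      # one of its k nearest minority neighbors
gap   <- runif(1)     # random weight in [0, 1]
x_new <- x_i + gap * (x_nn - x_i)
x_new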
cleanse()
The cleanse() function cleanses the dataset for classification modeling. It is useful when fitting a classification model and does the following:

- removes variables whose unique value is one
- removes variables with a high unique rate
- converts character variables to categorical (factor) variables

In this example, the cleanse() function removed the variable Id, which has a high unique rate.
# clean the training set
train <- train_smote %>%
  cleanse
─ Checking unique value ────────────── unique value is one ─
No variables that unique value is one.

─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate
● Id = 431(0.347861178369653)

─ Checking character variables ──────────── categorical data ─
No character variables.
extract_set()

extract_set() extracts the train set or the test set from the split_df class object.

# extract test set
test <- sb %>%
  extract_set(set = "test")
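By symmetry, the train set can presumably be pulled with set = "train" (an assumption, as only "test" is shown here; check ?extract_set):

# extract train set (set = "train" assumed by symmetry)
train_raw <- sb %>%
  extract_set(set = "train")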
run_models()
run_models() performs several representative binary classification modelings using the split_df object created by split_by().

run_models() executes the fitting process in parallel. However, parallelism is not supported on the MS Windows operating system or in the RStudio environment.
Currently supported algorithms are as follows:

- logistic : logistic regression using the stats package
- rpart : recursive partitioning trees using the rpart package
- ctree : conditional inference trees using the party package
- randomForest : classification with random forest using the randomForest package
- ranger : a fast implementation of random forests using the ranger package
- xgboost : extreme gradient boosting using the xgboost package

run_models() returns a model_df class object.
The model_df class object contains the following variables:

- step : character. The current stage of the modeling process. For an object created by run_models(), the value of the variable is "1.Fitted".
- model_id : character. The name of the model algorithm.
- target : character. The name of the target variable.
- is_factor : logical. Whether the target variable is a factor.
- positive : character. The level of the positive class.
- negative : character. The level of the negative class.
- fitted_model : list. The fitted model object.

result <- train %>%
  run_models(target = "Class", positive = "malignant")

result
# A tibble: 6 x 7
  step     model_id     target is_factor positive  negative fitted_model
  <chr>    <chr>        <chr>  <lgl>     <chr>     <chr>    <list>      
1 1.Fitted logistic Class TRUE malignant benign <glm>
2 1.Fitted rpart Class TRUE malignant benign <rpart>
3 1.Fitted ctree Class TRUE malignant benign <BinaryTr>
4 1.Fitted randomForest Class TRUE malignant benign <rndmFrs.>
5 1.Fitted ranger Class TRUE malignant benign <ranger>
6 1.Fitted xgboost Class TRUE malignant benign <xgb.Bstr>
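Because fitted_model is a list column, an individual fitted model can be extracted with ordinary dplyr subsetting, e.g. (a sketch using only functions shown elsewhere in this document):

# inspect the fitted logistic regression (a glm object)
model_logistic <- result %>%
  filter(model_id == "logistic") %>%
  pull(fitted_model) %>%
  "[["(1)
summary(model_logistic)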
Evaluate the predictive performance of fitted models.
run_predict()
run_predict() predicts the test set using the model_df class object fitted by run_models().

run_predict() is executed in parallel when predicting by model. However, parallelism is not supported on the MS Windows operating system or in the RStudio environment.
The model_df class object contains the same variables as the run_models() output, plus:

- step : character. For an object created by run_predict(), the value of the variable is "2.Predicted".
- predicted : list. The predicted values.

pred <- result %>%
  run_predict(test)

pred
# A tibble: 6 x 8
  step     model_id   target is_factor positive  negative fitted_model predicted
  <chr>    <chr>      <chr>  <lgl>     <chr>     <chr>    <list>       <list>   
1 2.Predi… logistic Class TRUE malignant benign <glm> <fct [21…
2 2.Predi… rpart Class TRUE malignant benign <rpart> <fct [21…
3 2.Predi… ctree Class TRUE malignant benign <BinaryTr> <fct [21…
4 2.Predi… randomFor… Class TRUE malignant benign <rndmFrs.> <fct [21…
5 2.Predi… ranger Class TRUE malignant benign <ranger> <fct [21…
6 2.Predi… xgboost Class TRUE malignant benign <xgb.Bstr> <fct [21…
run_performance()
run_performance() calculates the performance metrics of the model_df class object predicted by run_predict().

run_performance() is executed in parallel when calculating the performance evaluation metrics. However, parallelism is not supported on the MS Windows operating system or in the RStudio environment.

The model_df class object contains the same variables as the run_predict() output, plus:

- step : character. For an object created by run_performance(), the value of the variable is "3.Performanced".
- performance : list. The performance metrics, stored as a numeric vector per model.

# Calculate performance metrics.
perf <- run_performance(pred)

perf
# A tibble: 6 x 7
  step          model_id     target positive fitted_model predicted  performance
  <chr>         <chr>        <chr>  <chr>    <list>       <list>     <list>     
1 3.Performanc… logistic Class maligna… <glm> <fct [210… <dbl [15]>
2 3.Performanc… rpart Class maligna… <rpart> <fct [210… <dbl [15]>
3 3.Performanc… ctree Class maligna… <BinaryTr> <fct [210… <dbl [15]>
4 3.Performanc… randomForest Class maligna… <rndmFrs.> <fct [210… <dbl [15]>
5 3.Performanc… ranger Class maligna… <ranger> <fct [210… <dbl [15]>
6 3.Performanc… xgboost Class maligna… <xgb.Bstr> <fct [210… <dbl [15]>
The performance variable contains a list object, which contains 15 performance metrics:
# Performance by analytics models
performance <- perf$performance
names(performance) <- perf$model_id

performance
$logistic
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.06190476  0.93809524  0.85915493  0.95312500  0.95312500  0.93150685 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.90370370  0.90370370  2.13815281  0.94167380  0.88762842  0.02620536 
    LiftAUC     GainAUC     KS_Stat 
 1.23148426  0.80706845 88.46318493 

$rpart
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.05238095  0.94761905  0.90769231  0.92187500  0.92187500  0.95890411 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.91472868  0.91472868  0.21225380  0.94750642  0.90517979  0.89276781 
    LiftAUC     GainAUC     KS_Stat 
 2.20619128  0.81112351 88.07791096 

$ctree
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.05238095  0.94761905  0.90769231  0.92187500  0.92187500  0.95890411 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.91472868  0.91472868  0.93401643  0.95649615  0.91588185  0.64694926 
    LiftAUC     GainAUC     KS_Stat 
 1.95823369  0.81737351 89.83304795 

$randomForest
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.04285714  0.95714286  0.91044776  0.95312500  0.95312500  0.95890411 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.93129771  0.93129771  0.11905365  0.98983305  0.97988014  0.69157170 
    LiftAUC     GainAUC     KS_Stat 
 1.96479234  0.84055060 94.52054795 

$ranger
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.03809524  0.96190476  0.92424242  0.95312500  0.95312500  0.96575342 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.93846154  0.93846154  0.10293743  0.99090325  0.98180651  0.89824963 
    LiftAUC     GainAUC     KS_Stat 
 2.12464131  0.84129464 95.20547945 

$xgboost
ZeroOneLoss    Accuracy   Precision      Recall Sensitivity Specificity 
 0.04285714  0.95714286  0.92307692  0.93750000  0.93750000  0.96575342 
   F1_Score Fbeta_Score     LogLoss         AUC        Gini       PRAUC 
 0.93023256  0.93023256  0.13026789  0.98651541  0.97281678  0.88623641 
    LiftAUC     GainAUC     KS_Stat 
 2.13764399  0.83824405 95.20547945 
If you change the list object to tidy format, you’ll see the following at a glance:
# Convert to a matrix to compare performance.
sapply(performance, "c")
               logistic       rpart       ctree randomForest      ranger
ZeroOneLoss  0.06190476  0.05238095  0.05238095   0.04285714  0.03809524
Accuracy     0.93809524  0.94761905  0.94761905   0.95714286  0.96190476
Precision    0.85915493  0.90769231  0.90769231   0.91044776  0.92424242
Recall       0.95312500  0.92187500  0.92187500   0.95312500  0.95312500
Sensitivity  0.95312500  0.92187500  0.92187500   0.95312500  0.95312500
Specificity  0.93150685  0.95890411  0.95890411   0.95890411  0.96575342
F1_Score     0.90370370  0.91472868  0.91472868   0.93129771  0.93846154
Fbeta_Score  0.90370370  0.91472868  0.91472868   0.93129771  0.93846154
LogLoss      2.13815281  0.21225380  0.93401643   0.11905365  0.10293743
AUC          0.94167380  0.94750642  0.95649615   0.98983305  0.99090325
Gini         0.88762842  0.90517979  0.91588185   0.97988014  0.98180651
PRAUC        0.02620536  0.89276781  0.64694926   0.69157170  0.89824963
LiftAUC      1.23148426  2.20619128  1.95823369   1.96479234  2.12464131
GainAUC      0.80706845  0.81112351  0.81737351   0.84055060  0.84129464
KS_Stat     88.46318493 88.07791096 89.83304795  94.52054795 95.20547945
                xgboost
ZeroOneLoss  0.04285714
Accuracy     0.95714286
Precision    0.92307692
Recall       0.93750000
Sensitivity  0.93750000
Specificity  0.96575342
F1_Score     0.93023256
Fbeta_Score  0.93023256
LogLoss      0.13026789
AUC          0.98651541
Gini         0.97281678
PRAUC        0.88623641
LiftAUC      2.13764399
GainAUC      0.83824405
KS_Stat     95.20547945
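If you prefer a long, genuinely tidy layout instead of a matrix, a tidyr-based sketch like the following also works (assuming the tibble and tidyr packages; the column names metric, model, and value are my own choices):

library(tibble)
library(tidyr)

# one row per model/metric pair
perf_long <- performance %>%
  as_tibble() %>%
  mutate(metric = names(performance[[1]])) %>%
  pivot_longer(-metric, names_to = "model", values_to = "value")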
compare_performance() returns a list object (the results of comparing model performance). The list has the following components:

- recommend_model : character. The name of the model recommended as the best.
- top_metric_count : numeric. The number of performance metrics for which each model performs best.
- mean_rank : numeric. The average rank of each model across the individual performance metrics.
- top_metric : list. The names of the performance metrics for which each model performs best.

In this example, compare_performance() recommends the "ranger" model.
# Compare the performance metrics of each model
comp_perf <- compare_performance(pred)

comp_perf
$recommend_model
[1] "ranger"

$top_metric_count
    logistic        rpart        ctree randomForest       ranger      xgboost 
           1            1            0            1           12            2 

$mean_rank
    logistic        rpart        ctree randomForest       ranger      xgboost 
    5.615385     4.269231     4.500000     2.692308     1.307692     2.615385 

$top_metric
$top_metric$logistic
[1] "Recall"

$top_metric$rpart
[1] "LiftAUC"

$top_metric$ctree
NULL

$top_metric$randomForest
[1] "Recall"

$top_metric$ranger
 [1] "ZeroOneLoss" "Accuracy"    "Precision"   "Recall"      "Specificity"
 [6] "F1_Score"    "LogLoss"     "AUC"         "Gini"        "PRAUC"      
[11] "GainAUC"     "KS_Stat"    

$top_metric$xgboost
[1] "Specificity" "KS_Stat"
plot_performance()
plot_performance() draws the ROC curve of each model.
# Plot ROC curve
plot_performance(pred)
In general, if the predicted probability is greater than 0.5 in a binary classification model, the observation is predicted as the positive class. In other words, 0.5 is used as the cut-off value. This applies to most model algorithms. However, in some cases the performance can be tuned by changing the cut-off value.
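Applying a cut-off by hand is just a comparison against the positive-class probabilities, which may make the tuning below easier to follow (a base-R sketch with a hypothetical probability vector):

# classify with the default cut-off of 0.5 (prob is hypothetical here)
prob <- c(0.10, 0.40, 0.60, 0.90)
ifelse(prob > 0.5, "malignant", "benign")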
plot_cutoff() visualizes a plot for selecting the cut-off value and returns the chosen cut-off value.
pred_best <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  select(predicted) %>%
  pull %>%
  .[[1]] %>%
  attr("pred_prob")

cutoff <- plot_cutoff(pred_best, test$Class, "malignant", type = "mcc")
cutoff
[1] 0.38

cutoff2 <- plot_cutoff(pred_best, test$Class, "malignant", type = "density")
cutoff2
[1] 0.9804

cutoff3 <- plot_cutoff(pred_best, test$Class, "malignant", type = "prob")
cutoff3
[1] 0.38
performance_metric()
Compare the performance of the original prediction with that of the tuned cut-off. That is, for the best-performing model, comp_perf$recommend_model, compare the predictions with and without the cut-off.
comp_perf$recommend_model
[1] "ranger"
# extract predicted probability
idx <- which(pred$model_id == comp_perf$recommend_model)
pred_prob <- attr(pred$predicted[[idx]], "pred_prob")

# or, extract predicted probability using dplyr
pred_prob <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  select(predicted) %>%
  pull %>%
  "[["(1) %>%
  attr("pred_prob")
# predicted probability
pred_prob
  [1] 0.000000e+00 8.956992e-01 0.000000e+00 0.000000e+00 4.510317e-03
[6] 0.000000e+00 9.090730e-01 9.968857e-01 2.380952e-05 0.000000e+00
[11] 0.000000e+00 5.457875e-05 0.000000e+00 9.993333e-01 9.576333e-01
[16] 9.957397e-01 9.968857e-01 7.762913e-01 8.746698e-01 0.000000e+00
[21] 3.981810e-01 9.692762e-01 9.166063e-01 1.718810e-02 2.644444e-03
[26] 6.176984e-03 0.000000e+00 0.000000e+00 1.891317e-01 0.000000e+00
[31] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
[36] 0.000000e+00 9.941833e-01 9.921389e-01 0.000000e+00 9.251929e-01
[41] 5.418254e-03 2.289040e-01 9.064603e-01 7.496825e-03 0.000000e+00
[46] 1.000000e+00 9.524913e-01 0.000000e+00 0.000000e+00 4.177778e-02
[51] 0.000000e+00 0.000000e+00 9.759103e-01 0.000000e+00 0.000000e+00
[56] 9.725032e-01 4.195738e-01 0.000000e+00 3.135714e-03 5.457875e-05
[61] 9.772500e-01 0.000000e+00 0.000000e+00 0.000000e+00 9.771071e-01
[66] 5.457875e-05 5.457875e-05 9.968333e-01 9.782214e-01 9.988000e-01
[71] 1.964183e-01 7.760404e-02 0.000000e+00 9.369984e-01 1.647016e-01
[76] 9.995278e-01 9.911000e-01 0.000000e+00 9.071000e-01 9.993500e-01
[81] 9.988000e-01 8.822563e-01 0.000000e+00 9.813341e-01 8.822563e-01
[86] 8.161373e-01 0.000000e+00 0.000000e+00 9.251659e-01 9.994286e-01
[91] 9.913381e-01 9.497095e-01 4.665143e-01 7.142825e-01 1.000000e+00
[96] 0.000000e+00 0.000000e+00 9.361508e-01 0.000000e+00 5.457875e-05
[101] 8.900730e-01 9.418556e-01 1.409167e-01 3.884810e-01 9.388579e-01
[106] 9.971056e-01 9.838714e-01 0.000000e+00 5.457875e-05 0.000000e+00
[111] 1.551230e-01 0.000000e+00 7.266127e-01 6.535778e-01 1.000000e+00
[116] 0.000000e+00 9.986032e-01 0.000000e+00 3.011111e-03 0.000000e+00
[121] 0.000000e+00 1.297381e-02 0.000000e+00 9.953889e-01 0.000000e+00
[126] 0.000000e+00 9.992063e-01 0.000000e+00 0.000000e+00 0.000000e+00
[131] 9.991429e-01 0.000000e+00 9.510087e-01 2.837349e-01 0.000000e+00
[136] 9.992921e-01 9.943722e-01 0.000000e+00 1.179206e-02 0.000000e+00
[141] 0.000000e+00 9.224413e-01 0.000000e+00 0.000000e+00 1.323437e-01
[146] 0.000000e+00 9.899183e-01 4.314390e-01 0.000000e+00 5.457875e-05
[151] 5.441270e-03 0.000000e+00 0.000000e+00 2.027269e-04 1.000000e+00
[156] 0.000000e+00 3.548793e-02 5.457875e-05 0.000000e+00 0.000000e+00
[161] 0.000000e+00 9.966286e-01 0.000000e+00 0.000000e+00 0.000000e+00
[166] 0.000000e+00 0.000000e+00 0.000000e+00 1.116667e-03 5.457875e-05
[171] 0.000000e+00 0.000000e+00 0.000000e+00 5.457875e-05 0.000000e+00
[176] 0.000000e+00 2.380952e-05 0.000000e+00 0.000000e+00 0.000000e+00
[181] 5.208492e-02 9.949151e-01 0.000000e+00 0.000000e+00 0.000000e+00
[186] 9.987333e-01 0.000000e+00 3.877619e-02 0.000000e+00 0.000000e+00
[191] 0.000000e+00 0.000000e+00 0.000000e+00 9.989921e-01 5.457875e-05
[196] 1.326817e-01 0.000000e+00 1.350404e-02 1.029524e-02 0.000000e+00
[201] 0.000000e+00 9.959373e-01 0.000000e+00 0.000000e+00 0.000000e+00
[206] 0.000000e+00 7.363492e-03 0.000000e+00 1.000000e+00 8.456063e-01
# compare Accuracy
performance_metric(pred_prob, test$Class, "malignant", "Accuracy")
[1] 0.9619048

performance_metric(pred_prob, test$Class, "malignant", "Accuracy",
                   cutoff = cutoff)
[1] 0.9666667
# compare Confusion Matrix
performance_metric(pred_prob, test$Class, "malignant", "ConfusionMatrix")
           actual
predict     benign malignant
  benign       141         3
  malignant      5        61

performance_metric(pred_prob, test$Class, "malignant", "ConfusionMatrix",
                   cutoff = cutoff)
           actual
predict     benign malignant
  benign       139         0
  malignant      7        64
# compare F1 Score
performance_metric(pred_prob, test$Class, "malignant", "F1_Score")
[1] 0.9384615

performance_metric(pred_prob, test$Class, "malignant", "F1_Score",
                   cutoff = cutoff)
[1] 0.9481481

performance_metric(pred_prob, test$Class, "malignant", "F1_Score",
                   cutoff = cutoff2)
[1] 0.6938776
If the performance of the tuned cut-off is good, use it as the cut-off to predict positives.

If you have selected a good model from several models, perform the prediction with that model.

Create sample data for prediction by extracting 50 samples from the data set used in the previous undersampling example.
data_pred <- train_under %>%
  cleanse
─ Checking unique value ────────────── unique value is one ─
No variables that unique value is one.

─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate
● Id = 346(0.977401129943503)

─ Checking character variables ──────────── categorical data ─
No character variables.

set.seed(1234L)
data_pred <- data_pred %>%
  nrow %>%
  seq %>%
  sample(size = 50) %>%
  data_pred[., ]
Predict using the dplyr package. The last factor() function eliminates unnecessary information (the attributes attached to the predicted values).
pred_actual <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  run_predict(data_pred) %>%
  select(predicted) %>%
  pull %>%
  "[["(1) %>%
  factor()

pred_actual
 [1] malignant benign    benign    benign    benign    benign    malignant
 [8] benign    benign    malignant malignant malignant malignant benign   
[15] malignant malignant benign    benign    malignant benign    benign   
[22] malignant malignant benign    benign    malignant malignant malignant
[29] benign    malignant benign    benign    malignant benign    malignant
[36] benign    malignant benign    benign    malignant benign    malignant
[43] benign    benign    benign    malignant benign    malignant benign   
[50] benign   
Levels: benign malignant
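Since data_pred was drawn from train_under, it still carries the true Class labels, so the predictions can be cross-tabulated against them (a quick sanity check; the counts depend on the random sample):

# cross-tabulate predictions against the known labels of the sample
table(actual = data_pred$Class, predicted = pred_actual)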
If you want to predict by cut-off, specify the cutoff argument in the run_predict() function as follows:

In this example, there is no difference between the results with and without the cut-off.
pred_actual2 <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  run_predict(data_pred, cutoff) %>%
  select(predicted) %>%
  pull %>%
  "[["(1) %>%
  factor()

pred_actual2
 [1] malignant benign    benign    benign    benign    benign    malignant
 [8] benign    benign    malignant malignant malignant malignant benign   
[15] malignant malignant benign    benign    malignant benign    benign   
[22] malignant malignant benign    benign    malignant malignant malignant
[29] benign    malignant benign    benign    malignant benign    malignant
[36] benign    malignant benign    benign    malignant benign    malignant
[43] benign    benign    benign    malignant benign    malignant benign   
[50] benign   
Levels: benign malignant

sum(pred_actual != pred_actual2)
[1] 0