Abstract

In this vignette, we use **repeated cross-validation** to tune the hyperparameters of a **custom model function** with `cross_validate_fn()`

. Additionally, we learn to preprocess the training and test folds within the cross-validation.

Examples of model functions, predict functions and preprocess functions are available in `model_functions()`

, `predict_functions()`

, and `preprocess_functions()`

. These can be used directly or as starting points.

Contact the author at r-pkgs@ludvigolsen.dk

Where `cross_validate()`

only allows us to cross-validate the `lm()`

, `lmer()`

, `glm()`

, and `glmer()`

model functions, `cross_validate_fn()`

can cross-validate **any** model function that can be wrapped in a specific function and predict new data. This means, we can cross-validate *support-vector machines*, *neural networks*, *naïve Bayes*, etc., and compare them.

As these model functions are all a bit different and return predictions in different formats, we need to wrap them in a specific set of functions that `cross_validate_fn()`

knows how to deal with. That requires a bit more work than using `cross_validate()`

but is very flexible.

In this vignette, we will learn how to convert our model to a model function and associated predict function. After doing this once, you can save them in a separate R script and reuse them in future projects.

Once you’ve completed this tutorial, you will be able to:

- Convert your model to a model function and predict function.
- Cross-validate your custom model function.
- Tune the hyperparameters of your model.
- Preprocess the training and test sets within the cross-validation.

We start by attaching the needed packages and setting a random seed for reproducibility:

```
library(cvms)
library(groupdata2) # fold()
library(dplyr)
library(knitr) # kable() : formats the output as a table
library(e1071) # svm()
set.seed(1)
```

We could enable *parallelization* to speed up the `fold()`

and `cross_validate_fn()`

functions. Uncomment the below and set `parallel = TRUE`

when calling those functions.

For the regression and binary classification examples, we will use the `participant.scores`

dataset from `cvms`

. It features 10 participants who partook in an incredibly fictional study with three sessions of a task. Some of the participants have the diagnosis ** 1** (sounds scary!) and they all got a score after each session.

In order to use the data for cross-validation, we need to divide it into folds. For this we use the `fold()`

function from `groupdata2`

. We create 3 folds as it is a small dataset. With more data, it is common to use 10 folds (although that is a bit arbitrary).

As the randomness in this data splitting process can influence the model comparison, we do it multiple times and average the results. This is called **repeated cross-validation**. By setting `num_fold_cols = 5`

, `fold()`

will create 5 *unique* fold columns. Unless your model takes a long time to fit, it’s common to use 10-100 repetitions, but we pick 5 to speed up the process for this tutorial. Remember, that you can enable parallelization to utilize multiple CPU cores.

By setting `cat_col = "diagnosis"`

, we ensure a similar ratio of participants with and without the diagnosis in all the folds.

By setting `id_col = "participant"`

, we ensure that all the rows belonging to a participant are put in the same fold. If we do not ensure this, we could be testing on a participant we also trained on, which is cheating! :)

```
# Prepare dataset
data <- participant.scores
data$diagnosis <- factor(data$diagnosis)
# Create 5 fold columns with 3 folds each
data <- fold(
data,
k = 3,
cat_col = "diagnosis",
id_col = "participant",
num_fold_cols = 5,
parallel = FALSE # set to TRUE to run in parallel
)
# Order by participant
data <- data %>%
dplyr::arrange(participant)
# Look at the first 12 rows
# Note: kable() just formats the table
data %>% head(12) %>% kable()
```

participant | age | diagnosis | score | session | .folds_1 | .folds_2 | .folds_3 | .folds_4 | .folds_5 |
---|---|---|---|---|---|---|---|---|---|

1 | 20 | 1 | 10 | 1 | 3 | 3 | 3 | 2 | 3 |

1 | 20 | 1 | 24 | 2 | 3 | 3 | 3 | 2 | 3 |

1 | 20 | 1 | 45 | 3 | 3 | 3 | 3 | 2 | 3 |

2 | 23 | 0 | 24 | 1 | 1 | 3 | 3 | 3 | 2 |

2 | 23 | 0 | 40 | 2 | 1 | 3 | 3 | 3 | 2 |

2 | 23 | 0 | 67 | 3 | 1 | 3 | 3 | 3 | 2 |

3 | 27 | 1 | 15 | 1 | 2 | 1 | 3 | 1 | 1 |

3 | 27 | 1 | 30 | 2 | 2 | 1 | 3 | 1 | 1 |

3 | 27 | 1 | 40 | 3 | 2 | 1 | 3 | 1 | 1 |

4 | 21 | 0 | 35 | 1 | 3 | 3 | 3 | 1 | 3 |

4 | 21 | 0 | 50 | 2 | 3 | 3 | 3 | 1 | 3 |

4 | 21 | 0 | 78 | 3 | 3 | 3 | 3 | 1 | 3 |

We can check that the `cat_col`

and `id_col`

arguments did their thing:

```
# Check the distribution of 'diagnosis' in the first fold column
# Note: this would be more even for a larger dataset
data %>%
dplyr::count(.folds_1, diagnosis) %>%
kable()
```

.folds_1 | diagnosis | n |
---|---|---|

1 | 0 | 3 |

1 | 1 | 6 |

2 | 0 | 3 |

2 | 1 | 6 |

3 | 0 | 6 |

3 | 1 | 6 |

```
# Check the distribution of 'participant' in the first fold column
# Note that all rows for a participant are in the same fold
data %>%
dplyr::count(.folds_1, participant) %>%
kable()
```

.folds_1 | participant | n |
---|---|---|

1 | 2 | 3 |

1 | 5 | 3 |

1 | 7 | 3 |

2 | 3 | 3 |

2 | 6 | 3 |

2 | 10 | 3 |

3 | 1 | 3 |

3 | 4 | 3 |

3 | 8 | 3 |

3 | 9 | 3 |

Now that we have the data ready, we can try to predict the `score`

with a support-vector machine (SVM). While building the analysis, we will use the third fold in the first fold column as the test set and the rest as training set. The SVM will be fitted on the training set and used to predict the score in the test set. Finally, we will evaluate the predictions with `evaluate()`

:

```
# Split into train and test sets
test_set <- data %>%
dplyr::filter(.folds_1 == 3)
train_set <- data %>%
dplyr::filter(.folds_1 != 3)
# Fit SVM model
svm_model <- e1071::svm(
formula = score ~ diagnosis + age + session,
data = train_set,
kernel = "linear",
cost = 10,
type = "eps-regression"
)
# Predict scores in the test set
predicted_scores <- predict(
svm_model,
newdata = test_set,
allow.new.levels = TRUE)
predicted_scores
#> 1 2 3 4 5 6 7 8
#> 9.412735 27.545911 45.679086 28.305468 46.438644 64.571820 9.652563 27.785739
#> 9 10 11 12
#> 45.918915 31.423234 49.556410 67.689586
# Add predictions to test set
test_set[["predicted score"]] <- predicted_scores
# Evaluate the predictions
evaluate(
data = test_set,
target_col = "score",
prediction_cols = "predicted score",
type = "gaussian"
)
#> # A tibble: 1 x 8
#> RMSE MAE `NRMSE(IQR)` RRSE RAE RMSLE Predictions Process
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
#> 1 5.25 3.97 0.253 0.277 0.256 0.171 <tibble [12 × 2]> <prcss_n_>
```

The Mean Absolute Error (`MAE`

) metric tells us that predictions `3.97`

off on average. That’s only for one fold though, so next we will convert the model to a model function, and the `predict()`

call to a predict function, that can be used within `cross_validate_fn()`

.

A model function for `cross_validate_fn()`

has the arguments `train_data`

, `formula`

and `hyperparameters`

. We don’t need to use all of them, but those are the inputs it will receive when called inside `cross_validate_fn()`

.

To convert our model, we wrap it like this:

```
svm_model_fn <- function(train_data, formula, hyperparameters) {
e1071::svm(
formula = formula,
data = train_data,
kernel = "linear",
cost = 10,
type = "eps-regression"
)
}
```

**Note**: In R, the last thing in a function is returned. This means we don’t need to use `return()`

in the above. Feel free to add it though.

We can test the model function on the training set:

```
# Try the model function
# Due to "lazy evaluation" in R, we don't have to pass
# the arguments that are not used inside the function
m0 <- svm_model_fn(train_data = train_set,
formula = score ~ diagnosis + age + session)
m0
#>
#> Call:
#> svm(formula = formula, data = train_data, kernel = "linear", cost = 10,
#> type = "eps-regression")
#>
#>
#> Parameters:
#> SVM-Type: eps-regression
#> SVM-Kernel: linear
#> cost: 10
#> gamma: 0.25
#> epsilon: 0.1
#>
#>
#> Number of Support Vectors: 17
```

The `predict()`

function varies a bit for different models, so we need to supply a predict function that works with our model function and returns the predictions in the right format. Which format is correct depends on the kind of task we are performing. For regression (`'gaussian'`

), the predictions should be a **single vector with the predicted values**.

The predict function must have the arguments `test_data`

, `model`

, `formula`

, `hyperparameters`

, and `train_data`

. Again, we don’t need to use all of them inside the function, but they must be there.

We can convert our `predict()`

call to a predict function like so:

```
svm_predict_fn <- function(test_data, model, formula, hyperparameters, train_data) {
predict(object = model,
newdata = test_data,
allow.new.levels = TRUE)
}
# Try the predict function
svm_predict_fn(test_data = test_set, model = m0)
#> 1 2 3 4 5 6 7 8
#> 9.412735 27.545911 45.679086 28.305468 46.438644 64.571820 9.652563 27.785739
#> 9 10 11 12
#> 45.918915 31.423234 49.556410 67.689586
```

Now, we can cross-validate our model function!

We will cross-validate a couple of formulas at once. These are passed as strings and converted to formula objects internally. We also supply the dataset with the fold columns and the names of the fold columns (`".folds_1"`

, `".folds_2"`

, etc.).

```
cv_1 <- cross_validate_fn(
data = data,
formulas = c("score ~ diagnosis + age + session",
"score ~ diagnosis + age",
"score ~ diagnosis"),
type = "gaussian",
model_fn = svm_model_fn,
predict_fn = svm_predict_fn,
fold_cols = paste0(".folds_", 1:5),
parallel = FALSE # set to TRUE to run in parallel
)
#> Will cross-validate 3 models. This requires fitting 45 model instances.
#> Registered S3 method overwritten by 'broom.mixed':
#> method from
#> tidy.gamlss broom
cv_1
#> # A tibble: 3 x 17
#> Fixed RMSE MAE `NRMSE(IQR)` RRSE RAE RMSLE Predictions Results
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
#> 1 diag… 10.5 8.25 0.543 0.571 0.561 0.298 <tibble [1… <tibbl…
#> 2 diag… 17.9 15.0 0.905 0.973 1.01 0.506 <tibble [1… <tibbl…
#> 3 diag… 16.7 14.3 0.828 0.902 0.947 0.486 <tibble [1… <tibbl…
#> # … with 8 more variables: Coefficients <list>, Folds <int>, `Fold
#> # Columns` <int>, `Convergence Warnings` <int>, `Other Warnings` <int>,
#> # `Warnings and Messages` <list>, Process <list>, Dependent <chr>
```

The first formula has the lowest Root Mean Square Error (`RMSE`

). Before learning how to inspect the output, let’s enable hyperparameters, so we can try different kernels and costs.

Hyperparameters are settings for the model function that we can tweak to get better performance. The SVM model has multiple of these settings that we can play with, like the `kernel`

and `cost`

settings.

The hyperparameters passed to the model function can be indexed with `[["name"]]`

. This means we can get the kernel for the current model instance with `hyperparameters[["kernel"]]`

. In case the user forgets to pass a kernel in the hyperparameters, we need to check if it’s available though. Here’s an example:

```
svm_model_fn <- function(train_data, formula, hyperparameters) {
# Required hyperparameters:
# - kernel
# - cost
if (!"kernel" %in% names(hyperparameters))
stop("'hyperparameters' must include 'kernel'")
if (!"cost" %in% names(hyperparameters))
stop("'hyperparameters' must include 'cost'")
e1071::svm(
formula = formula,
data = train_data,
kernel = hyperparameters[["kernel"]],
cost = hyperparameters[["cost"]],
scale = FALSE,
type = "eps-regression"
)
}
# Try the model function
svm_model_fn(
train_data = train_set,
formula = score ~ diagnosis + age + session,
hyperparameters = list(
"kernel" = "linear",
"cost" = 5
)
)
#>
#> Call:
#> svm(formula = formula, data = train_data, kernel = hyperparameters[["kernel"]],
#> cost = hyperparameters[["cost"]], type = "eps-regression", scale = FALSE)
#>
#>
#> Parameters:
#> SVM-Type: eps-regression
#> SVM-Kernel: linear
#> cost: 5
#> gamma: 0.25
#> epsilon: 0.1
#>
#>
#> Number of Support Vectors: 18
```

When we have a lot of hyperparameters, we quickly get a lot of `if`

statements for performing those checks. We may also wish to provide default values when a hyperparameter is not passed by the user. For this purpose, the `update_hyperparameters()`

function is available. It checks that required hyperparameters are present and set default values for those that are *not required* and were not passed by the user.

There are three parts to the inputs to `update_hyperparameters()`

:

The

**default**hyperparameter values (*passed first*). E.g. if we wish to set the default for the`kernel`

parameter to`"radial"`

, we simply pass`kernel = "radial"`

. When the user doesn’t pass a`kernel`

setting, this default value is used.The list of hyperparameters. Remember to name the argument when passing it, i.e.:

`hyperparameters = hyperparameters`

.The names of the

**required**hyperparameters. If any of these are not in the hyperparameters, an error is thrown. Remember to name the argument when passing it, e.g.:`.required = c("cost", "scale")`

.

It returns the updated list of hyperparameters.

Let’s specify the model function such that `cost`

**must** be passed, while `kernel`

is **optional** and has the default value `"radial"`

:

```
svm_model_fn <- function(train_data, formula, hyperparameters) {
# Required hyperparameters:
# - cost
# Optional hyperparameters:
# - kernel
# 1) If 'cost' is not present in hyperparameters, throw error
# 2) If 'kernel' is not present in hyperparameters, set to "radial"
hyperparameters <- update_hyperparameters(
kernel = "radial",
hyperparameters = hyperparameters,
required = "cost"
)
e1071::svm(
formula = formula,
data = train_data,
kernel = hyperparameters[["kernel"]],
cost = hyperparameters[["cost"]],
type = "eps-regression"
)
}
```

In order to find the best combination of hyperparameters for our model, we can simply try all of them. This is called *grid search*.

We specify the different values we wish to try per hyperparameter in a list of named vectors, like so:

`cross_validate_fn()`

will then cross-validate every combination of the values.

If we want to randomly sample 4 of the combinations (e.g. to save time), we can pass `.n = 4`

in the beginning of the list:

Alternatively, we can supply a data frame with the exact combinations we want. Each column should be a hyperparameter and each row a combination of the values to cross-validate. E.g.:

```
df_hparams <- data.frame(
"kernel" = c("linear", "radial", "radial"),
"cost" = c(10, 1, 10)
)
df_hparams
#> kernel cost
#> 1 linear 10
#> 2 radial 1
#> 3 radial 10
```

We will use the `hparams`

list.

Now, we can cross-validate our hyperparameters:

```
# Set seed for the sampling of the hyperparameter combinations
set.seed(1)
cv_2 <- cross_validate_fn(
data = data,
formulas = c("score ~ diagnosis + age + session",
"score ~ diagnosis"),
type = "gaussian",
model_fn = svm_model_fn,
predict_fn = svm_predict_fn,
hyperparameters = hparams, # Pass the list of values to test
fold_cols = paste0(".folds_", 1:5)
)
#> Will cross-validate 8 models. This requires fitting 120 model instances.
cv_2
#> # A tibble: 8 x 18
#> Fixed RMSE MAE `NRMSE(IQR)` RRSE RAE RMSLE Predictions Results
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
#> 1 diag… 10.2 8.03 0.522 0.556 0.544 0.286 <tibble [1… <tibbl…
#> 2 diag… 14.5 11.9 0.738 0.789 0.803 0.407 <tibble [1… <tibbl…
#> 3 diag… 10.5 8.26 0.543 0.571 0.562 0.298 <tibble [1… <tibbl…
#> 4 diag… 14.9 12.5 0.761 0.813 0.841 0.422 <tibble [1… <tibbl…
#> 5 diag… 18.6 15.2 0.930 1.00 1.01 0.514 <tibble [1… <tibbl…
#> 6 diag… 17.6 14.8 0.880 0.948 0.982 0.494 <tibble [1… <tibbl…
#> 7 diag… 17.3 14.7 0.866 0.935 0.978 0.492 <tibble [1… <tibbl…
#> 8 diag… 17.1 14.6 0.850 0.921 0.969 0.489 <tibble [1… <tibbl…
#> # … with 9 more variables: Coefficients <list>, Folds <int>, `Fold
#> # Columns` <int>, `Convergence Warnings` <int>, `Other Warnings` <int>,
#> # `Warnings and Messages` <list>, Process <list>, HParams <list<tbl_df[,2]>>,
#> # Dependent <chr>
```

The output has a lot of information and can be a bit hard to read. The first thing, we wish to know, is which model performed the best. We will use the `RMSE`

metric to determine this. Lower is better.

We order the data frame by the `RMSE`

and use the `select_definitions()`

function to extract the formulas and hyperparameters. To recognize the model after the sorting, we create a `Model ID`

column and include it along with the `RMSE`

column:

```
cv_2 %>%
# Create Model ID with values 1:8
dplyr::mutate(`Model ID` = 1:nrow(cv_2)) %>%
# Order by RMSE
dplyr::arrange(RMSE) %>%
# Extract formulas and hyperparameters
select_definitions(additional_includes = c("RMSE", "Model ID")) %>%
# Pretty table
kable()
```

Dependent | Fixed | kernel | cost | RMSE | Model ID |
---|---|---|---|---|---|

score | diagnosis+age+session | linear | 1 | 10.19112 | 1 |

score | diagnosis+age+session | linear | 5 | 10.45016 | 3 |

score | diagnosis+age+session | radial | 5 | 14.52989 | 2 |

score | diagnosis+age+session | radial | 10 | 14.93232 | 4 |

score | diagnosis | radial | 10 | 17.08093 | 8 |

score | diagnosis | linear | 5 | 17.33319 | 7 |

score | diagnosis | radial | 5 | 17.56135 | 6 |

score | diagnosis | linear | 1 | 18.56190 | 5 |

The best model uses `kernel = "linear"`

and `cost = 1`

. Our `Model ID`

column tells us this was the first row in the output. We can use this to access the predictions, fold results, warnings, and more:

Fold Column | Fold | RMSE | MAE | NRMSE(IQR) | RRSE | RAE | RMSLE |
---|---|---|---|---|---|---|---|

.folds_1 | 1 | 13.385262 | 10.403463 | 0.4461754 | 0.7605651 | 0.7190107 | 0.3841163 |

.folds_1 | 2 | 8.541764 | 6.418853 | 0.5694509 | 0.4291060 | 0.4146149 | 0.1991230 |

.folds_1 | 3 | 5.087366 | 4.001948 | 0.2451743 | 0.2688444 | 0.2581902 | 0.1367535 |

.folds_2 | 1 | 8.541764 | 6.418853 | 0.5694509 | 0.4291060 | 0.4146149 | 0.1991230 |

.folds_2 | 2 | 11.641150 | 7.999899 | 0.5543405 | 0.6775330 | 0.5408947 | 0.3356195 |

.folds_2 | 3 | 6.257352 | 4.526802 | 0.2812293 | 0.3278149 | 0.2968395 | 0.1876330 |

.folds_3 | 1 | 13.222029 | 10.549869 | 0.5085396 | 0.6736734 | 0.6283378 | 0.3611420 |

.folds_3 | 2 | 17.846276 | 15.780440 | 1.1153922 | 1.0791058 | 1.2220035 | 0.5094823 |

.folds_3 | 3 | 6.257701 | 4.599791 | 0.2812450 | 0.3260698 | 0.3032829 | 0.1830861 |

.folds_4 | 1 | 14.771646 | 13.059488 | 1.3428769 | 0.7970060 | 0.9776511 | 0.4303192 |

.folds_4 | 2 | 8.072439 | 5.624487 | 0.3844018 | 0.4681478 | 0.3954717 | 0.2366059 |

.folds_4 | 3 | 12.301286 | 9.895444 | 0.4241823 | 0.6415395 | 0.5997239 | 0.3373387 |

.folds_5 | 1 | 4.385315 | 3.529620 | 0.3132368 | 0.2825336 | 0.2797449 | 0.1903896 |

.folds_5 | 2 | 13.385262 | 10.403463 | 0.4461754 | 0.7605651 | 0.7190107 | 0.3841163 |

.folds_5 | 3 | 9.170253 | 7.229994 | 0.3460473 | 0.4179586 | 0.3919876 | 0.2207415 |

Fold Column | Fold | Observation | Target | Prediction |
---|---|---|---|---|

.folds_1 | 1 | 4 | 24 | 36.77192 |

.folds_1 | 1 | 5 | 40 | 52.30052 |

.folds_1 | 1 | 6 | 67 | 67.82912 |

.folds_1 | 1 | 13 | 24 | 11.87638 |

.folds_1 | 1 | 14 | 54 | 27.40497 |

.folds_1 | 1 | 15 | 62 | 42.93357 |

.folds_1 | 1 | 19 | 11 | 10.58407 |

.folds_1 | 1 | 20 | 35 | 26.11267 |

.folds_1 | 1 | 21 | 41 | 41.64127 |

.folds_1 | 2 | 7 | 15 | 14.94300 |

In a moment, we will go through a set of classification examples. First, we will learn to preprocess the dataset inside `cross_validate_fn()`

.

If we wish to preprocess the data, e.g. standardizing the numeric columns, we can do so **within** `cross_validate_fn()`

. The point is to extract the preprocessing parameters (`mean`

, `sd`

, `min`

, `max`

, etc.) from the training data and apply the transformations to both the training data and test data.

`cvms`

has built-in examples of preprocessing functions (see `?preprocess_functions()`

). They use the `recipes`

package and are good starting points for writing your own preprocess function.

A preprocess function should have these arguments: `train_data`

, `test_data`

, `formula`

, and `hyperparameters`

. Again, we can choose to only use some of them.

It should return a named list with the preprocessed training data (`"train"`

) and test data (`"test"`

). We can also include a data frame with the preprocessing parameters we used (`"parameters"`

), so we can extract those later from the `cross_validate_fn()`

output.

The form should be like this:

```
# NOTE: Don't run this
preprocess_fn <- function(train_data, test_data, formula, hyperparameters) {
# Do preprocessing
# Create data frame with applied preprocessing parameters
# Return list with these names
list("train" = train_data,
"test" = test_data,
"parameters" = preprocess_parameters)
}
```

Our preprocess function will standardize the `age`

column:

```
preprocess_fn <- function(train_data, test_data, formula, hyperparameters) {
# Standardize the age column
# Get the mean and standard deviation from the train_data
mean_age <- mean(train_data[["age"]])
sd_age <- sd(train_data[["age"]])
# Standardize both train_data and test_data
train_data[["age"]] <- (train_data[["age"]] - mean_age) / sd_age
test_data[["age"]] <- (test_data[["age"]] - mean_age) / sd_age
# Create data frame with applied preprocessing parameters
preprocess_parameters <- data.frame(
"Measure" = c("Mean", "SD"),
"age" = c(mean_age, sd_age)
)
# Return list with these names
list("train" = train_data,
"test" = test_data,
"parameters" = preprocess_parameters)
}
# Try the preprocess function
prepped <- preprocess_fn(train_data = train_set, test_data = test_set)
# Inspect preprocessed training set
# Note that the age column has changed
prepped$train %>% head(5) %>% kable()
```

participant | age | diagnosis | score | session | .folds_1 | .folds_2 | .folds_3 | .folds_4 | .folds_5 |
---|---|---|---|---|---|---|---|---|---|

2 | -1.3215082 | 0 | 24 | 1 | 1 | 3 | 3 | 3 | 2 |

2 | -1.3215082 | 0 | 40 | 2 | 1 | 3 | 3 | 3 | 2 |

2 | -1.3215082 | 0 | 67 | 3 | 1 | 3 | 3 | 3 | 2 |

3 | -0.6871843 | 1 | 15 | 1 | 2 | 1 | 3 | 1 | 1 |

3 | -0.6871843 | 1 | 30 | 2 | 2 | 1 | 3 | 1 | 1 |

Measure | age |
---|---|

Mean | 31.333333 |

SD | 6.305926 |

Now, we add the preprocess function to our `cross_validate_fn()`

call. We will only use the winning hyperparameters from the previous cross-validation, to save time:

```
cv_3 <- cross_validate_fn(
data = data,
formulas = c("score ~ diagnosis + age + session",
"score ~ diagnosis"),
type = "gaussian",
model_fn = svm_model_fn,
predict_fn = svm_predict_fn,
preprocess_fn = preprocess_fn,
hyperparameters = list(
"kernel" = "linear",
"cost" = 1
),
fold_cols = paste0(".folds_", 1:5)
)
#> Will cross-validate 2 models. This requires fitting 30 model instances.
cv_3
#> # A tibble: 2 x 19
#> Fixed RMSE MAE `NRMSE(IQR)` RRSE RAE RMSLE Predictions Results
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
#> 1 diag… 10.2 8.03 0.522 0.556 0.544 0.286 <tibble [1… <tibbl…
#> 2 diag… 18.6 15.2 0.930 1.00 1.01 0.514 <tibble [1… <tibbl…
#> # … with 10 more variables: Coefficients <list>, Preprocess <list>,
#> # Folds <int>, `Fold Columns` <int>, `Convergence Warnings` <int>, `Other
#> # Warnings` <int>, `Warnings and Messages` <list>, Process <list>,
#> # HParams <list<tbl_df[,2]>>, Dependent <chr>
```

This didn’t change the results but may do so in other contexts and for other model types.

We can check the preprocessing parameters for the different folds:

```
# Extract first 10 rows of the preprocess parameters
# for the first and best model
cv_3$Preprocess[[1]] %>% head(10) %>% kable()
```

Fold Column | Fold | Measure | age |
---|---|---|---|

.folds_1 | 1 | Mean | 26.571429 |

.folds_1 | 1 | SD | 5.608667 |

.folds_1 | 2 | Mean | 27.714286 |

.folds_1 | 2 | SD | 8.337523 |

.folds_1 | 3 | Mean | 31.333333 |

.folds_1 | 3 | SD | 6.305926 |

.folds_2 | 1 | Mean | 27.714286 |

.folds_2 | 1 | SD | 8.337523 |

.folds_2 | 2 | Mean | 25.000000 |

.folds_2 | 2 | SD | 4.743417 |

As mentioned, `cvms`

has a couple of preprocess functions available. Here’s the code for the standardizer. If you haven’t used the `recipes`

package before, it might not be that easy to read, but it basically does the same as ours, just to every numeric predictor. If we were to use it, we would need to make sure that the `participant`

, `diagnosis`

, and (perhaps) `session`

columns were factors, as they would otherwise be standardized as well.

```
# Get built-in preprocess function
preprocess_functions("standardize")
#> function (train_data, test_data, formula, hyperparameters)
#> {
#> formula <- simplify_formula(formula, train_data)
#> recipe_object <- recipes::recipe(formula = formula, data = train_data) %>%
#> recipes::step_center(recipes::all_numeric(), -recipes::all_outcomes()) %>%
#> recipes::step_scale(recipes::all_numeric(), -recipes::all_outcomes()) %>%
#> recipes::prep(training = train_data)
#> train_data <- recipes::bake(recipe_object, train_data)
#> test_data <- recipes::bake(recipe_object, test_data)
#> means <- recipe_object$steps[[1]]$means
#> sds <- recipe_object$steps[[2]]$sds
#> tidy_parameters <- tibble::tibble(Measure = c("Mean", "SD")) %>%
#> dplyr::bind_cols(dplyr::bind_rows(means, sds))
#> list(train = train_data, test = test_data, parameters = tidy_parameters)
#> }
#> <bytecode: 0x7fdbd064fd70>
#> <environment: 0x7fdbd065c728>
```

Given that the preprocess function also receives the hyperparameters, we *could* write a preprocess function that gets the names of the columns to preprocess via the hyperparameters.

Next, we will go through an example of cross-validating a custom binary classifier.

For the binomial example, we will be predicting the `diagnosis`

column with an SVM. For this, we need to tweak our functions a bit. First, we set `type = "C-classification"`

and `probability = TRUE`

in the `svm()`

call. The first setting makes it perform classification instead of regression. The second allows us to extract the probabilities in our predict function.

This model function also works for multiclass classification, why we will reuse it for that in a moment.

```
# SVM model function for classification
clf_svm_model_fn <- function(train_data, formula, hyperparameters) {
# Optional hyperparameters:
# - kernel
# - cost
# Update missing hyperparameters with default values
hyperparameters <- update_hyperparameters(
kernel = "radial",
cost = 1,
hyperparameters = hyperparameters
)
e1071::svm(
formula = formula,
data = train_data,
kernel = hyperparameters[["kernel"]],
cost = hyperparameters[["cost"]],
type = "C-classification",
probability = TRUE # Must enable probability here
)
}
# Try the model function
m1 <- clf_svm_model_fn(train_data = data, formula = diagnosis ~ score,
hyperparameters = list("kernel" = "linear"))
m1
#>
#> Call:
#> svm(formula = formula, data = train_data, kernel = hyperparameters[["kernel"]],
#> cost = hyperparameters[["cost"]], type = "C-classification",
#> probability = TRUE)
#>
#>
#> Parameters:
#> SVM-Type: C-classification
#> SVM-Kernel: linear
#> cost: 1
#>
#> Number of Support Vectors: 20
```

The predict function should return **the probability of the second class** (alphabetically). For the SVM, this is a bit tricky, but we will break it down in steps:

We set

`probability = TRUE`

in the`predict()`

call. This stores the probabilities as an*attribute*of the predictions. Note that this won’t work if we forget to enable the probabilities in the model function!We

**extract the probabilities**with`attr(predictions, "probabilities")`

.We

**convert the probabilities to a tibble**(a kind of data frame) with`dplyr::as_tibble()`

.At this point, we have a tibble with two columns with the probabilities for each of the two classes. As we need the probability of the second class, we

**select and return the second column**(probability of`diagnosis`

being`1`

).

In most cases, the predict function will be simpler to write than this. The main take-away is that we predict the test set and extract the probabilities of the second class.

```
# Predict function for binomial SVM
bnml_svm_predict_fn <- function(test_data, model, formula, hyperparameters, train_data) {
# Predict test set
predictions <- predict(
object = model,
newdata = test_data,
allow.new.levels = TRUE,
probability = TRUE
)
# Extract probabilities
probabilities <- dplyr::as_tibble(attr(predictions, "probabilities"))
# Return second column
probabilities[[2]]
}
p1 <- bnml_svm_predict_fn(test_data = data, model = m1)
p1 # Vector with probabilities that diagnosis is 1
#> [1] 0.1120966 0.2131934 0.4600344 0.2131934 0.3934195 0.7388358 0.1422477
#> [8] 0.2731945 0.3934195 0.3305457 0.5281085 0.8375338 0.2131934 0.5819439
#> [15] 0.6829082 0.1357207 0.2224870 0.2731945 0.1176421 0.3305457 0.4065106
#> [22] 0.1490345 0.2953850 0.4465165 0.3068642 0.5686161 0.7281734 0.2624979
#> [29] 0.5951531 0.8585921
```

Now, we can cross-validate the model function:

```
cv_4 <- cross_validate_fn(
data = data,
formulas = c("diagnosis ~ score",
"diagnosis ~ age"),
type = "binomial",
model_fn = clf_svm_model_fn,
predict_fn = bnml_svm_predict_fn,
hyperparameters = list(
"kernel" = c("linear", "radial"),
"cost" = c(1, 5, 10)
),
fold_cols = paste0(".folds_", 1:5)
)
#> Will cross-validate 12 models. This requires fitting 180 model instances.
cv_4
#> # A tibble: 12 x 28
#> Fixed `Balanced Accur… F1 Sensitivity Specificity `Pos Pred Value`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 score 0.394 0.364 0.322 0.467 0.453
#> 2 score 0.397 0.293 0.244 0.55 0.41
#> 3 score 0.403 0.367 0.322 0.483 0.460
#> 4 score 0.456 0.407 0.378 0.533 0.491
#> 5 score 0.403 0.369 0.322 0.483 0.453
#> 6 score 0.456 0.368 0.311 0.6 0.516
#> 7 age 0.617 0.701 0.733 0.5 0.693
#> 8 age 0.5 NaN 0.3 0.7 NaN
#> 9 age 0.533 NaN 0.367 0.7 NaN
#> 10 age 0.542 NaN 0.333 0.75 0.617
#> 11 age 0.550 0.536 0.5 0.6 0.653
#> 12 age 0.608 0.545 0.467 0.75 0.777
#> # … with 22 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower
#> # CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection
#> # Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>,
#> # Predictions <list>, ROC <list>, `Confusion Matrix` <list>, Results <list>,
#> # Coefficients <list>, Folds <int>, `Fold Columns` <int>, `Convergence
#> # Warnings` <int>, `Other Warnings` <int>, `Warnings and Messages` <list>,
#> # Process <list>, HParams <list<tbl_df[,2]>>, Dependent <chr>
```

Let’s order the models by the `Balanced Accuracy`

metric (in descending order) and extract the formulas and hyperparameters:

```
cv_4 %>%
dplyr::mutate(`Model ID` = 1:nrow(cv_4)) %>%
dplyr::arrange(dplyr::desc(`Balanced Accuracy`)) %>%
select_definitions(additional_includes = c("Balanced Accuracy", "F1", "MCC", "Model ID")) %>%
kable()
```

Dependent | Fixed | kernel | cost | Balanced Accuracy | F1 | MCC | Model ID |
---|---|---|---|---|---|---|---|

diagnosis | age | linear | 1 | 0.6166667 | 0.7005128 | 0.2695860 | 7 |

diagnosis | age | radial | 10 | 0.6083333 | 0.5448196 | 0.2372335 | 12 |

diagnosis | age | linear | 10 | 0.5500000 | 0.5357576 | 0.1351019 | 11 |

diagnosis | age | radial | 5 | 0.5416667 | NaN | 0.0739342 | 10 |

diagnosis | age | linear | 5 | 0.5333333 | NaN | 0.0735712 | 9 |

diagnosis | age | radial | 1 | 0.5000000 | NaN | -0.0034522 | 8 |

diagnosis | score | radial | 5 | 0.4555556 | 0.4073420 | -0.1038522 | 4 |

diagnosis | score | radial | 10 | 0.4555556 | 0.3683114 | -0.1014684 | 6 |

diagnosis | score | linear | 5 | 0.4027778 | 0.3667013 | -0.2078914 | 3 |

diagnosis | score | linear | 10 | 0.4027778 | 0.3686869 | -0.2073524 | 5 |

diagnosis | score | radial | 1 | 0.3972222 | 0.2928246 | -0.2315451 | 2 |

diagnosis | score | linear | 1 | 0.3944444 | 0.3638492 | -0.2243185 | 1 |

Next, we will go through a short multiclass classification example.

For our multiclass classification example, we will use the `musicians`

dataset from `cvms`

. It has 60 musicians, grouped in four classes (`A`

, `B`

, `C`

, `D`

). The origins of the dataset is classified, so don’t ask too many questions about it!

Let’s create **5 fold columns** with **5 folds** each. We set `cat_col = "Class"`

to ensure a similar ratio of the classes in all the folds and `num_col = "Age"`

to get a similar *average age* in the folds. The latter is not required but could be useful if we had an important hypothesis regarding age.

We will not need the `id_col`

as there is only one row per musician `ID`

.

```
# Set seed for reproducibility
set.seed(1)
# Prepare dataset
data_mc <- musicians
data_mc[["ID"]] <- as.factor(data_mc[["ID"]])
# Create 5 fold columns with 5 folds each
data_mc <- fold(
data = data_mc,
k = 5,
cat_col = "Class",
num_col = "Age",
num_fold_cols = 5
)
data_mc %>% head(10) %>% kable()
```

ID | Age | Class | Height | Drums | Bass | Guitar | Keys | Vocals | .folds_1 | .folds_2 | .folds_3 | .folds_4 | .folds_5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 41 | A | 178 | 1 | 0 | 1 | 0 | 1 | 3 | 2 | 5 | 2 | 5 |

2 | 62 | A | 168 | 1 | 1 | 1 | 1 | 0 | 4 | 4 | 1 | 5 | 1 |

3 | 19 | A | 163 | 1 | 0 | 0 | 1 | 1 | 1 | 3 | 2 | 1 | 2 |

4 | 52 | A | 177 | 1 | 0 | 0 | 0 | 1 | 1 | 4 | 2 | 1 | 2 |

5 | 32 | A | 155 | 1 | 1 | 0 | 0 | 1 | 3 | 3 | 4 | 4 | 4 |

6 | 44 | A | 164 | 0 | 0 | 1 | 1 | 1 | 2 | 1 | 5 | 4 | 5 |

7 | 51 | A | 173 | 1 | 1 | 1 | 1 | 0 | 4 | 3 | 4 | 3 | 4 |

8 | 42 | A | 176 | 1 | 0 | 1 | 1 | 0 | 3 | 2 | 3 | 2 | 3 |

9 | 25 | A | 171 | 0 | 1 | 0 | 1 | 0 | 5 | 1 | 1 | 3 | 3 |

10 | 60 | A | 167 | 1 | 0 | 0 | 1 | 0 | 5 | 1 | 1 | 3 | 3 |

```
# You can use skimr to get a better overview of the dataset
# Uncomment:
# library(skimr)
# skimr::skim(data_mc)
```

As the model function from the binomial example also works with more than 2 classes, we only need to change the predict function. In `multinomial`

classification, it should return a data frame with **one column per class** with the probabilities of that class. Hence, we copy the predict function from before and remove the `[[2]]`

from the last line:

```
# Predict function for multinomial SVM
mc_svm_predict_fn <- function(test_data, model, formula, hyperparameters, train_data) {
predictions <- predict(
object = model,
newdata = test_data,
allow.new.levels = TRUE,
probability = TRUE
)
# Extract probabilities
probabilities <- dplyr::as_tibble(attr(predictions, "probabilities"))
# Return all columns
probabilities
}
```

With this, we can cross-validate a few formulas for predicting the `Class`

. Remember, that it’s possible to run this in parallel!

```
cv_5 <- cross_validate_fn(
data = data_mc,
formulas = c("Class ~ Age + Height",
"Class ~ Age + Height + Bass + Guitar + Keys + Vocals"),
type = "multinomial",
model_fn = clf_svm_model_fn,
predict_fn = mc_svm_predict_fn,
hyperparameters = list(
"kernel" = c("linear", "radial"),
"cost" = c(1, 5, 10)
),
fold_cols = paste0(".folds_", 1:5)
)
#> Will cross-validate 12 models. This requires fitting 300 model instances.
cv_5
#> # A tibble: 12 x 26
#> Fixed `Overall Accura… `Balanced Accur… F1 Sensitivity Specificity
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Age+… 0.280 0.52 0.271 0.280 0.76
#> 2 Age+… 0.257 0.504 0.251 0.257 0.752
#> 3 Age+… 0.293 0.529 0.280 0.293 0.764
#> 4 Age+… 0.23 0.487 0.225 0.23 0.743
#> 5 Age+… 0.303 0.536 0.294 0.303 0.768
#> 6 Age+… 0.24 0.493 0.238 0.24 0.747
#> 7 Age+… 0.403 0.602 0.393 0.403 0.801
#> 8 Age+… 0.367 0.578 NaN 0.367 0.789
#> 9 Age+… 0.4 0.6 0.389 0.4 0.8
#> 10 Age+… 0.38 0.587 0.364 0.38 0.793
#> 11 Age+… 0.423 0.616 0.410 0.423 0.808
#> 12 Age+… 0.363 0.576 0.348 0.363 0.788
#> # … with 20 more variables: `Pos Pred Value` <dbl>, `Neg Pred Value` <dbl>,
#> # Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>, `Detection
#> # Prevalence` <dbl>, Prevalence <dbl>, Predictions <list>, `Confusion
#> # Matrix` <list>, Results <list>, `Class Level Results` <list>,
#> # Coefficients <list>, Folds <int>, `Fold Columns` <int>, `Convergence
#> # Warnings` <int>, `Other Warnings` <int>, `Warnings and Messages` <list>,
#> # Process <list>, HParams <list<tbl_df[,2]>>, Dependent <chr>
```

Let’s order the results by the `Balanced Accuracy`

metric and extract the formulas and hyperparameters:

```
cv_5 %>%
dplyr::mutate(`Model ID` = 1:nrow(cv_5)) %>%
dplyr::arrange(dplyr::desc(`Balanced Accuracy`)) %>%
select_definitions(additional_includes = c(
"Balanced Accuracy", "F1", "Model ID")) %>%
kable()
```

Dependent | Fixed | kernel | cost | Balanced Accuracy | F1 | Model ID |
---|---|---|---|---|---|---|

Class | Age+Height+Bass+Guitar+Keys+Vocals | linear | 10 | 0.6155556 | 0.4100794 | 11 |

Class | Age+Height+Bass+Guitar+Keys+Vocals | linear | 1 | 0.6022222 | 0.3925407 | 7 |

Class | Age+Height+Bass+Guitar+Keys+Vocals | linear | 5 | 0.6000000 | 0.3886541 | 9 |

Class | Age+Height+Bass+Guitar+Keys+Vocals | radial | 5 | 0.5866667 | 0.3642633 | 10 |

Class | Age+Height+Bass+Guitar+Keys+Vocals | radial | 1 | 0.5777778 | NaN | 8 |

Class | Age+Height+Bass+Guitar+Keys+Vocals | radial | 10 | 0.5755556 | 0.3479595 | 12 |

Class | Age+Height | linear | 10 | 0.5355556 | 0.2940289 | 5 |

Class | Age+Height | linear | 5 | 0.5288889 | 0.2799472 | 3 |

Class | Age+Height | linear | 1 | 0.5200000 | 0.2706326 | 1 |

Class | Age+Height | radial | 1 | 0.5044444 | 0.2508571 | 2 |

Class | Age+Height | radial | 10 | 0.4933333 | 0.2383805 | 6 |

Class | Age+Height | radial | 5 | 0.4866667 | 0.2245827 | 4 |

In `multinomial`

evaluation, we perform **one-vs-all** evaluations and average them (macro metrics). These evaluations are stored in the output as `Class Level Results`

:

```
# Extract Class Level Results for the best model
cv_5$`Class Level Results`[[11]]
#> # A tibble: 4 x 14
#> Class `Balanced Accur… F1 Sensitivity Specificity `Pos Pred Value`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0.664 0.493 0.493 0.836 0.525
#> 2 B 0.62 0.413 0.4 0.84 0.471
#> 3 C 0.662 0.494 0.587 0.738 0.428
#> 4 D 0.516 0.240 0.213 0.818 0.306
#> # … with 8 more variables: `Neg Pred Value` <dbl>, Kappa <dbl>, `Detection
#> # Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>, Support <int>,
#> # Results <list>, `Confusion Matrix` <list>
```

We also have the fold results:

```
# Extract fold results for the best model
cv_5$Results[[11]]
#> # A tibble: 5 x 13
#> `Fold Column` `Overall Accura… `Balanced Accur… F1 Sensitivity Specificity
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 .folds_1 0.4 0.6 0.378 0.4 0.8
#> 2 .folds_2 0.383 0.589 0.377 0.383 0.794
#> 3 .folds_3 0.467 0.644 0.448 0.467 0.822
#> 4 .folds_4 0.383 0.589 0.378 0.383 0.794
#> 5 .folds_5 0.483 0.656 0.470 0.483 0.828
#> # … with 7 more variables: `Pos Pred Value` <dbl>, `Neg Pred Value` <dbl>,
#> # Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>, `Detection
#> # Prevalence` <dbl>, Prevalence <dbl>
```

And a set of multiclass confusion matrices (one per fold column):

```
# Extract multiclass confusion matrices for the best model
# One per fold column
cv_5$`Confusion Matrix`[[11]]
#> # A tibble: 80 x 4
#> `Fold Column` Prediction Target N
#> <chr> <chr> <chr> <int>
#> 1 .folds_1 A A 9
#> 2 .folds_1 B A 4
#> 3 .folds_1 C A 1
#> 4 .folds_1 D A 1
#> 5 .folds_1 A B 4
#> 6 .folds_1 B B 6
#> 7 .folds_1 C B 3
#> 8 .folds_1 D B 2
#> 9 .folds_1 A C 2
#> 10 .folds_1 B C 0
#> # … with 70 more rows
```

We can add these together (or average them) and plot the result:

```
# Sum the fold column confusion matrices
# to one overall confusion matrix
overall_confusion_matrix <- cv_5$`Confusion Matrix`[[11]] %>%
dplyr::group_by(Prediction, Target) %>%
dplyr::summarise(N = sum(N))
#> `summarise()` regrouping output by 'Prediction' (override with `.groups` argument)
overall_confusion_matrix %>% kable()
```

Prediction | Target | N |
---|---|---|

A | A | 37 |

A | B | 15 |

A | C | 9 |

A | D | 13 |

B | A | 15 |

B | B | 30 |

B | C | 7 |

B | D | 14 |

C | A | 15 |

C | B | 12 |

C | C | 44 |

C | D | 32 |

D | A | 8 |

D | B | 18 |

D | C | 15 |

D | D | 16 |

This concludes the vignette. If elements are unclear or you need help to convert your model, you can leave feedback in a mail or in a GitHub issue :-)