This short vignette is meant to introduce users to the MIAmaxent package by providing a worked example of a distribution modeling exercise. It shows how to use all of the main functions included in MIAmaxent in the order of a typical analysis. This vignette does NOT describe the theoretical underpinnings of the package. To learn more about the theory behind MIAmaxent, the user is referred to Halvorsen (2013) and Halvorsen et al. (2015), as well as other references listed in the documentation of the package.

Help pages for the package and for individual functions in the package can be accessed in R using the help command: ?'MIAmaxent' (after attaching the package using library(MIAmaxent)).


Data set

The data used for demonstration in this vignette are a set of data that have been used to model the distribution of semi-natural grasslands in Østfold County, in southeastern Norway. The data set consists of 1059 locations where presence of semi-natural grasslands has been recorded, 13 environmental variables covering the extent of the study area, and 122 locations where the presence or absence of semi-natural grasslands has been recorded, independently of the presence-only records. The extent of the study area is about 4000 square kilometers, and grain of the raster data is 500 meters (0.25 km2).

The data used in this vignette are included in the package as an example data set, so the code and results shown here can be directly replicated.

Before beginning the modeling exercise, it may be useful to see what the data look like in their geographical representation. We can use the raster package to plot the 1059 recorded presences on top of one of the environmental variable rasters:

library(raster)
#> Loading required package: sp
EV1 <- raster(list.files(system.file("extdata", "EV_continuous", package = "MIAmaxent"), full.names = TRUE)[1])
PO <- read.csv(system.file("extdata", "occurrence_PO.csv", package = "MIAmaxent"))
par(mar=c(3.1,3.1,2.1,0.1))
plot(EV1, colNA = 'black', legend=FALSE)
points(PO$POINT_X, PO$POINT_Y, pch = 20, cex = 0.5, col = 'blue')
par(mar=c(5.1,4.1,4.1,2.1))
Presence-only occurrence across the study area

Presence-only occurrence across the study area

readData(…)

The starting point for modeling using MIAmaxent is a simple data object that contains occurrence data for the modeled target as well as all of the environmental data. The format of this data object is a data frame with the response variable of the model (occurrence) in the first column, and explanatory variables of the model (environmental variables) in subsequent columns. In generative Maxent modeling, the response variable (RV) is binary, and can be either presence or unknown background. These values are coded as “1” and “NA” respectively in the data object. Explanatory variables (EVs) may be continuous or categorical, and these types are denoted by numeric class and factor class respectively.

The readData(...) function transforms data in CSV and ASCII raster file formats into a single data frame which serves as the starting point for modeling.

Users of the maxent.jar program are usually accustomed to having their training data in an a different format. Specifically, occurrence data is often in CSV file format, with presences records followed by coordinates, and environmental data in ASCII raster file format. The readData function makes it easy to convert these data formats into the data object that is used in MIAmaxent. This function extracts values of the environmental variables at presence locations and at a set of randomly selected background locations, and properly formats these into the starting point for modeling. Alternatively, the user can also specify a custom set of background locations by giving these in the CSV file.

We begin by creating our data object from file. Note that continuous and categorical environmental variables should be placed in separate directories:

library(MIAmaxent)
grasslandPO <- readData(occurrence = system.file("extdata", "occurrence_PO.csv", package = "MIAmaxent"), 
                  contEV = system.file("extdata", "EV_continuous", package = "MIAmaxent"),
                  catEV = system.file("extdata", "EV_categorical", package = "MIAmaxent"),
                  maxbkg = 20000)

All functions in MIAmaxent produce console output. Therefore it’s handy to always assign function output to an object, so that you can manipulate that object further.

If we look at the resulting data object we see the response variable (with 1059 presence and 16420 background points) along with 8 continuous and 5 categorical explanatory variables:

str(grasslandPO)
#> 'data.frame':    17479 obs. of  14 variables:
#>  $ RV        : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ pca1      : num  0.000504 0.000506 0.000554 0.00051 0.000486 ...
#>  $ pr-bygall : num  0.00772 0.00134 0.003 0.01049 0.00092 ...
#>  $ pr-tilany : num  0 0 0.3898 0.0884 0.1316 ...
#>  $ teraspif  : num  180 95 20 111 116 61 24 34 35 35 ...
#>  $ terdem    : num  0 0 0 12 0 0 0 11 26 39 ...
#>  $ terslpdg  : num  0 0.827 1.433 1.175 0.813 ...
#>  $ tersolrade: num  977 966 1018 956 960 ...
#>  $ tertpi09  : num  -3.54 -7.51 -6.11 -2.38 -15.16 ...
#>  $ geoberg   : Factor w/ 16 levels "0","2","3","4",..: 10 10 9 10 10 10 10 10 10 10 ...
#>  $ geolmja1  : Factor w/ 15 levels "11","12","15",..: 8 7 15 7 8 7 8 7 7 7 ...
#>  $ lcucor1   : Factor w/ 21 levels "111","112","121",..: 21 21 21 8 21 8 21 8 12 2 ...
#>  $ lcutil-t4 : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 2 1 ...
#>  $ terslpps15: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 6 1 1 1 1 6 6 ...
sum(grasslandPO$RV == 1, na.rm = TRUE)
#> [1] 1059
sum(is.na(grasslandPO$RV))
#> [1] 16420

IMPORTANT: Important preparations for distribution modeling, such as accounting for sampling bias and setting study area extent, are not dealt with in MIAmaxent, and must be dealt with beforehand. Good modeling practice requires that these issues be attended to!


Occurrence-environment relationships

By its simplest definition, a distribution model examines the relationship between the modeled target and its environment. In this way, distribution modeling follows the long tradition of gradient analysis in ecology (Halvorsen, 2012). Therefore, before building an actual model, we should have some idea about what influence the environmental variables have on the occurrence of the target.

plotFOP(…)

We can use the plotFOP function to create a so-called Frequency of Observed Presence (FOP) plot. An FOP plot shows how commonly the target occurs across the range of the explanatory variable, and makes it possible to recognize patterns in frequency of occurrence. Most often, the relationship between an environmental variable and modeled target is linear or unimodal, but the pattern seen in the FOP plot depends on the range of the EV, which in turn depends on the extent of the study area.

Here we examine FOP plots for 2 of the continuous explanatory variables:

teraspifFOP <- plotFOP(grasslandPO, "teraspif")

terslpdgFOP <- plotFOP(grasslandPO, "terslpdg")

We can change the appearance of the plot with additional arguments, or access the plot data directly:

terslpdgFOP <- plotFOP(grasslandPO, "terslpdg", intervals = 25,  pch=20, cex=1.1, col = "red")

terslpdgFOP$FOPdata
#>              int    n     intEV      intRV   smoothRV
#> 1  [0.000,0.380) 1829 0.1974441 0.06232914         NA
#> 2  [0.380,0.761) 2844 0.5750047 0.06645570         NA
#> 3  [0.761,1.141) 3047 0.9473344 0.06005907 0.06181457
#> 4  [1.141,1.522) 2542 1.3258624 0.06648308 0.06107528
#> 5  [1.522,1.902) 1981 1.7033925 0.04896517 0.05372778
#> 6  [1.902,2.283) 1517 2.0811801 0.05141727 0.05284629
#> 7  [2.283,2.663) 1144 2.4662042 0.05419580 0.05489220
#> 8  [2.663,3.043)  812 2.8471670 0.05665025 0.05930401
#> 9  [3.043,3.424)  538 3.2173911 0.07063197 0.06589245
#> 10 [3.424,3.804)  372 3.6004458 0.06989247 0.06610008
#> 11 [3.804,4.185)  264 3.9885431 0.05681818 0.06119433
#> 12 [4.185,4.565)  182 4.3489249 0.06043956 0.06324177
#> 13 [4.565,4.946)  128 4.7418206 0.06250000 0.06742092
#> 14 [4.946,5.326)   93 5.1260939 0.09677419 0.08129914
#> 15 [5.326,5.706)   56 5.5293487 0.05357143 0.07129972
#> 16 [5.706,6.087)   35 5.9119111 0.11428571 0.08431723
#> 17 [6.087,6.467)   40 6.2669473 0.02500000 0.07247652
#> 18 [6.467,6.848)   21 6.6230410 0.09523810 0.10865229
#> 19 [6.848,7.228)   15 7.0089600 0.26666667 0.15205428
#> 20 [7.228,7.609)   10 7.4276010 0.00000000 0.05531730
#> 21 [7.609,7.989)    4 7.7217125 0.00000000 0.01798688
#> 22 [7.989,8.369)    3 8.1058232 0.00000000 0.00000000
#> 23 [8.750,9.130)    1 9.0068598 0.00000000         NA
#> 24 [9.130,9.511]    1 9.5107498 0.00000000         NA

Now we examine FOP plots for 1 of the categorical explanatory variables:

geobergFOP <- plotFOP(grasslandPO, 10)

We see that geoberg type 4 has the highest rate of observed presence, followed by type 2, and then types 3 and 28. If we look more closely at the data however, we notice also that geoberg type 4 is sampled very rarely, with only 6 locations falling into that category:

geobergFOP
#> $EVoptimum
#> [1] 4
#> Levels: 0 2 3 4 5 21 22 26 27 28 29 35 40 62 82 85
#> 
#> $FOPdata
#>       n level    levelRV
#> 1     5     0 0.00000000
#> 2    20     2 0.45000000
#> 3    74     3 0.28378378
#> 4     6     4 0.50000000
#> 5   456     5 0.10526316
#> 6  3448    21 0.05974478
#> 7    46    22 0.06521739
#> 8    14    26 0.00000000
#> 9    21    27 0.09523810
#> 10   58    28 0.27586207
#> 11    6    29 0.00000000
#> 12  369    35 0.11382114
#> 13    1    40 0.00000000
#> 14 8183    62 0.05230356
#> 15 4445    82 0.05939258
#> 16  327    85 0.05198777

It is recommended that FOP plots for all explanatory variables be examined before building a model.

Looking at FOP plots can help the modeler decide which explanatory variables are likely to have greatest explanatory power, and gives an idea of the strength and shape of the relationships between the explanatory and response variables.


Transforming explanatory variables

To fit the many different kinds of relationships between explanatory and response variables, we need to transform the explanatory variables. This means that we create new “derived” variables (DVs) from the original explanatory variables. Another way of thinking about this is to put it in terms of rescaling the explanatory variable; we adjust the scale of the explanatory variable in many different ways in order to check which scaling is most ecologically relevant to the occurrence of the modeled target.

deriveVars(…)

The deriveVars function produces derived variables from explanatory variables by 7 different transformation types: linear, monotonous, deviation, forward hinge, reverse hinge, threshold, and binary. The first 6 of these are relevant for continuous variables and the binary transformation is relevant only for categorical variables. Different types of transformations can be turned on or off in order to balance model complexity with model fit.

For the spline-type transformations (forward hinge, reverse hinge, threshold) an endless number of different transformations are possible, so the function produces 20 of each, and then chooses those which explain the most variation in the response variable. This means that 20 models are built and evaluated for each combination of explanatory variable and spline transformation. Therefore, running deriveVars with these transformation types turned on may take some time.

Here we produce all types of derived variables from our explanatory variables:

grasslandDVs <- deriveVars(grasslandPO, transformtype = c("L", "M", "D", "HF", "HR", "T", "B"))

Use the dir argument to specify the directory to which files will be written in the deriveVars, selectDVforEV, selectEV, and plotResp functions.

The console output of deriveVars (which is saved above as grasslandDVs) consists of 2 parts:

  • data frames of DVs for each EV (named “EVDV”)
  • the transformation function used to produce each DV (named “transformations”) In our grasslands analysis, the contents of these two items look like this:
summary(grasslandDVs$EVDV) # alternatively: summary(grasslandDVs[[1]])
#>            Length Class      Mode
#> pca1       13     data.frame list
#> pr-bygall   5     data.frame list
#> pr-tilany  12     data.frame list
#> teraspif   11     data.frame list
#> terdem      8     data.frame list
#> terslpdg    8     data.frame list
#> tersolrade  6     data.frame list
#> tertpi09    8     data.frame list
#> geoberg    16     data.frame list
#> geolmja1   15     data.frame list
#> lcucor1    21     data.frame list
#> lcutil-t4   2     data.frame list
#> terslpps15  6     data.frame list
head(summary(grasslandDVs$transformations)) # alternatively: head(summary(grasslandDVs[[2]]))
#>                  Length Class  Mode    
#> pca1_L_transf    1      -none- function
#> pca1_M_transf    1      -none- function
#> pca1_D0.5_transf 1      -none- function
#> pca1_D1_transf   1      -none- function
#> pca1_D2_transf   1      -none- function
#> pca1_HF5_transf  1      -none- function
length(grasslandDVs$transformations)
#> [1] 131

Note that the names of derived variables indicate the type of transformation was used to create them. For example, “terslpdg_D2” is a deviation-type transformation of terslpdg, where the slope of the deviation is controlled by a parameter value of 2. Meanwhile, “terslpdg_HR4” is the a reverse hinge transformation, with the knot in the 4th position.

Underscores (_) are used to denote derived variables, and colons (:) are used to denote interaction terms, so explanatory variable names must not contain these characters. EV names should also be unique.

We can check how derived variables relate to the original, untransformed explanatory variable from which they came. Here we look at “terslpdg_D2” and “terslpdg_HR4”:

plot(grasslandPO$terslpdg, grasslandDVs$EVDV$terslpdg$terslpdg_D2, pch = 20)
plot(grasslandPO$terslpdg, grasslandDVs$EVDV$terslpdg$terslpdg_HR4, pch = 20)


The models used to select spline-type derived variables are written to file and the process of spline selection can be inspected by navigating to the directory to which the files were written (default is the working directory). Inside the “deriveVars” folder are folders for each explanatory variable, inside of which are folders for the spline transformation types. Inside each of these folders are 2 files of particular interest: the “splineselection.csv” file gives the model details for each of the 20 candidate derived variables, while the “Vknotplot.png” file shows the variation explained by the 20 candidate variables. Selected candidates are marked in red. Below is an example of one of the V-knot plots:

V-knot plot


Selecting variables

With derived variables ready, we are ready to begin the process of choosing which variables to include in the model. This is arguably the most critical process in building a model. Following the principle of parsimony, the aim in selecting variables is to explain as much variation in the response variable as efficiently as possible. The greater the number of explanatory or derived variables included in the model, the more variation in the response variable we can explain, but at the cost of model complexity. In the MIAmaxent package, the benefit of additional variation explained is weighed against the cost in model complexity using an F-test. Variables are added to the model one by one in a process of forward selection, and each new model is compared to its predecessor. Another term for this process is “nested model comparison”.

Rather than selecting from the full pool of derived variables one by one, MIAmaxent performs variable selection in two parts:

  1. First, a group of derived variables is selected separately for each individual explanatory variable. This is done using the selectDVforEV function.
  2. Second, the explanatory variables themselves, each represented by a group of derived variables, are selected. This is done using the selectEV function.

Variable selection occurs hierarchically: first derived variables within each explanatory variable, then explanatory variables within the full model.

selectDVforEV(…)

The selectDVforEV function performs forward selection of individual derived variables (DVs) for each explanatory variable (EV). In other words, the function takes each EV one by one, and narrows the group of DVs produced from that EV (by deriveVars) to a group which explains variation in the response variable most efficiently.

The alpha-value specified in the function is used in the F-test during forward selection, and sets the threshold for how much variation a DV must explain to be retained. Lower alpha values signify a more conservative test, such that DVs must explain more variation to be included.

Here we use selectDVforEV on the grassland data set. Note the “[[1]]” following grasslandsDV, which specifies the list of DVs in the deriveVars output (see ?deriveVars Value).

grasslandDVselect <- selectDVforEV(grasslandPO, grasslandDVs[[1]], alpha = 0.001)
#> Forward selection of DVs for 13 EVs
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |==========                                                       |  15%
  |                                                                       
  |===============                                                  |  23%
  |                                                                       
  |====================                                             |  31%
  |                                                                       
  |=========================                                        |  38%
  |                                                                       
  |==============================                                   |  46%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |========================================                         |  62%
  |                                                                       
  |=============================================                    |  69%
  |                                                                       
  |==================================================               |  77%
  |                                                                       
  |=======================================================          |  85%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |=================================================================| 100%

The output of selectDVforEV consists of 2 main parts:

  • the DVs that were selected for each EV (named “selectedDV”)
  • the trails of nested models that were built and compared for each explanatory variable during the selection process (named “selection”)

We can see that selectDVforEV has reduced the number of derived variables by comparing the list of DVs before and after:

summary(grasslandDVs$EVDV)
#>            Length Class      Mode
#> pca1       13     data.frame list
#> pr-bygall   5     data.frame list
#> pr-tilany  12     data.frame list
#> teraspif   11     data.frame list
#> terdem      8     data.frame list
#> terslpdg    8     data.frame list
#> tersolrade  6     data.frame list
#> tertpi09    8     data.frame list
#> geoberg    16     data.frame list
#> geolmja1   15     data.frame list
#> lcucor1    21     data.frame list
#> lcutil-t4   2     data.frame list
#> terslpps15  6     data.frame list
sum(sapply(grasslandDVs$EVDV, length))
#> [1] 131
summary(grasslandDVselect$selectedDV)
#>            Length Class      Mode
#> pca1       1      data.frame list
#> pr-bygall  1      data.frame list
#> pr-tilany  1      data.frame list
#> teraspif   1      data.frame list
#> terdem     2      data.frame list
#> tertpi09   2      data.frame list
#> geoberg    6      data.frame list
#> geolmja1   5      data.frame list
#> lcucor1    7      data.frame list
#> terslpps15 1      data.frame list
sum(sapply(grasslandDVselect$selectedDV, length))
#> [1] 27

Note also that the number of explanatory variables was reduced from 13 to 10. Explanatory variables for which none of the derived variables explained a significant amount of variance are dropped.

Here is an example of one of the trails of forward DV selection. Shown is the trail for the “terdem” EV:

grasslandDVselect$selection$terdem[, -13]
#>    round model                      DV m trainAUC Entropy         FVA
#> 1      1     7             terdem_HR18 1   0.5866  9.7239 0.015998801
#> 2      1     2                terdem_M 1   0.5865  9.7240 0.015963133
#> 3      1     1                terdem_L 1   0.5864  9.7241 0.015927466
#> 4      1     3             terdem_D0.5 1   0.5882  9.7247 0.015713461
#> 5      1     6              terdem_HR4 1   0.5624  9.7252 0.015535124
#> 6      1     4               terdem_D1 1   0.5878  9.7271 0.014857442
#> 7      1     5               terdem_D2 1   0.5830  9.7322 0.013038401
#> 8      1     8               terdem_T8 1   0.5571  9.7430 0.009186314
#> 9      2     4  terdem_HR18 terdem_HR4 2   0.5867  9.7188 0.017817842
#> 10     2     7   terdem_HR18 terdem_T8 2   0.5842  9.7234 0.016177138
#> 11     2     5   terdem_HR18 terdem_D1 2   0.5841  9.7236 0.016105803
#> 12     2     3 terdem_HR18 terdem_D0.5 2   0.5892  9.7237 0.016070136
#> 13     2     1    terdem_HR18 terdem_M 2   0.5866  9.7239 0.015998801
#> 14     2     2    terdem_HR18 terdem_L 2   0.5866  9.7239 0.015998801
#> 15     2     6   terdem_HR18 terdem_D2 2   0.5866  9.7239 0.015998801
#>        addedFVA Fstatistic dfe   dfu       Pvalue
#> 1  1.599880e-02 266.922757   1 16417 0.000000e+00
#> 2  1.596313e-02 266.318030   1 16417 0.000000e+00
#> 3  1.592747e-02 265.713347   1 16417 0.000000e+00
#> 4  1.571346e-02 262.086171   1 16417 0.000000e+00
#> 5  1.553512e-02 259.064728   1 16417 0.000000e+00
#> 6  1.485744e-02 247.593223   1 16417 0.000000e+00
#> 7  1.303840e-02 216.879182   1 16417 0.000000e+00
#> 8  9.186314e-03 152.209963   1 16417 0.000000e+00
#> 9  1.819041e-03  30.403095   1 16416 3.562622e-08
#> 10 1.783374e-04   2.975725   1 16416 8.454108e-02
#> 11 1.070024e-04   1.785305   1 16416 1.815177e-01
#> 12 7.133494e-05   1.190160   1 16416 2.753132e-01
#> 13 0.000000e+00   0.000000   1 16416 1.000000e+00
#> 14 0.000000e+00   0.000000   1 16416 1.000000e+00
#> 15 0.000000e+00   0.000000   1 16416 1.000000e+00

The columns in this data frame represent: the round of variable selection (round), the name of the model per round (model), the names of the derived variables included in the model (DV), the number of variables in the model (m), the training AUC (trainAUC), the Entropy or log loss of the model (Entropy), the fraction of variation explained (FVA), the additional variation explained compared to the best model of the previous round (addedFVA), the degrees of freedom of the explained variance (dfe), the degrees of freedom of the unexplained variance (dfu), and the p-value of the F-test comparing the model to the previous round’s best model (Pvalue). The model directory, recorded in column 13, is not shown.

This table shows, for example, that in the first round of selection, one model was built for each of the 8 derived variables, and that all of these explained enough variation to be retained for the second round of selection. Of all the derived variables produced from “terdem”, “terdem_HR18” explained the most variation in the response variable. Furthermore, “terdem_HR4” explained enough of the remaining variation to be selected as well (Pvalue < alpha).

selectEV(…)

Part 2 of variable selection using MIAmaxent is performed by the selectEV function. This function is similar to the selectDVforEV function, but instead of selecting parsimonious groups of derived variables to represent each explanatory variable, selectEV selects explanatory variables. This proceeds in the same manner as selectDVforEV, with nested model comparison using the F-test. Where selectDVforEV adds a single DV at a time, selectEV adds a single group of DVs, representing a single EV, at a time.

The selectEV function also differs from selectDVforEV in one other very important way; it includes the option of including interaction terms between selected explanatory variables in the model (interaction = TRUE). Interaction terms between variables are useful when one explanatory variable changes the effect of another explanatory variable on the modeled target. In MIAmaxent interaction terms are produced by simple multiplication between all pairs of derived variables representing the two interacting explanatory variables. Interaction terms are only allowed between EVs which are both included in the model.

Here we use selectEV on the grassland data set. Note the “[[1]]” following grasslandsDVselect, which specifies the list of selected DVs in the selectDVforEV output (see ?selectDVforEV Value).

grasslandEVselect <- selectEV(grasslandPO, grasslandDVselect[[1]], alpha = 0.001, interaction = TRUE)
#> Forward selection of 10 EVs
#> Round 1 complete.
#> Round 2 complete.
#> Round 3 complete.
#> Round 4 complete.
#> Round 5 complete.
#> Round 6 complete.
#> Round 7 complete.
#> Round 8 complete.
#> Round 9 complete.
#> Forward selection of interaction terms between 9 EVs
#> Round 10 complete.
#> Round 11 complete.
#> Round 12 complete.
#> Round 13 complete.

The output of selectEV consists of 2 main parts:

  • the EVs that were selected (named “selectedEV”), each represented by a group of selected DVs.
  • the trail of nested models that were built and compared during the selection process (named “selection”)

Now compare the list of EVs before and after:

summary(grasslandDVselect[[1]])
#>            Length Class      Mode
#> pca1       1      data.frame list
#> pr-bygall  1      data.frame list
#> pr-tilany  1      data.frame list
#> teraspif   1      data.frame list
#> terdem     2      data.frame list
#> tertpi09   2      data.frame list
#> geoberg    6      data.frame list
#> geolmja1   5      data.frame list
#> lcucor1    7      data.frame list
#> terslpps15 1      data.frame list
length(grasslandDVselect[[1]])
#> [1] 10
summary(grasslandEVselect[[1]])
#>                      Length Class      Mode
#> pr-bygall            1      data.frame list
#> terslpps15           1      data.frame list
#> pca1                 1      data.frame list
#> lcucor1              7      data.frame list
#> geoberg              6      data.frame list
#> teraspif             1      data.frame list
#> tertpi09             2      data.frame list
#> pr-tilany            1      data.frame list
#> geolmja1             5      data.frame list
#> pr-bygall:terslpps15 1      data.frame list
#> terslpps15:lcucor1   7      data.frame list
#> terslpps15:geoberg   6      data.frame list
#> pr-bygall:geoberg    6      data.frame list
length(grasslandEVselect[[1]])
#> [1] 13

We can see that selectEV has reduced the number of explanatory variables to 9, but it has also added new interaction terms between explanatory variables selected for the model.

Here we show the best model of each round in the trail of forward selection of EVs and interaction terms, hiding a few columns for readability:

#>    round model  m trainAUC        FVA     addedFVA Fstatistic       Pvalue
#> 1      1     2  1   0.6240 0.03665027 0.0366502665 624.578389 0.000000e+00
#> 11     2     1  2   0.6425 0.04538880 0.0087385304 150.272398 0.000000e+00
#> 20     3     2  3   0.6493 0.04881287 0.0034240772  59.090610 1.598721e-14
#> 28     4     1 10   0.6696 0.06892933 0.0201164536  50.643811 0.000000e+00
#> 35     5     1 16   0.6863 0.08333899 0.0144096583  42.972486 0.000000e+00
#> 41     6     1 17   0.6882 0.08558604 0.0022470507  40.303276 2.231570e-10
#> 45     7     1 19   0.6908 0.08836810 0.0027820627  25.022735 1.410305e-11
#> 48     8     1 20   0.6918 0.08922412 0.0008560193  15.412139 8.678455e-05
#> 50     9     1 25   0.6941 0.09246986 0.0032457399  11.725762 2.445744e-11
#> 51    10     1 26   0.6947 0.09314754 0.0006776819  12.249581 4.665984e-04
#> 87    11     2 33   0.6975 0.09596527 0.0028177302   7.295628 9.196942e-09
#> 92    12     2 39   0.6991 0.09778431 0.0018190410   5.503871 1.051371e-05
#> 95    13     1 45   0.7016 0.09935368 0.0015693687   4.754969 7.534903e-05

The full selection trail is automatically saved as a CSV-format file in the directory specified by the dir argument in selectEV. It is recommended to open this file (named: “evselection.csv”) using spreadsheet software to inspect the trail of forward selection more closely. This allows the user to make an informed decision about what is the best model.

The trail of forward variable selection is automatically written to file (in the specified dir), and should be examined using spreadsheet software.

We can also look at the selection trail in graphical form. For example, we may plot additional variation explained across round number, to see how much better the model fit is for each round of adding a variable:

plot(grasslandEVselect$selection$round, grasslandEVselect$selection$addedFVA)

For example, we may decide that a simple model with only 5 explanatory variables is best, since rounds 6-13 explain only small amounts of additional variation. Thus, we choose the best model in round 5:

grasslandEVselect[[2]][grasslandEVselect$selection$round == 5, ][1, -13]
#>    round model                                        EV  m trainAUC
#> 35     5     1 pr-bygall terslpps15 pca1 lcucor1 geoberg 16   0.6863
#>    Entropy        FVA   addedFVA Fstatistic dfe   dfu Pvalue
#> 35  9.5351 0.08333899 0.01440966   42.97249   6 16402      0

The explanatory variables included in this model are: pr-bygall, terslpps15, pca1, lcucor1, and geoberg.


Exploring the model

After building a model by selecting which explanatory variable to include, it is useful to explore what the model actually looks like. A straightforward way to do this is to look at model predictions over a range of explanatory data. This gives the modeler a sense of the relationships that the model has captured between explanatory variables and response variables, and can also help the modeler understand strengths and weaknesses in the model.

plotResp(…)

We can use the plotResp function to create a so-called response plot. A response plot graphs model output across the range of one particular explanatory variable. When other variables are excluded from the model entirely, this is called a “single-effect response plot”.

Here we examine a single-effect response plot for the most important explanatory variable included in the model chosen above:

prbygallResp <- plotResp(grasslandPO, grasslandEVselect[[1]], "pr-bygall")

To assess how well the relationship between the explanatory variable and the response variable is captured by the model, it can be useful to examine FOP plots and response plots side-by-side:

prbygallFOP <- plotFOP(grasslandPO, "pr-bygall", intervals=50)
prbygallResp <- plotResp(grasslandPO, grasslandEVselect[[1]], "pr-bygall")

The values on the y-axes of the plots are not directly comparable, but one can expect that the shape of the response plot curve should mirror, to a greater or lesser extent, the shape of the FOP plot curve.

Here is the same comparison for one of the categorical variables included in the model:

terslpps15FOP <- plotFOP(grasslandPO, "terslpps15")
terslpps15Resp <- plotResp(grasslandPO, grasslandEVselect[[1]], "terslpps15")