This short vignette is meant to introduce users to the MIAmaxent
package by providing a worked example of a distribution modeling exercise. It shows how to use all of the main functions included in MIAmaxent
in the order of a typical analysis. This vignette does NOT describe the theoretical underpinnings of the package. To learn more about the theory behind MIAmaxent
, the user is referred to Halvorsen (2013) and Halvorsen et al. (2015), as well as other references listed in the documentation of the package.
Help pages for the package and for individual functions in the package can be accessed in R using the help command:
?'MIAmaxent'
(after attaching the package usinglibrary(MIAmaxent)
).
The data used for demonstration in this vignette are a set of data that have been used to model the distribution of semi-natural grasslands in Østfold County, in southeastern Norway. The data set consists of 1059 locations where presence of semi-natural grasslands has been recorded, 13 environmental variables covering the extent of the study area, and 122 locations where the presence or absence of semi-natural grasslands has been recorded, independently of the presence-only records. The extent of the study area is about 4000 square kilometers, and grain of the raster data is 500 meters (0.25 km2).
The data used in this vignette are included in the package as an example data set, so the code and results shown here can be directly replicated.
Before beginning the modeling exercise, it may be useful to see what the data look like in their geographical representation. We can use the raster
package to plot the 1059 recorded presences on top of one of the environmental variable rasters:
library(raster)
#> Loading required package: sp
EV1 <- raster(list.files(system.file("extdata", "EV_continuous", package = "MIAmaxent"), full.names = TRUE)[1])
PO <- read.csv(system.file("extdata", "occurrence_PO.csv", package = "MIAmaxent"))
par(mar=c(3.1,3.1,2.1,0.1))
plot(EV1, colNA = 'black', legend=FALSE)
points(PO$POINT_X, PO$POINT_Y, pch = 20, cex = 0.5, col = 'blue')
par(mar=c(5.1,4.1,4.1,2.1))
Presence-only occurrence across the study area
The starting point for modeling using MIAmaxent
is a simple data object that contains occurrence data for the modeled target as well as all of the environmental data. The format of this data object is a data frame with the response variable of the model (occurrence) in the first column, and explanatory variables of the model (environmental variables) in subsequent columns. In generative Maxent modeling, the response variable (RV) is binary, and can be either presence or unknown background. These values are coded as “1” and “NA” respectively in the data object. Explanatory variables (EVs) may be continuous or categorical, and these types are denoted by numeric class and factor class respectively.
The
readData(...)
function transforms data in CSV and ASCII raster file formats into a single data frame which serves as the starting point for modeling.
Users of the maxent.jar program are usually accustomed to having their training data in an a different format. Specifically, occurrence data is often in CSV file format, with presences records followed by coordinates, and environmental data in ASCII raster file format. The readData
function makes it easy to convert these data formats into the data object that is used in MIAmaxent
. This function extracts values of the environmental variables at presence locations and at a set of randomly selected background locations, and properly formats these into the starting point for modeling. Alternatively, the user can also specify a custom set of background locations by giving these in the CSV file.
We begin by creating our data object from file. Note that continuous and categorical environmental variables should be placed in separate directories:
library(MIAmaxent)
grasslandPO <- readData(occurrence = system.file("extdata", "occurrence_PO.csv", package = "MIAmaxent"),
contEV = system.file("extdata", "EV_continuous", package = "MIAmaxent"),
catEV = system.file("extdata", "EV_categorical", package = "MIAmaxent"),
maxbkg = 20000)
All functions in
MIAmaxent
produce console output. Therefore it’s handy to always assign function output to an object, so that you can manipulate that object further.
If we look at the resulting data object we see the response variable (with 1059 presence and 16420 background points) along with 8 continuous and 5 categorical explanatory variables:
str(grasslandPO)
#> 'data.frame': 17479 obs. of 14 variables:
#> $ RV : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ pca1 : num 0.000504 0.000506 0.000554 0.00051 0.000486 ...
#> $ pr-bygall : num 0.00772 0.00134 0.003 0.01049 0.00092 ...
#> $ pr-tilany : num 0 0 0.3898 0.0884 0.1316 ...
#> $ teraspif : num 180 95 20 111 116 61 24 34 35 35 ...
#> $ terdem : num 0 0 0 12 0 0 0 11 26 39 ...
#> $ terslpdg : num 0 0.827 1.433 1.175 0.813 ...
#> $ tersolrade: num 977 966 1018 956 960 ...
#> $ tertpi09 : num -3.54 -7.51 -6.11 -2.38 -15.16 ...
#> $ geoberg : Factor w/ 16 levels "0","2","3","4",..: 10 10 9 10 10 10 10 10 10 10 ...
#> $ geolmja1 : Factor w/ 15 levels "11","12","15",..: 8 7 15 7 8 7 8 7 7 7 ...
#> $ lcucor1 : Factor w/ 21 levels "111","112","121",..: 21 21 21 8 21 8 21 8 12 2 ...
#> $ lcutil-t4 : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 2 1 ...
#> $ terslpps15: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 6 1 1 1 1 6 6 ...
sum(grasslandPO$RV == 1, na.rm = TRUE)
#> [1] 1059
sum(is.na(grasslandPO$RV))
#> [1] 16420
IMPORTANT: Important preparations for distribution modeling, such as accounting for sampling bias and setting study area extent, are not dealt with in MIAmaxent
, and must be dealt with beforehand. Good modeling practice requires that these issues be attended to!
By its simplest definition, a distribution model examines the relationship between the modeled target and its environment. In this way, distribution modeling follows the long tradition of gradient analysis in ecology (Halvorsen, 2012). Therefore, before building an actual model, we should have some idea about what influence the environmental variables have on the occurrence of the target.
We can use the plotFOP
function to create a so-called Frequency of Observed Presence (FOP) plot. An FOP plot shows how commonly the target occurs across the range of the explanatory variable, and makes it possible to recognize patterns in frequency of occurrence. Most often, the relationship between an environmental variable and modeled target is linear or unimodal, but the pattern seen in the FOP plot depends on the range of the EV, which in turn depends on the extent of the study area.
Here we examine FOP plots for 2 of the continuous explanatory variables:
teraspifFOP <- plotFOP(grasslandPO, "teraspif")
terslpdgFOP <- plotFOP(grasslandPO, "terslpdg")
We can change the appearance of the plot with additional arguments, or access the plot data directly:
terslpdgFOP <- plotFOP(grasslandPO, "terslpdg", intervals = 25, pch=20, cex=1.1, col = "red")
terslpdgFOP$FOPdata
#> int n intEV intRV smoothRV
#> 1 [0.000,0.380) 1829 0.1974441 0.06232914 NA
#> 2 [0.380,0.761) 2844 0.5750047 0.06645570 NA
#> 3 [0.761,1.141) 3047 0.9473344 0.06005907 0.06181457
#> 4 [1.141,1.522) 2542 1.3258624 0.06648308 0.06107528
#> 5 [1.522,1.902) 1981 1.7033925 0.04896517 0.05372778
#> 6 [1.902,2.283) 1517 2.0811801 0.05141727 0.05284629
#> 7 [2.283,2.663) 1144 2.4662042 0.05419580 0.05489220
#> 8 [2.663,3.043) 812 2.8471670 0.05665025 0.05930401
#> 9 [3.043,3.424) 538 3.2173911 0.07063197 0.06589245
#> 10 [3.424,3.804) 372 3.6004458 0.06989247 0.06610008
#> 11 [3.804,4.185) 264 3.9885431 0.05681818 0.06119433
#> 12 [4.185,4.565) 182 4.3489249 0.06043956 0.06324177
#> 13 [4.565,4.946) 128 4.7418206 0.06250000 0.06742092
#> 14 [4.946,5.326) 93 5.1260939 0.09677419 0.08129914
#> 15 [5.326,5.706) 56 5.5293487 0.05357143 0.07129972
#> 16 [5.706,6.087) 35 5.9119111 0.11428571 0.08431723
#> 17 [6.087,6.467) 40 6.2669473 0.02500000 0.07247652
#> 18 [6.467,6.848) 21 6.6230410 0.09523810 0.10865229
#> 19 [6.848,7.228) 15 7.0089600 0.26666667 0.15205428
#> 20 [7.228,7.609) 10 7.4276010 0.00000000 0.05531730
#> 21 [7.609,7.989) 4 7.7217125 0.00000000 0.01798688
#> 22 [7.989,8.369) 3 8.1058232 0.00000000 0.00000000
#> 23 [8.750,9.130) 1 9.0068598 0.00000000 NA
#> 24 [9.130,9.511] 1 9.5107498 0.00000000 NA
Now we examine FOP plots for 1 of the categorical explanatory variables:
geobergFOP <- plotFOP(grasslandPO, 10)
We see that geoberg type 4 has the highest rate of observed presence, followed by type 2, and then types 3 and 28. If we look more closely at the data however, we notice also that geoberg type 4 is sampled very rarely, with only 6 locations falling into that category:
geobergFOP
#> $EVoptimum
#> [1] 4
#> Levels: 0 2 3 4 5 21 22 26 27 28 29 35 40 62 82 85
#>
#> $FOPdata
#> n level levelRV
#> 1 5 0 0.00000000
#> 2 20 2 0.45000000
#> 3 74 3 0.28378378
#> 4 6 4 0.50000000
#> 5 456 5 0.10526316
#> 6 3448 21 0.05974478
#> 7 46 22 0.06521739
#> 8 14 26 0.00000000
#> 9 21 27 0.09523810
#> 10 58 28 0.27586207
#> 11 6 29 0.00000000
#> 12 369 35 0.11382114
#> 13 1 40 0.00000000
#> 14 8183 62 0.05230356
#> 15 4445 82 0.05939258
#> 16 327 85 0.05198777
It is recommended that FOP plots for all explanatory variables be examined before building a model.
Looking at FOP plots can help the modeler decide which explanatory variables are likely to have greatest explanatory power, and gives an idea of the strength and shape of the relationships between the explanatory and response variables.
To fit the many different kinds of relationships between explanatory and response variables, we need to transform the explanatory variables. This means that we create new “derived” variables (DVs) from the original explanatory variables. Another way of thinking about this is to put it in terms of rescaling the explanatory variable; we adjust the scale of the explanatory variable in many different ways in order to check which scaling is most ecologically relevant to the occurrence of the modeled target.
The deriveVars
function produces derived variables from explanatory variables by 7 different transformation types: linear, monotonous, deviation, forward hinge, reverse hinge, threshold, and binary. The first 6 of these are relevant for continuous variables and the binary transformation is relevant only for categorical variables. Different types of transformations can be turned on or off in order to balance model complexity with model fit.
For the spline-type transformations (forward hinge, reverse hinge, threshold) an endless number of different transformations are possible, so the function produces 20 of each, and then chooses those which explain the most variation in the response variable. This means that 20 models are built and evaluated for each combination of explanatory variable and spline transformation. Therefore, running deriveVars
with these transformation types turned on may take some time.
Here we produce all types of derived variables from our explanatory variables:
grasslandDVs <- deriveVars(grasslandPO, transformtype = c("L", "M", "D", "HF", "HR", "T", "B"))
Use the
dir
argument to specify the directory to which files will be written in thederiveVars
,selectDVforEV
,selectEV
, andplotResp
functions.
The console output of deriveVars
(which is saved above as grasslandDVs
) consists of 2 parts:
summary(grasslandDVs$EVDV) # alternatively: summary(grasslandDVs[[1]])
#> Length Class Mode
#> pca1 13 data.frame list
#> pr-bygall 5 data.frame list
#> pr-tilany 12 data.frame list
#> teraspif 11 data.frame list
#> terdem 8 data.frame list
#> terslpdg 8 data.frame list
#> tersolrade 6 data.frame list
#> tertpi09 8 data.frame list
#> geoberg 16 data.frame list
#> geolmja1 15 data.frame list
#> lcucor1 21 data.frame list
#> lcutil-t4 2 data.frame list
#> terslpps15 6 data.frame list
head(summary(grasslandDVs$transformations)) # alternatively: head(summary(grasslandDVs[[2]]))
#> Length Class Mode
#> pca1_L_transf 1 -none- function
#> pca1_M_transf 1 -none- function
#> pca1_D0.5_transf 1 -none- function
#> pca1_D1_transf 1 -none- function
#> pca1_D2_transf 1 -none- function
#> pca1_HF5_transf 1 -none- function
length(grasslandDVs$transformations)
#> [1] 131
Note that the names of derived variables indicate the type of transformation was used to create them. For example, “terslpdg_D2” is a deviation-type transformation of terslpdg, where the slope of the deviation is controlled by a parameter value of 2. Meanwhile, “terslpdg_HR4” is the a reverse hinge transformation, with the knot in the 4th position.
Underscores (_) are used to denote derived variables, and colons (:) are used to denote interaction terms, so explanatory variable names must not contain these characters. EV names should also be unique.
We can check how derived variables relate to the original, untransformed explanatory variable from which they came. Here we look at “terslpdg_D2” and “terslpdg_HR4”:
plot(grasslandPO$terslpdg, grasslandDVs$EVDV$terslpdg$terslpdg_D2, pch = 20)
plot(grasslandPO$terslpdg, grasslandDVs$EVDV$terslpdg$terslpdg_HR4, pch = 20)
The models used to select spline-type derived variables are written to file and the process of spline selection can be inspected by navigating to the directory to which the files were written (default is the working directory). Inside the “deriveVars” folder are folders for each explanatory variable, inside of which are folders for the spline transformation types. Inside each of these folders are 2 files of particular interest: the “splineselection.csv” file gives the model details for each of the 20 candidate derived variables, while the “Vknotplot.png” file shows the variation explained by the 20 candidate variables. Selected candidates are marked in red. Below is an example of one of the V-knot plots:
With derived variables ready, we are ready to begin the process of choosing which variables to include in the model. This is arguably the most critical process in building a model. Following the principle of parsimony, the aim in selecting variables is to explain as much variation in the response variable as efficiently as possible. The greater the number of explanatory or derived variables included in the model, the more variation in the response variable we can explain, but at the cost of model complexity. In the MIAmaxent
package, the benefit of additional variation explained is weighed against the cost in model complexity using an F-test. Variables are added to the model one by one in a process of forward selection, and each new model is compared to its predecessor. Another term for this process is “nested model comparison”.
Rather than selecting from the full pool of derived variables one by one, MIAmaxent
performs variable selection in two parts:
selectDVforEV
function.selectEV
function.Variable selection occurs hierarchically: first derived variables within each explanatory variable, then explanatory variables within the full model.
The selectDVforEV
function performs forward selection of individual derived variables (DVs) for each explanatory variable (EV). In other words, the function takes each EV one by one, and narrows the group of DVs produced from that EV (by deriveVars
) to a group which explains variation in the response variable most efficiently.
The alpha-value specified in the function is used in the F-test during forward selection, and sets the threshold for how much variation a DV must explain to be retained. Lower alpha values signify a more conservative test, such that DVs must explain more variation to be included.
Here we use selectDVforEV
on the grassland data set. Note the “[[1]]” following grasslandsDV, which specifies the list of DVs in the deriveVars
output (see ?deriveVars
Value).
grasslandDVselect <- selectDVforEV(grasslandPO, grasslandDVs[[1]], alpha = 0.001)
#> Forward selection of DVs for 13 EVs
#>
|
| | 0%
|
|===== | 8%
|
|========== | 15%
|
|=============== | 23%
|
|==================== | 31%
|
|========================= | 38%
|
|============================== | 46%
|
|=================================== | 54%
|
|======================================== | 62%
|
|============================================= | 69%
|
|================================================== | 77%
|
|======================================================= | 85%
|
|============================================================ | 92%
|
|=================================================================| 100%
The output of selectDVforEV consists of 2 main parts:
We can see that selectDVforEV
has reduced the number of derived variables by comparing the list of DVs before and after:
summary(grasslandDVs$EVDV)
#> Length Class Mode
#> pca1 13 data.frame list
#> pr-bygall 5 data.frame list
#> pr-tilany 12 data.frame list
#> teraspif 11 data.frame list
#> terdem 8 data.frame list
#> terslpdg 8 data.frame list
#> tersolrade 6 data.frame list
#> tertpi09 8 data.frame list
#> geoberg 16 data.frame list
#> geolmja1 15 data.frame list
#> lcucor1 21 data.frame list
#> lcutil-t4 2 data.frame list
#> terslpps15 6 data.frame list
sum(sapply(grasslandDVs$EVDV, length))
#> [1] 131
summary(grasslandDVselect$selectedDV)
#> Length Class Mode
#> pca1 1 data.frame list
#> pr-bygall 1 data.frame list
#> pr-tilany 1 data.frame list
#> teraspif 1 data.frame list
#> terdem 2 data.frame list
#> tertpi09 2 data.frame list
#> geoberg 6 data.frame list
#> geolmja1 5 data.frame list
#> lcucor1 7 data.frame list
#> terslpps15 1 data.frame list
sum(sapply(grasslandDVselect$selectedDV, length))
#> [1] 27
Note also that the number of explanatory variables was reduced from 13 to 10. Explanatory variables for which none of the derived variables explained a significant amount of variance are dropped.
Here is an example of one of the trails of forward DV selection. Shown is the trail for the “terdem” EV:
grasslandDVselect$selection$terdem[, -13]
#> round model DV m trainAUC Entropy FVA
#> 1 1 7 terdem_HR18 1 0.5866 9.7239 0.015998801
#> 2 1 2 terdem_M 1 0.5865 9.7240 0.015963133
#> 3 1 1 terdem_L 1 0.5864 9.7241 0.015927466
#> 4 1 3 terdem_D0.5 1 0.5882 9.7247 0.015713461
#> 5 1 6 terdem_HR4 1 0.5624 9.7252 0.015535124
#> 6 1 4 terdem_D1 1 0.5878 9.7271 0.014857442
#> 7 1 5 terdem_D2 1 0.5830 9.7322 0.013038401
#> 8 1 8 terdem_T8 1 0.5571 9.7430 0.009186314
#> 9 2 4 terdem_HR18 terdem_HR4 2 0.5867 9.7188 0.017817842
#> 10 2 7 terdem_HR18 terdem_T8 2 0.5842 9.7234 0.016177138
#> 11 2 5 terdem_HR18 terdem_D1 2 0.5841 9.7236 0.016105803
#> 12 2 3 terdem_HR18 terdem_D0.5 2 0.5892 9.7237 0.016070136
#> 13 2 1 terdem_HR18 terdem_M 2 0.5866 9.7239 0.015998801
#> 14 2 2 terdem_HR18 terdem_L 2 0.5866 9.7239 0.015998801
#> 15 2 6 terdem_HR18 terdem_D2 2 0.5866 9.7239 0.015998801
#> addedFVA Fstatistic dfe dfu Pvalue
#> 1 1.599880e-02 266.922757 1 16417 0.000000e+00
#> 2 1.596313e-02 266.318030 1 16417 0.000000e+00
#> 3 1.592747e-02 265.713347 1 16417 0.000000e+00
#> 4 1.571346e-02 262.086171 1 16417 0.000000e+00
#> 5 1.553512e-02 259.064728 1 16417 0.000000e+00
#> 6 1.485744e-02 247.593223 1 16417 0.000000e+00
#> 7 1.303840e-02 216.879182 1 16417 0.000000e+00
#> 8 9.186314e-03 152.209963 1 16417 0.000000e+00
#> 9 1.819041e-03 30.403095 1 16416 3.562622e-08
#> 10 1.783374e-04 2.975725 1 16416 8.454108e-02
#> 11 1.070024e-04 1.785305 1 16416 1.815177e-01
#> 12 7.133494e-05 1.190160 1 16416 2.753132e-01
#> 13 0.000000e+00 0.000000 1 16416 1.000000e+00
#> 14 0.000000e+00 0.000000 1 16416 1.000000e+00
#> 15 0.000000e+00 0.000000 1 16416 1.000000e+00
The columns in this data frame represent: the round of variable selection (round), the name of the model per round (model), the names of the derived variables included in the model (DV), the number of variables in the model (m), the training AUC (trainAUC), the Entropy or log loss of the model (Entropy), the fraction of variation explained (FVA), the additional variation explained compared to the best model of the previous round (addedFVA), the degrees of freedom of the explained variance (dfe), the degrees of freedom of the unexplained variance (dfu), and the p-value of the F-test comparing the model to the previous round’s best model (Pvalue). The model directory, recorded in column 13, is not shown.
This table shows, for example, that in the first round of selection, one model was built for each of the 8 derived variables, and that all of these explained enough variation to be retained for the second round of selection. Of all the derived variables produced from “terdem”, “terdem_HR18” explained the most variation in the response variable. Furthermore, “terdem_HR4” explained enough of the remaining variation to be selected as well (Pvalue < alpha).
Part 2 of variable selection using MIAmaxent
is performed by the selectEV
function. This function is similar to the selectDVforEV
function, but instead of selecting parsimonious groups of derived variables to represent each explanatory variable, selectEV
selects explanatory variables. This proceeds in the same manner as selectDVforEV
, with nested model comparison using the F-test. Where selectDVforEV
adds a single DV at a time, selectEV
adds a single group of DVs, representing a single EV, at a time.
The selectEV
function also differs from selectDVforEV
in one other very important way; it includes the option of including interaction terms between selected explanatory variables in the model (interaction = TRUE
). Interaction terms between variables are useful when one explanatory variable changes the effect of another explanatory variable on the modeled target. In MIAmaxent
interaction terms are produced by simple multiplication between all pairs of derived variables representing the two interacting explanatory variables. Interaction terms are only allowed between EVs which are both included in the model.
Here we use selectEV
on the grassland data set. Note the “[[1]]” following grasslandsDVselect, which specifies the list of selected DVs in the selectDVforEV
output (see ?selectDVforEV
Value).
grasslandEVselect <- selectEV(grasslandPO, grasslandDVselect[[1]], alpha = 0.001, interaction = TRUE)
#> Forward selection of 10 EVs
#> Round 1 complete.
#> Round 2 complete.
#> Round 3 complete.
#> Round 4 complete.
#> Round 5 complete.
#> Round 6 complete.
#> Round 7 complete.
#> Round 8 complete.
#> Round 9 complete.
#> Forward selection of interaction terms between 9 EVs
#> Round 10 complete.
#> Round 11 complete.
#> Round 12 complete.
#> Round 13 complete.
The output of selectEV consists of 2 main parts:
Now compare the list of EVs before and after:
summary(grasslandDVselect[[1]])
#> Length Class Mode
#> pca1 1 data.frame list
#> pr-bygall 1 data.frame list
#> pr-tilany 1 data.frame list
#> teraspif 1 data.frame list
#> terdem 2 data.frame list
#> tertpi09 2 data.frame list
#> geoberg 6 data.frame list
#> geolmja1 5 data.frame list
#> lcucor1 7 data.frame list
#> terslpps15 1 data.frame list
length(grasslandDVselect[[1]])
#> [1] 10
summary(grasslandEVselect[[1]])
#> Length Class Mode
#> pr-bygall 1 data.frame list
#> terslpps15 1 data.frame list
#> pca1 1 data.frame list
#> lcucor1 7 data.frame list
#> geoberg 6 data.frame list
#> teraspif 1 data.frame list
#> tertpi09 2 data.frame list
#> pr-tilany 1 data.frame list
#> geolmja1 5 data.frame list
#> pr-bygall:terslpps15 1 data.frame list
#> terslpps15:lcucor1 7 data.frame list
#> terslpps15:geoberg 6 data.frame list
#> pr-bygall:geoberg 6 data.frame list
length(grasslandEVselect[[1]])
#> [1] 13
We can see that selectEV
has reduced the number of explanatory variables to 9, but it has also added new interaction terms between explanatory variables selected for the model.
Here we show the best model of each round in the trail of forward selection of EVs and interaction terms, hiding a few columns for readability:
#> round model m trainAUC FVA addedFVA Fstatistic Pvalue
#> 1 1 2 1 0.6240 0.03665027 0.0366502665 624.578389 0.000000e+00
#> 11 2 1 2 0.6425 0.04538880 0.0087385304 150.272398 0.000000e+00
#> 20 3 2 3 0.6493 0.04881287 0.0034240772 59.090610 1.598721e-14
#> 28 4 1 10 0.6696 0.06892933 0.0201164536 50.643811 0.000000e+00
#> 35 5 1 16 0.6863 0.08333899 0.0144096583 42.972486 0.000000e+00
#> 41 6 1 17 0.6882 0.08558604 0.0022470507 40.303276 2.231570e-10
#> 45 7 1 19 0.6908 0.08836810 0.0027820627 25.022735 1.410305e-11
#> 48 8 1 20 0.6918 0.08922412 0.0008560193 15.412139 8.678455e-05
#> 50 9 1 25 0.6941 0.09246986 0.0032457399 11.725762 2.445744e-11
#> 51 10 1 26 0.6947 0.09314754 0.0006776819 12.249581 4.665984e-04
#> 87 11 2 33 0.6975 0.09596527 0.0028177302 7.295628 9.196942e-09
#> 92 12 2 39 0.6991 0.09778431 0.0018190410 5.503871 1.051371e-05
#> 95 13 1 45 0.7016 0.09935368 0.0015693687 4.754969 7.534903e-05
The full selection trail is automatically saved as a CSV-format file in the directory specified by the dir
argument in selectEV
. It is recommended to open this file (named: “evselection.csv”) using spreadsheet software to inspect the trail of forward selection more closely. This allows the user to make an informed decision about what is the best model.
The trail of forward variable selection is automatically written to file (in the specified
dir
), and should be examined using spreadsheet software.
We can also look at the selection trail in graphical form. For example, we may plot additional variation explained across round number, to see how much better the model fit is for each round of adding a variable:
plot(grasslandEVselect$selection$round, grasslandEVselect$selection$addedFVA)
For example, we may decide that a simple model with only 5 explanatory variables is best, since rounds 6-13 explain only small amounts of additional variation. Thus, we choose the best model in round 5:
grasslandEVselect[[2]][grasslandEVselect$selection$round == 5, ][1, -13]
#> round model EV m trainAUC
#> 35 5 1 pr-bygall terslpps15 pca1 lcucor1 geoberg 16 0.6863
#> Entropy FVA addedFVA Fstatistic dfe dfu Pvalue
#> 35 9.5351 0.08333899 0.01440966 42.97249 6 16402 0
The explanatory variables included in this model are: pr-bygall, terslpps15, pca1, lcucor1, and geoberg.
After building a model by selecting which explanatory variable to include, it is useful to explore what the model actually looks like. A straightforward way to do this is to look at model predictions over a range of explanatory data. This gives the modeler a sense of the relationships that the model has captured between explanatory variables and response variables, and can also help the modeler understand strengths and weaknesses in the model.
We can use the plotResp
function to create a so-called response plot. A response plot graphs model output across the range of one particular explanatory variable. When other variables are excluded from the model entirely, this is called a “single-effect response plot”.
Here we examine a single-effect response plot for the most important explanatory variable included in the model chosen above:
prbygallResp <- plotResp(grasslandPO, grasslandEVselect[[1]], "pr-bygall")
To assess how well the relationship between the explanatory variable and the response variable is captured by the model, it can be useful to examine FOP plots and response plots side-by-side:
prbygallFOP <- plotFOP(grasslandPO, "pr-bygall", intervals=50)
prbygallResp <- plotResp(grasslandPO, grasslandEVselect[[1]], "pr-bygall")
The values on the y-axes of the plots are not directly comparable, but one can expect that the shape of the response plot curve should mirror, to a greater or lesser extent, the shape of the FOP plot curve.
Here is the same comparison for one of the categorical variables included in the model:
terslpps15FOP <- plotFOP(grasslandPO, "terslpps15")
terslpps15Resp <- plotResp(grasslandPO, grasslandEVselect[[1]], "terslpps15")