stagedtrees

stagedtrees Build Status Coverage status

v2.0.0

The current version of stagedtrees available on CRAN is a major update over the previous version (1.0.2). The update will almost surely break any code written with v1.0.2. Functions naming and functions parameters have been updated to simplify user experience. Check the complete changelog for details.

Preprint

F Carli, M Leonelli, E Riccomagno, G Varando, The R Package stagedtrees for Structural Learning of Stratified Staged Trees, 2020 arXiv:2004.06459

@misc{2004.06459,
Author = {Federico Carli and Manuele Leonelli and Eva Riccomagno and Gherardo Varando},
Title = {The R Package stagedtrees for Structural Learning of Stratified Staged Trees},
Year = {2020},
Eprint = {arXiv:2004.06459},
}

Overview

stagedtrees is a package that implements staged event trees, a probability model for discrete random variables.

Installation

#stable version from CRAN 
install.packages("stagedtrees")

#development version from github
# install.packages("devtools")
devtools::install_github("gherardovarando/stagedtrees")

Usage

library("stagedtrees")

With the stagedtrees package it is possible to fit (stratified) staged event trees to data, use them to compute probabilities, make predictions, visualize and compare different models.

Creating the model

A staged event tree object (sevt class) can be initialized as the full (saturated) or as the fully independent model with, respectively, the functions indep and full. It is possible to build a staged event tree from data stored in a data.frame or a table object.

# Load the PhDArticles data
data("PhDArticles")

# Independence model 
mod_indep <- indep(PhDArticles, lambda = 1)
mod_indep
#> Staged event tree (fitted) 
#> Articles[3] -> Gender[2] -> Kids[2] -> Married[2] -> Mentor[3] -> Prestige[2]  
#> 'log Lik.' -4407.498 (df=11)

# Full (saturated) model
mod_full <- full(PhDArticles, lambda = 1) 
mod_full
#> Staged event tree (fitted) 
#> Articles[3] -> Gender[2] -> Kids[2] -> Married[2] -> Mentor[3] -> Prestige[2]  
#> 'log Lik.' -4066.97 (df=116)

# Full model with not-observed situations joined in NA stages
mod_full0 <- full(PhDArticles, join_unobserved = TRUE, lambda = 1)
mod_full0
#> Staged event tree (fitted) 
#> Articles[3] -> Gender[2] -> Kids[2] -> Married[2] -> Mentor[3] -> Prestige[2]  
#> 'log Lik.' -4066.97 (df=116)

Model selection

stagedtrees implements methods to perform automatic model selection. All methods can be initialized from an arbitrary staged event tree object.

Score methods

This methods perform optimization for a given score using different heuristics.

mod1 <- stages_hc(mod_indep)
mod1
#> Staged event tree (fitted) 
#> Articles[3] -> Gender[2] -> Kids[2] -> Married[2] -> Mentor[3] -> Prestige[2]  
#> 'log Lik.' -4118.434 (df=17)
mod2 <- stages_bhc(mod_full)
mod2
#> Staged event tree (fitted) 
#> Articles[3] -> Gender[2] -> Kids[2] -> Married[2] -> Mentor[3] -> Prestige[2]  
#> 'log Lik.' -4086.254 (df=22)
mod3 <- stages_fbhc(mod_full, score = function(x) -BIC(x))
mod3
#> Staged event tree (fitted) 
#> Articles[3] -> Gender[2] -> Kids[2] -> Married[2] -> Mentor[3] -> Prestige[2]  
#> 'log Lik.' -4146.642 (df=17)
Distance methods
mod4 <- stages_bj(mod_full)
mod4
#> Staged event tree (fitted) 
#> Articles[3] -> Gender[2] -> Kids[2] -> Married[2] -> Mentor[3] -> Prestige[2]  
#> 'log Lik.' -4090.79 (df=25)
Clustering methods
mod5 <- stages_hclust(mod_full0,
                    k = 2, 
                    distance = "totvar",
                   method = "mcquitty")
mod5
#> Staged event tree (fitted) 
#> Articles[3] -> Gender[2] -> Kids[2] -> Married[2] -> Mentor[3] -> Prestige[2]  
#> 'log Lik.' -4122.274 (df=17)
mod6 <- stages_kmeans(mod_full0,
                    k = 2, 
                   algorithm = "Hartigan-Wong")
mod6
#> Staged event tree (fitted) 
#> Articles[3] -> Gender[2] -> Kids[2] -> Married[2] -> Mentor[3] -> Prestige[2]  
#> 'log Lik.' -4119.247 (df=17)

Combining model selections with %>%

The pipe operator from the magrittr package can be used to combine easily various model selection algorithms and to specify models easily.

library(magrittr)
model <- PhDArticles %>% full(lambda = 1) %>% 
           stages_hclust %>% stages_hc

## extract a sub_tree and join two stages
sub_model <- model %>% subtree(path = c(">2"))  %>%  
              join_stages("Mentor", "1", "2")

Probabilities, predictions and sampling

Marginal probabilities

Obtain marginal probabilities with the prob function.

# estimated probability of c(Gender = "male", Married = "yes")
# using different models
prob(mod_indep, c(Gender = "male", Married = "yes")) 
#> [1] 0.3573183
prob(mod3, c(Gender = "male", Married = "yes"))
#> [1] 0.3934668

Or for a data.frame of observations:

obs <- expand.grid(mod_full$tree[c(2,3,5)])
p <- prob(mod2, obs)
cbind(obs, P = p)
#>    Gender Kids Mentor          P
#> 1    male  yes    low 0.07222511
#> 2  female  yes    low 0.03176136
#> 3    male   no    low 0.09832136
#> 4  female   no    low 0.11463987
#> 5    male  yes medium 0.09912549
#> 6  female  yes medium 0.03451126
#> 7    male   no medium 0.10643086
#> 8  female   no medium 0.14830958
#> 9    male  yes   high 0.08649223
#> 10 female  yes   high 0.02188517
#> 11   male   no   high 0.07702539
#> 12 female   no   high 0.10927233
Predictions

A staged event tree object can be used to make predictions with the predict method. The class variable can be specified, otherwise the first variable (root) in the tree will be used.

## check accuracy over the PhDArticles data
predicted <- predict(mod3, newdata = PhDArticles)
table(predicted, PhDArticles$Articles)
#>          
#> predicted   0 1-2  >2
#>       0    32  34  19
#>       1-2 225 351 149
#>       >2   18  39  48

Conditional probabilities (or log-) can be obtained setting prob = TRUE:

## obtain estimated conditional probabilities in mod3 for first 5 obs
## P(Articles|Gender, Kids, Married, Mentor, Prestige)
predict(mod3, newdata = PhDArticles[1:5,], prob = TRUE)
#>           0       1-2        >2
#> 1 0.2853346 0.4393739 0.2752915
#> 2 0.3186093 0.4906121 0.1907785
#> 3 0.3186093 0.4906121 0.1907785
#> 4 0.3450547 0.5313342 0.1236111
#> 5 0.2304826 0.6315078 0.1380096
Sampling
sample_from(mod4, 5)
#>   Articles Gender Kids Married Mentor Prestige
#> 1      1-2 female   no     yes medium     high
#> 2        0   male  yes     yes    low     high
#> 3        0 female   no      no medium      low
#> 4      1-2   male  yes     yes    low     high
#> 5      1-2 female   no     yes   high      low

Explore the model

Model info
# variables 
variable.names(mod1)
#> NULL

# stages
stages(mod1, "Kids")
#> [1] "1" "3" "1" "3" "1" "3"

# summary
summary(mod1)
#> Call: 
#> stages_hc(mod_indep)
#> lambda:  1 
#> Stages: 
#>   Variable:  Articles 
#>  stage npaths sample.size         0      1-2        >2
#>      1      0         915 0.3006536 0.462963 0.2363834
#>   ------------ 
#>   Variable:  Gender 
#>  stage npaths sample.size      male    female
#>      1      2         699 0.5121255 0.4878745
#>      3      1         216 0.6284404 0.3715596
#>   ------------ 
#>   Variable:  Kids 
#>  stage npaths sample.size       yes        no
#>      1      3         494 0.4778226 0.5221774
#>      3      3         421 0.1914894 0.8085106
#>   ------------ 
#>   Variable:  Married 
#>  stage npaths sample.size          no       yes
#>      3      6         316 0.003144654 0.9968553
#>      1      6         599 0.515806988 0.4841930
#>   ------------ 
#>   Variable:  Mentor 
#>       stage npaths sample.size       low    medium      high
#>  UNOBSERVED      6           0 0.3333333 0.3333333 0.3333333
#>           1     11         625 0.3917197 0.3773885 0.2308917
#>           4      7         290 0.1604096 0.4129693 0.4266212
#>   ------------ 
#>   Variable:  Prestige 
#>       stage npaths sample.size       low      high
#>  UNOBSERVED     18           0 0.5000000 0.5000000
#>           1     30         540 0.6236162 0.3763838
#>           4     24         375 0.3262599 0.6737401
#>   ------------
Plot
plot(mod4, main = "Staged tree learned with bj.sevt", 
     cex_label_edges = 0.6, cex_nodes = 1.5)

By default stages associated with the unobserved situations are not plotted, if the model has been created with join_unobserved = TRUE.

plot(stndnaming(mod5, uniq = TRUE), 
     main = "Staged tree learned with stages_hclust 
     (structural zeroes)", 
     col = "stages",
     cex_label_edges = 0.6, cex_nodes = 1.5)

Barplot

barplot_stages permit to easily plot barplots (via barplot) representing the different probabilities defined for the different stages of a variable.

barplot(mod4, "Kids", legend.text = TRUE)

Subtrees

A subtree can be extracted, the result is another staged event tree object in the remaining variables.

sub <- subtree(mod4, c(">2", "female"))
plot(sub)

Comparing models

Check if models are equal.

compare_stages(mod1, mod2)
#> [1] FALSE

compare_stages(mod1, mod2, method = "hamming", plot = TRUE, 
             cex_label_nodes = 0, cex_label_edges = 0)

#> [1] FALSE

hamming_stages(mod1, mod2)
#> [1] 29

difftree <- compare_stages(mod1, mod2, method = "stages", plot = FALSE, 
             return_tree = TRUE)

difftree$Married
#>  [1] 0 1 0 1 0 1 0 1 0 1 0 1

Penalized log-likelihood.

BIC(mod_indep, mod_full, mod1, mod2, mod3, mod4, mod5)
#>            df      BIC
#> mod_indep  11 8890.005
#> mod_full  116 8924.936
#> mod1       17 8352.790
#> mod2       22 8322.524
#> mod3       17 8409.206
#> mod4       25 8352.053
#> mod5       17 8360.471