The D2C (Dependence to Causality) algorithm aims to infer the existence of causal dependencies from static (i.e. non temporal) observational and multivariate data.
The D2C algorithm predicts the existence of a direct causal link between two variables in a multivariate setting by (i) creating a set of of features of the relationship based on asymmetric descriptors of the multivariate dependency and (ii) using a classifier (in this case a Random Forest) to learn a mapping between the features and the presence of a causal link. The approach relies on the asymmetry of some conditional (in)dependence relations between the members of the Markov blankets of two variables causally connected.
This vignette shows how to create a training set for D2C, how to create a D2C object and how to assess it with a number of simulated datasets. A comparison with two inference algorithms implemented in the bnearn
package is provided too.
Let us start by creating a set of Directed Acyclic Graphs (DAGs) and the associated simulated datasets.
The object simulatedDAG
can be initialized by setting the number NDAG
of DAGs, the range noNodes
of the number of nodes, the range N
of observed samples, the type functionType
of dependency and the range sdn
of standard deviation of the additive noise.
rm(list=ls())
library(D2C)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
require(RBGL)
## Loading required package: RBGL
## Loading required package: graph
require(gRbase)
## Loading required package: gRbase
noNodes<-c(10,20)
## range of number of nodes
N<-c(50,200)
## range of number of samples
sd.noise<-c(0.2,1)
## range of values for standard deviation of additive noise
NDAG=100
## number of DAGs to be created and simulated
trainDAG<-new("simulatedDAG",NDAG=NDAG, N=N, noNodes=noNodes,
functionType = c("linear","quadratic","sigmoid"),
seed=0,sdn=sd.noise,quantize=c(TRUE,FALSE),verbose=FALSE)
We can now check the number of DAGs
print(trainDAG@NDAG)
## [1] 100
If it happens that the number is smaller than NDAG
, this is due to the fact that DAGs with no edges are removed.
Let us visualize also the first simulated DAG and the dimension of the related datasets:
print(trainDAG@list.DAGs[[1]])
## A graphNEL graph with directed edges
## Number of Nodes = 16
## Number of Edges = 27
print(dim(trainDAG@list.observationsDAGs[[1]]))
## [1] 90 16
Once we have a training set of DAGs, we can create a D2C object by storing for (a subset of) the edges of each DAG a vector of descriptors and the related binary label (0 or 1) about the existance of a causal link.
The parameters of the D2C descriptors are contained in a D2C.descriptor
object
descr.example<-new("D2C.descriptor",bivariate=FALSE,ns=3,acc=TRUE,lin=TRUE)
trainD2C<-new("D2C",sDAG=trainDAG,
descr=descr.example,ratioEdges=1,max.features=30,verbose=FALSE)
Let us print some properties of the trainD2C
object:
print(dim(trainD2C@X))
## [1] 3136 47
print(table(trainD2C@Y))
##
## 0 1
## 1434 1702
Let us print out the Random forest object stored within trainD2C
print(trainD2C@mod)
##
## Call:
## randomForest(x = X[, rank], y = factor(Y))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 24.43%
## Confusion matrix:
## 0 1 class.error
## 0 1131 303 0.2113
## 1 463 1239 0.2720
Let us now create an independent test set (e.g. by creating
a second simulatedDAG
object with a different seed of the random generator) and let us store it in the testDAG
object:
NDAG.test=50
noNodes<-c(10,20)
## range of number of nodes
N<-c(50,100)
testDAG<-new("simulatedDAG",NDAG=NDAG.test, N=N, noNodes=noNodes,
functionType = c("linear","quadratic","sigmoid"), quantize=c(TRUE,FALSE),
seed=101,sdn=c(0.2,0.5),verbose=FALSE)
In this section we assess the accuracy of D2C and we compare it with two algorithms implemented in the package bnlearn
. Accuracy is measured by the Balanced Error Rate (BER). For each dataset associated to the simulated DAG contained in testDAG
, we infer the directionality of the edges and we compare it with the ground truth.
require(foreach)
if (!require(bnlearn)){
install.packages("bnlearn", repos="http://cran.rstudio.com/")
library(bnlearn)
}
FF<-foreach (r=1:testDAG@NDAG) %do%{
set.seed(r)
observedData<-testDAG@list.observationsDAGs[[r]]
trueDAG<-testDAG@list.DAGs[[r]]
## inference of networks with bnlearn package
Ahat.GS<-amat(gs(data.frame(observedData),alpha=0.01))
Ahat.IAMB<-(amat(iamb(data.frame(observedData),alpha=0.01)))
graphTRUE<- as.adjMAT(trueDAG)
## selection of a balanced subset of edges for the assessment
Nodes=nodes(trueDAG)
max.edges<-min(30,length(edgeList(trueDAG)))
subset.edges = matrix(unlist(sample(edgeList(trueDAG),
size = max.edges,replace = F)),ncol=2,byrow = TRUE)
subset.edges = rbind(subset.edges,t(replicate(n =max.edges,
sample(Nodes,size=2,replace = FALSE))))
Yhat.D2C<-NULL
Yhat.IAMB<-NULL
Yhat.GS<-NULL
Ytrue<-NULL
for(jj in 1:NROW(subset.edges)){
i =as(subset.edges[jj,1],"numeric");
j =as(subset.edges[jj,2],"numeric") ;
pred.D2C = predict(trainD2C,i,j, observedData)
Yhat.D2C<-c(Yhat.D2C,as.numeric(pred.D2C$response) -1)
Yhat.IAMB<-c(Yhat.IAMB,Ahat.IAMB[i,j])
Yhat.GS<-c(Yhat.GS,Ahat.GS[i,j])
Ytrue<-c(Ytrue,graphTRUE[subset.edges[jj,1],subset.edges[jj,2]])
}
list(Yhat.D2C=Yhat.D2C,Yhat.GS=Yhat.GS,
Yhat.IAMB=Yhat.IAMB,Ytrue=Ytrue)
}
Yhat.D2C<-unlist(lapply(FF,"[[",1) )
Yhat.GS<-unlist(lapply(FF,"[[",2))
Yhat.IAMB<-unlist(lapply(FF,"[[",3))
Ytrue<-unlist(lapply(FF,"[[",4))
## computation of Balanced Error Rate
BER.D2C<-BER(Ytrue,Yhat.D2C)
BER.GS<-BER(Ytrue,Yhat.GS)
The average accuracy of the three algorithms (the smaller the BER the better) is shown by typing:
cat("\n BER.D2C=",BER.D2C, "BER.IAMB=",BER.IAMB,"BER.GS=",BER.GS,"\n")
We invite the user to assess the impact of the number of training samples (e.g. by increasing the value NDAG
and retraining D2C) on the accuracy of the network construction.
Let us upload the alarm dataset and let us focus on a highly connected subnetwork of 100 nodes
data(alarm)
graphTRUE<-true.net
set.seed(0)
observedData<-dataset
w.const<-which(apply(observedData,2,sd)<0.1)
if (length(w.const)>0){
observedData<-observedData[,-w.const]
graphTRUE<-graphTRUE[-w.const,-w.const]
}
indn<-sort(apply(graphTRUE,1,sum)+apply(graphTRUE,2,sum),decr=TRUE,index=TRUE)$ix[1:100]
observedData<-observedData[,indn]
graphTRUE<-graphTRUE[indn,indn]
Let us first make the inference with two bnlearn
algorithms,
n<-NCOL(observedData)
Ahat.GS<-amat(gs(data.frame(observedData)))
Ahat.IAMB<-amat(iamb(data.frame(observedData),alpha=0.05))
and then with D2C…
Yhat.D2C<-NULL
Yhat.GS<-NULL
Yhat.IAMB<-NULL
Ytrue<-NULL
for (i in 1:n){
## creation of a balanced test set
ind1<-which(graphTRUE[i,]==1)
ind0<-setdiff(setdiff(1:n,i),ind1)
ind<-c(ind1,ind0[1:length(ind1)])
FF<-foreach(j=ind) %do%{
list(Yhat=as.numeric(predict(trainD2C,i,j,observedData)$response) -1,
Yhat2=Ahat.GS[i,j],Yhat3=Ahat.IAMB[i,j],Ytrue=graphTRUE[i,j])
}
Yhat.D2C<-c(Yhat.D2C,unlist(lapply(FF,"[[",1) ))
Yhat.GS<-c(Yhat.GS,unlist(lapply(FF,"[[",2)))
Yhat.IAMB<-c(Yhat.IAMB,unlist(lapply(FF,"[[",3)))
Ytrue<-c(Ytrue,unlist(lapply(FF,"[[",4)))
}
You can now print out the balanced error rates (BER) of the three techniques by using:
cat("\n BER.D2C=",BER(Ytrue,Yhat.D2C), "BER.GS=",BER(Ytrue,Yhat.GS),
"BER.IAMB=",BER(Ytrue,Yhat.IAMB),
"\n")