The R package CaseBasedReasoning provides an R interface for case-based reasoning using machine learning.
# install the released version from CRAN
install.packages("CaseBasedReasoning")
# or install the development version from GitHub
install.packages("devtools")
devtools::install_github("sipemu/case-based-reasoning")
This R package provides two methods for case-based reasoning, both of which use an endpoint:
Linear, logistic, and Cox regression
Proximity and Depth Measure extracted from a fitted random forest (ranger package)
Besides searching for similar cases, the package includes some additional features:
automatic validation of the key variables between the query and similar cases dataset
checking the proportional hazards assumption for the Cox model
C++ functions for distance calculation
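To illustrate what checking the proportional hazards assumption involves, here is a minimal sketch that uses only the survival package, not the CaseBasedReasoning API. It fits the same Cox model as the first example below and applies `cox.zph()`. The object names `dat`, `fit`, and `ph_test` are chosen for illustration, and the package's own check may work differently.

```r
library(survival)

# Prepare the ovarian data as in the examples of this document.
dat <- survival::ovarian
dat$resid.ds <- factor(dat$resid.ds)
dat$rx <- factor(dat$rx)
dat$ecog.ps <- factor(dat$ecog.ps)

# Fit a Cox model with the same formula as in the first example.
fit <- coxph(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps, data = dat)

# Test the proportional hazards assumption based on scaled Schoenfeld residuals.
ph_test <- cox.zph(fit)

# The table has a p-value per row and a GLOBAL row for the whole model;
# small p-values hint at a violation of the assumption.
print(ph_test$table)
```

If the test flags a variable, a common remedy is to stratify by it or to model a time-dependent effect before relying on the Cox coefficients.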
In the first example, we use the Cox model and the ovarian data set from the survival package. In the first step, we initialize the R6 data object.
library(tidyverse)
library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)
# initialize R6 object
coxBeta <- CoxBetaModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps)
After initialization, we may want to find, for each case in the query data, the most similar cases in the learning data.
n <- nrow(ovarian)
trainID <- sample(1:n, floor(0.8 * n), replace = FALSE)
testID <- (1:n)[-trainID]
# fit model
ovarian[trainID, ] %>%
  coxBeta$fit()
# get similar cases: search the learning data for matches to the query cases
ovarian[trainID, ] %>%
  coxBeta$get_similar_cases(queryData = ovarian[testID, ], k = 3) -> matchedData
You may then extract the similar cases and the verum data and put them together.
Note 1: In the initialization step, all cases with missing values in the variables of data and endPoint are dropped. So make sure that you handle NA values yourself.
Note 2: The data.table returned from coxBeta$get_similar_cases has four additional columns:
caseId: With this column you can map the similar cases to cases in data. For example, if you chose k = 3, the first three elements in the column caseId will be 1 (the following three 2, and so on). This means that these three cases are the three most similar cases to case 1 in the verum data.
scDist: The calculated distance
scCaseId: Grouping number of query with matched data
group: Grouping matched or query data
Alternatively, if you are only interested in the distance matrix, you can calculate it this way:
ovarian %>%
  coxBeta$calc_distance_matrix() -> distMatrix
coxBeta$calc_distance_matrix() calculates the full distance matrix. This matrix has the dimension cases of data times cases of query data. If the query dataset is not available, this function calculates an n times n distance matrix of all pairs in data. The distance matrix is saved internally in the cbrCoxModel object: coxBeta$distMat.
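To give an idea of how regression coefficients can act as weights in a distance, here is a minimal sketch that uses only the survival package and base R. It is an assumption-laden illustration, not the package's implementation: it assumes a Manhattan distance on the model's design matrix, with each column weighted by the absolute Cox coefficient. The names `Xw` and `nearest` are hypothetical.

```r
library(survival)

# Prepare the ovarian data as in the example above.
dat <- survival::ovarian
dat$resid.ds <- factor(dat$resid.ds)
dat$rx <- factor(dat$rx)
dat$ecog.ps <- factor(dat$ecog.ps)

fit <- coxph(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps, data = dat)

# Design matrix of the covariates (factors become 0/1 dummy columns).
X <- model.matrix(fit)
beta <- coef(fit)

# Assumption: weight every column by the absolute value of its coefficient,
# so variables with a stronger effect on the hazard count more.
Xw <- sweep(X, 2, abs(beta), `*`)

# n times n matrix of weighted Manhattan distances between all pairs of cases.
distMatrix <- as.matrix(dist(Xw, method = "manhattan"))

# Indices of the three most similar cases to case 1 (position 1 is case 1 itself).
nearest <- order(distMatrix[1, ])[2:4]
print(nearest)
```

With a separate query dataset, the same weighting would be applied to its design matrix and the distances taken between query rows and data rows.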
In the second example, we use a random forest model to approximate a distance measure, again applied to the ovarian data set from the survival package. The CaseBasedReasoning package offers two ways to calculate distance/similarity (see documentation):
proximity
depth
Let’s initialize the R6 data object.
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)

# initialize R6 object
rfSC <- RFModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps)
```
All cases with missing values in the learning and end point variables are dropped (na.omit), and the reduced data set without missing values is saved internally. You get a text output on how many cases were dropped. character variables will be transformed to factor.
Optionally, you may want to adjust some parameters in the fitting step of the random forest algorithm. Possible arguments are ntree, mtry, and splitrule. The documentation of these parameters can be found in the ranger R package. Furthermore, you can choose between two distance measures:
Proximity: the proximity matrix
Depth (Default): calculates the average edge length over all trees
This can be done with:
rfSC$set_dist(distMethod = "Proximity")
All other steps, except checking the proportional hazards assumption, are the same as for the Cox model.
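The idea behind the proximity measure can be sketched in base R. The example below is illustrative and not the package's C++ implementation: it takes a small, made-up matrix of terminal node IDs and counts how often two cases land in the same leaf. With ranger, such a matrix could be obtained from `predict(rf, data = dat, type = "terminalNodes")$predictions`. The helper `proximity()` and the conversion `1 - prox` are assumptions made for this sketch.

```r
# Toy terminal-node matrix: 4 cases (rows) x 3 trees (columns).
# Each entry is the ID of the leaf a case falls into in that tree.
nodes <- matrix(
  c(1, 1, 2,
    1, 2, 2,
    3, 2, 5,
    3, 4, 5),
  nrow = 4, byrow = TRUE
)

# Hypothetical helper: share of trees in which two cases share a leaf.
proximity <- function(nodes) {
  n <- nrow(nodes)
  prox <- matrix(0, n, n)
  for (t in seq_len(ncol(nodes))) {
    # TRUE where both cases sit in the same terminal node of tree t
    prox <- prox + outer(nodes[, t], nodes[, t], `==`)
  }
  prox / ncol(nodes)
}

prox <- proximity(nodes)

# One common way to turn the similarity into a distance (an assumption here).
distProx <- 1 - prox

print(round(prox, 2))
```

In this toy example, cases 1 and 2 share a leaf in two of three trees, so their proximity is 2/3, while cases 1 and 3 never share a leaf and have proximity 0.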
Similar Cases:
n <- nrow(ovarian)
trainID <- sample(1:n, floor(0.8 * n), replace = FALSE)
testID <- (1:n)[-trainID]
# fit model
ovarian[trainID, ] %>%
  rfSC$fit()
# get similar cases
ovarian[trainID, ] %>%
  rfSC$get_similar_cases(queryData = ovarian[testID, ], k = 3) -> matchedData
Distance Matrix Calculation:
ovarian %>%
  rfSC$calc_distance_matrix() -> distMatrix
PD Dr. Jürgen Dippon, Institut für Stochastik und Anwendungen, Universität Stuttgart
Dr. Simon Müller, TTI GmbH - MUON-STAT
Dr. Peter Fritz
Professor Dr. Friedel
The work was funded by the Robert Bosch Foundation. Special thanks go to Professor Dr. Friedel (Thoraxchirurgie - Klinik Schillerhöhe).
Dippon et al. A statistical approach to case based reasoning, with application to breast cancer data (2002)
Friedel et al. Postoperative Survival of Lung Cancer Patients: Are There Predictors beyond TNM? (2012)
Englund and Verikas A novel approach to estimate proximity in a random forest: An exploratory study
Stuart, E. et al. Matching methods for causal inference: Designing observational studies
Defossez et al. Temporal representation of care trajectories of cancer patients using data from a regional information system: an application in breast cancer