This is an example project showing how tangram can meet the needs of a fictional FDA report. It’s based on some actual work that is currently embargoed. From clinical trial data many tables of various summaries are required. So the goal is to demonstrate the work required to create a series of consistent tables using tangram for a project.
This example shows a few interesting things:
cell_*
functions like cell_chi_test
, since the formatting of these was not the desired.This work will most likely evolve into a tranform bundle inside the library that is available for anyone to use. To do this requires more work and consideration of every possible cross product of types that could be thrown at this transform and proper treatment of format specifiers. For the current effort this is not required and the format of the data/transform is constrained heavily by the problem at hand.
I hope that over time transforms that are useful to a broader audience get defined and encourage submissions to add to this library.
First some random data to work with (real data is confidential).
# Make up some data
N <- 10000
d1 <- data.frame(
id = 1:N,
procedure = sample(c("A","B","C","D","E","F","G",NA),
N,
replace=TRUE,
prob=c(rep(0.14,7), 0.02)),
category = sample(c("D","E", NA), N, replace=TRUE, prob=c(0.49, 0.49, 0.02)),
prior = pmin(rpois(N, 1), 5),
modality = sample(c("X","Y","Z", NA), N, replace=TRUE, prob=c(0.33, 0.33, 0.33, 0.01)),
albumin = rnorm(N, 3.5, 0.4)
)
map_procedure_cat <- c(
"Incisional",
"Parastomal",
"Incisional and Parastomal",
"Epigastric (primary hernia)",
"Umbilical (primary hernia)",
"Spigelian (primary hernia)",
"Lumbar (primary hernia)"
)
d1$prior <- factor(d1$prior, levels=0:5, labels=c("0", "1", "2", "3", "4", "5+"))
d1$procedure <- factor(d1$procedure, labels=map_procedure_cat)
label(d1$prior) <- "Number of prior hernia repairs (among recurrent)"
label(d1$category) <- "Primary or recurrent"
label(d1$procedure) <- "Ventral hernia procedure"
d1$albumin[sample(1:N,100)] <- NA
label(d1$albumin) <- "Albumin"
units(d1$albumin) <- "g/dL"
# Add a binary coded side effect variable
d1$side_effect <- sample(c(TRUE, FALSE, NA), N, replace=TRUE, prob=c(0.49, 0.49, 0.02))
d1$reported_side_effects <- sample(1:256, N, replace=TRUE)
d1$reported_side_effects[!d1$side_effect] <- NA
label(d1$reported_side_effects) <- "Reported Side Effects"
A function that performs the appropriate statistical test and returns a table cell is very helpful. In this case determination of the appropriate \(\chi^2\) test over pairs of columns in a grid of numbers.
chiTests <- function(grid)
{
lapply(2:dim(grid)[2], FUN=function(i){
consider <- grid[,c(1, i)]
if(min(consider) >= 1)
{
test <- suppressWarnings(chisq.test(consider, correct=FALSE))
stat <- unname(test$statistic) * (sum(consider)-1) / sum(consider)
cell(render_f(pchisq(stat, test$parameter, lower.tail=FALSE), 3), reference=1)
}
else
{
cell(render_f(fisher.test(consider, simulate.p.value=TRUE, B=1e7)$p.value, 3), reference=2)
}
})
}
In this example we are only considering N categories as row and M categories as column. The resulting table is (N+1) X (2M). More specifically a row for the name of the row variable and row for each category present in the row. There is a column for the count (percentage) of each column variable, then pair-wise comparisons with the first category following by the count of missing. Statistical tests only appear in the first row.
The additional logical argument display_percent
is specified to turn on and off the display of percents. By default it’s TRUE and additional arguments passed to tangram
are pushed down into these transforms so one is free to define any additional variables being passed in and out of transforms.
Further this example seeks to avoid use of the %>%
operator for instructional purposes, unlike the original example of using table_builder operators.
fda_cat_by_cat <- function(tb, row, col, display_percent=TRUE, ...)
{
grid <- table(row$data, col$data)
tests <- chiTests(grid)
colN <- lapply(colnames(grid), function(cat) cell_n(sum(grid[,cat]), subcol=cat))
rowlbl <- lapply(rownames(grid), function(x) paste(" ", x))
versus <- paste(colnames(grid)[1], "vs.", colnames(grid)[2:length(colnames(grid))])
# Now construct the table by add rows to each column
tb <- col_header(tb, colnames(grid), versus, "Missing")
tb <- col_header(tb, colN, rep("P-value", length(versus)), "")
tb <- row_header(tb, derive_label(row))
for(nm in rowlbl) tb <- row_header(tb, nm)
for(colnm in colnames(grid))
{
denom <- sum(grid[,colnm])
tb <- add_row(tb, "")
for(rownm in rownames(grid))
{
numer <- grid[rownm, colnm]
x <- if(display_percent) paste0(numer, " (", render_f(100*numer/denom, 1), ")") else
as.character(numer)
tb <- add_row(tb, cell(x, subcol = colnm, subrow = rownm))
}
tb <- new_col(tb)
}
tb <- add_col(tb, tests)
tb <- add_col(tb, length(row$data)-sum(grid))
}
Using this any variables that are factors can be used now to generate a table and render to HTML5.
tbl1 <- tangram("modality ~ procedure + category + prior", d1, transforms=fda_cat_by_cat)
html5(tbl1, fragment=TRUE, inline="nejm.css", caption = "FDA Table 1", id="tbl1")
X | Y | Z | X vs. Y | X vs. Z | Missing | |
3283 | 3191 | 3232 | P-value | P-value | ||
Ventral hernia procedure | 0.3701 | 0.6881 | 294 | |||
Incisional | 456 (13.9) | 482 (15.1) | 466 (14.4) | |||
Parastomal | 465 (14.2) | 449 (14.1) | 474 (14.7) | |||
Incisional and Parastomal | 476 (14.5) | 463 (14.5) | 426 (13.2) | |||
Epigastric (primary hernia) | 488 (14.9) | 452 (14.2) | 467 (14.4) | |||
Umbilical (primary hernia) | 479 (14.6) | 428 (13.4) | 493 (15.3) | |||
Spigelian (primary hernia) | 474 (14.4) | 440 (13.8) | 451 (14.0) | |||
Lumbar (primary hernia) | 445 (13.6) | 477 (14.9) | 455 (14.1) | |||
Primary or recurrent | 0.9881 | 0.9611 | 319 | |||
D | 1626 (49.9) | 1593 (49.9) | 1611 (49.9) | |||
E | 1634 (50.1) | 1602 (50.1) | 1615 (50.1) | |||
Number of prior hernia repairsamong recurrent | 0.5931 | 0.3051 | 113 | |||
0 | 1282 (38.4) | 1211 (37.2) | 1263 (38.4) | |||
1 | 1132 (33.9) | 1151 (35.4) | 1179 (35.8) | |||
2 | 646 (19.3) | 632 (19.4) | 592 (18.0) | |||
3 | 229 (6.9) | 202 (6.2) | 205 (6.2) | |||
4 | 41 (1.2) | 48 (1.5) | 39 (1.2) | |||
5+ | 9 (0.3) | 11 (0.3) | 15 (0.5) |
Next it is needed to allow for row variables that are continuous. We begin with the helper function that creates cells for the tests given the data for a row (x) and colunn (y). In this case we make no distribution assumption about the continuous variable and apply a Wilcoxon rank sum test.
wilcoxTests <- function(x, y)
{
lvls <- levels(y)
lapply(2:length(lvls), FUN=function(i){
test <- wilcox.test(x[y==lvls[1]], x[y==lvls[i]])
cell(render_f(test$p.value, 3), reference=3)
})
}
Similarly we create a table builder for continuous by M category summaries. The resulting table is (4) X (2M). There is a row for the row variable name, and the mean, median and standard deviation. Column’s are the same as above.
fda_cont_by_cat <- function(tb, row, col, ...)
{
datar <- row$data
datac <- col$data
lvls <- levels(datac)
colN <- lapply(lvls, function(cat)
cell_n(length(datac[datac == cat & !is.na(datac)]), subcol=cat))
versus <- paste(lvls[1], "vs.", lvls[2:length(lvls)])
# Now construct the table by add rows to each column
tb <- col_header(tb, lvls, versus, "Missing")
tb <- col_header(tb, colN, rep("P-value", length(versus)), "")
tb <- row_header(tb, derive_label(row))
for(nm in c("Mean", "Median", "SD")) tb <- row_header(tb, paste0(" ",nm))
# Summary
for(colnm in lvls)
{
d <- datar[datac == colnm & !is.na(datac)]
tb <- add_row(tb, "")
tb <- add_row(tb, render_f(mean(d, na.rm=TRUE), row$format))
tb <- add_row(tb, render_f(median(d, na.rm=TRUE), row$format))
tb <- add_row(tb, render_f(sd(d, na.rm=TRUE), row$format))
tb <- new_col(tb)
}
# Tests
tests <- wilcoxTests(datar, datac)
tb <- add_col(tb, tests)
tb <- add_col(tb, length(datar)-sum(!is.na(datar) & !is.na(datac)))
tb
}
This step bundles the two together and based on type of variable decides which transform to apply. We use the hmisc type determination function as a quick guide. Note that some transforms are unsupported as we there was no requirement to provide those cross product tables of variables.
Further we add some descriptive footnotes.
unsupported <- function(tb, row, col) stop("unsupported type", row$value, "X", col$value)
fda <- list(
Type = hmisc_data_type,
Numerical = list(
Numerical = unsupported,
Categorical = fda_cont_by_cat
),
Categorical = list(
Numerical = unsupported,
Categorical = fda_cat_by_cat
),
Footnote = "Count (Percent) format. ^1^ χ^2^ minus one. ^2^ Fisher exact. ^3^ Wilcoxon rank sum"
)
Now a rendering with two forms of information is possible.
tbl2 <- tangram(modality ~ procedure + category + prior + albumin, d1, transforms=fda)
html5(tbl2, fragment=TRUE, inline="nejm.css", caption = "FDA Table 2", id="tbl2")
X | Y | Z | X vs. Y | X vs. Z | Missing | |
3283 | 3191 | 3232 | P-value | P-value | ||
Ventral hernia procedure | 0.3701 | 0.6881 | 294 | |||
Incisional | 456 (13.9) | 482 (15.1) | 466 (14.4) | |||
Parastomal | 465 (14.2) | 449 (14.1) | 474 (14.7) | |||
Incisional and Parastomal | 476 (14.5) | 463 (14.5) | 426 (13.2) | |||
Epigastric (primary hernia) | 488 (14.9) | 452 (14.2) | 467 (14.4) | |||
Umbilical (primary hernia) | 479 (14.6) | 428 (13.4) | 493 (15.3) | |||
Spigelian (primary hernia) | 474 (14.4) | 440 (13.8) | 451 (14.0) | |||
Lumbar (primary hernia) | 445 (13.6) | 477 (14.9) | 455 (14.1) | |||
Primary or recurrent | 0.9881 | 0.9611 | 319 | |||
D | 1626 (49.9) | 1593 (49.9) | 1611 (49.9) | |||
E | 1634 (50.1) | 1602 (50.1) | 1615 (50.1) | |||
Number of prior hernia repairsamong recurrent | 0.5931 | 0.3051 | 113 | |||
0 | 1282 (38.4) | 1211 (37.2) | 1263 (38.4) | |||
1 | 1132 (33.9) | 1151 (35.4) | 1179 (35.8) | |||
2 | 646 (19.3) | 632 (19.4) | 592 (18.0) | |||
3 | 229 (6.9) | 202 (6.2) | 205 (6.2) | |||
4 | 41 (1.2) | 48 (1.5) | 39 (1.2) | |||
5+ | 9 (0.3) | 11 (0.3) | 15 (0.5) | |||
Albuming/dL | 0.4663 | 0.3513 | 210 | |||
Mean | 3.498 | 3.505 | 3.506 | |||
Median | 3.501 | 3.507 | 3.507 | |||
SD | 0.408 | 0.398 | 0.399 |
A tricky binary coded varible for reported side effects needs treatment. In this instance we only want the category in which side effects appeary, i.e. only those individuals with side effects is to be reported. The variable contains a binary number in which each bit represents a different side effect reported.
I have chosen to handle this in the formula syntax with the *
operator for now. I have debated adding the traditional |
denoting nested models to the formula syntax, but at present even handling the *
properly is complicated and incomplete.
Secondly as mentioned above additional variables are passed down to the transform which can make use of them. This is useful now for passing in a binary transform table (but it would be used for all transforms if multiple existed in a table, further refinement of list of lists could be used if needed).
The basic approach is to expand the data into a long form, then pass to original cat X cat function using the display_percent logical to turn that off.
side_effect_key = list(
"Repetative Uttering of Wut?",
"Excessive Sweating",
"Hairy Navel",
"Breaking Voice",
"Beiber Fever",
"Swiftaphila",
"Akward Elbows",
"Veruca"
)
fda_binary <- function(tb, row, col, binary_key=list(), ...)
{
inside <- row$right$data # Grouped inside the right hand side of '*' assuming logical
inside[is.na(inside)] <- FALSE
datar <- row$left$data[inside] # Data to further group
datac <- col$data[inside]
# Expand for counting
x <- rep(datar, each=length(binary_key))
y <- rep(datac, each=length(binary_key))
key <- rep(1:length(binary_key), length(datar))
present <- bitwAnd(x, 2^(key-1)) > 0
# Filter down
x <- factor(sapply(key[present], function(x) binary_key[[x]]))
y <- y[present]
rname <- paste0(row$left$name(), " N=", sum(inside))
fda_cat_by_cat(tb, list(data=x, name=function() rname), list(data=y, name=col$name),
display_percent=FALSE)
}
fda_data_type <- function(x, category_threshold=NA)
{
if(is.categorical(x,category_threshold)) "Categorical" else
if(is.numeric(x)) "Numerical" else
stop(paste("Unsupported class/type - ",class(x), typeof(x)))
}
# Note the second dimension isn't present, it only determines function call by type of Row
# If provided a list of types to functions for each argument a cross product of types
# determines the functional transform. But this is a nice short cut provided.
fda <- list(
Type = fda_data_type,
Numerical = fda_cont_by_cat,
Categorical = fda_cat_by_cat,
ASTMultiply = fda_binary,
Footnote = "Count (Percent) format. ^1^ χ^2^ minus one. ^2^ Fisher exact. ^3^ Wilcoxon rank sum"
)
Now we have 3 different pieces completed.
tbl3 <- tangram(modality ~ procedure + category + prior + albumin + reported_side_effects*side_effect, d1, transforms=fda, binary_key=side_effect_key)
html5(tbl3, fragment=TRUE, inline="nejm.css", caption = "FDA Table 3", id="tbl3")
X | Y | Z | X vs. Y | X vs. Z | Missing | |
3283 | 3191 | 3232 | P-value | P-value | ||
Ventral hernia procedure | 0.3701 | 0.6881 | 294 | |||
Incisional | 456 (13.9) | 482 (15.1) | 466 (14.4) | |||
Parastomal | 465 (14.2) | 449 (14.1) | 474 (14.7) | |||
Incisional and Parastomal | 476 (14.5) | 463 (14.5) | 426 (13.2) | |||
Epigastric (primary hernia) | 488 (14.9) | 452 (14.2) | 467 (14.4) | |||
Umbilical (primary hernia) | 479 (14.6) | 428 (13.4) | 493 (15.3) | |||
Spigelian (primary hernia) | 474 (14.4) | 440 (13.8) | 451 (14.0) | |||
Lumbar (primary hernia) | 445 (13.6) | 477 (14.9) | 455 (14.1) | |||
Primary or recurrent | 0.9881 | 0.9611 | 319 | |||
D | 1626 (49.9) | 1593 (49.9) | 1611 (49.9) | |||
E | 1634 (50.1) | 1602 (50.1) | 1615 (50.1) | |||
Number of prior hernia repairsamong recurrent | 0.5931 | 0.3051 | 113 | |||
0 | 1282 (38.4) | 1211 (37.2) | 1263 (38.4) | |||
1 | 1132 (33.9) | 1151 (35.4) | 1179 (35.8) | |||
2 | 646 (19.3) | 632 (19.4) | 592 (18.0) | |||
3 | 229 (6.9) | 202 (6.2) | 205 (6.2) | |||
4 | 41 (1.2) | 48 (1.5) | 39 (1.2) | |||
5+ | 9 (0.3) | 11 (0.3) | 15 (0.5) | |||
Albuming/dL | 0.4663 | 0.3513 | 210 | |||
Mean | 3.498 | 3.505 | 3.506 | |||
Median | 3.501 | 3.507 | 3.507 | |||
SD | 0.408 | 0.398 | 0.399 | |||
Reported Side Effects N=4931 | 0.9811 | 0.8421 | 236 | |||
Akward Elbows | 829 | 807 | 799 | |||
Beiber Fever | 820 | 823 | 779 | |||
Breaking Voice | 820 | 791 | 735 | |||
Excessive Sweating | 823 | 830 | 809 | |||
Hairy Navel | 849 | 814 | 785 | |||
Repetative Uttering of Wut? | 832 | 802 | 807 | |||
Swiftaphila | 834 | 791 | 790 | |||
Veruca | 835 | 795 | 748 |
And the requested table tranforms for FDA work is complete.
Here is the final version of the code:
###############################################################
# Statistical tests as table cells
#
chiTests <- function(grid)
{
lapply(2:dim(grid)[2], FUN=function(i){
consider <- grid[,c(1, i)]
if(min(consider) >= 1)
{
test <- suppressWarnings(chisq.test(consider, correct=FALSE))
stat <- unname(test$statistic) * (sum(consider)-1) / sum(consider)
cell(render_f(pchisq(stat, test$parameter, lower.tail=FALSE), 3), reference=1)
}
else
{
cell(render_f(fisher.test(consider, simulate.p.value=TRUE, B=1e7)$p.value, 3), reference=2)
}
})
}
wilcoxTests <- function(x, y)
{
lvls <- levels(y)
lapply(2:length(lvls), FUN=function(i){
test <- wilcox.test(x[y==lvls[1]], x[y==lvls[i]])
cell(render_f(test$p.value, 3), reference=3)
})
}
###############################################################
# Row/Column from abstract syntax tree transforms to tables
#
fda_cat_by_cat <- function(tb, row, col, display_percent=TRUE, ...)
{
grid <- table(row$data, col$data)
tests <- chiTests(grid)
colN <- lapply(colnames(grid), function(cat) cell_n(sum(grid[,cat]), subcol=cat))
rowlbl <- lapply(rownames(grid), function(x) paste(" ", x))
versus <- paste(colnames(grid)[1], "vs.", colnames(grid)[2:length(colnames(grid))])
# Now construct the table by add rows to each column
tb <- col_header(tb, colnames(grid), versus, "Missing")
tb <- col_header(tb, colN, rep("P-value", length(versus)), "")
tb <- row_header(tb, derive_label(row))
for(nm in rowlbl) tb <- row_header(tb, nm)
for(colnm in colnames(grid))
{
denom <- sum(grid[,colnm])
tb <- add_row(tb, "")
for(rownm in rownames(grid))
{
numer <- grid[rownm, colnm]
x <- if(display_percent) paste0(numer, " (", render_f(100*numer/denom, 1), ")") else
as.character(numer)
tb <- add_row(tb, cell(x, subcol = colnm, subrow = rownm))
}
tb <- new_col(tb)
}
tb <- add_col(tb, tests)
tb <- add_col(tb, length(row$data)-sum(grid))
}
fda_cont_by_cat <- function(tb, row, col, ...)
{
datar <- row$data
datac <- col$data
lvls <- levels(datac)
colN <- lapply(lvls, function(cat)
cell_n(length(datac[datac == cat & !is.na(datac)]), subcol=cat))
versus <- paste(lvls[1], "vs.", lvls[2:length(lvls)])
# Now construct the table by add rows to each column
tb <- col_header(tb, lvls, versus, "Missing")
tb <- col_header(tb, colN, rep("P-value", length(versus)), "")
tb <- row_header(tb, derive_label(row))
for(nm in c("Mean", "Median", "SD")) tb <- row_header(tb, paste0(" ",nm))
# Summary
for(colnm in lvls)
{
d <- datar[datac == colnm & !is.na(datac)]
tb <- add_row(tb, "")
tb <- add_row(tb, render_f(mean(d, na.rm=TRUE), row$format))
tb <- add_row(tb, render_f(median(d, na.rm=TRUE), row$format))
tb <- add_row(tb, render_f(sd(d, na.rm=TRUE), row$format))
tb <- new_col(tb)
}
# Tests
tests <- wilcoxTests(datar, datac)
tb <- add_col(tb, tests)
tb <- add_col(tb, length(datar)-sum(!is.na(datar) & !is.na(datac)))
tb
}
fda_binary <- function(tb, row, col, binary_key=list(), ...)
{
inside <- row$right$data # Grouped inside the right hand side of '*' assuming logical
inside[is.na(inside)] <- FALSE
datar <- row$left$data[inside] # Data to further group
datac <- col$data[inside]
# Expand for counting
x <- rep(datar, each=length(binary_key))
y <- rep(datac, each=length(binary_key))
key <- rep(1:length(binary_key), length(datar))
present <- bitwAnd(x, 2^(key-1)) > 0
# Filter down
x <- factor(sapply(key[present], function(x) binary_key[[x]]))
y <- y[present]
rname <- paste0(row$left$name(), " N=", sum(inside))
fda_cat_by_cat(tb, list(data=x, name=function() rname), list(data=y, name=col$name),
display_percent=FALSE)
}
###############################################################
# Data typing function to use with above
#
fda_data_type <- function(x, category_threshold=NA)
{
if(is.categorical(x,category_threshold)) "Categorical" else
if(is.numeric(x)) "Numerical" else
stop(paste("Unsupported class/type - ",class(x), typeof(x)))
}
###############################################################
#
# Bring it all together into a useable bundle of transforms
fda <- list(
Type = fda_data_type,
Numerical = fda_cont_by_cat,
Categorical = fda_cat_by_cat,
ASTMultiply = fda_binary,
Footnote = "Count (Percent) format. ^1^ χ^2^ minus one. ^2^ Fisher exact. ^3^ Wilcoxon rank sum"
)