Our team here at Vanderbilt relies quite a bit on exploratory data analysis, so summary statistics are an important first step. One of the tools we’ve brought to bear on this is tangram, which allows for customized statistical tables. Customization aside, the question is: how can I get a set of summary statistics into an Rmd document quickly and painlessly using tangram?

Turns out it’s easy! The statistics can be printed at the console, and the same table layout can be rendered to HTML5, LaTeX, or RTF and styled to your taste.

# History

tangram was originally developed as a replacement for Hmisc::summaryM, as we needed to be able to index values and trace them through a collaboration process. To do that, however, a framework was invented that allows complete user customization at each step of the pipeline: Parsing, Transformation, and Rendering.

Parsing takes an R formula and generates an Abstract Syntax Tree (AST) which contains pieces of a data frame. This in and of itself is a usable piece outside of table generation. Getting this AST generation to exactly match the syntax of R formulas is ongoing work. Care has been taken to make sure that no semantics are present in the AST, i.e., the meaning of anything inside a formula is not coded into the AST representation. Semantic meaning is given by the specified transform, the default being summaryM. The cross product of the terms on each side of the formula creates a set of rows and columns, and each row/column pair is passed to a transform selected by data type from a list of lists. Data typing is also a customizable part of a transform bundle.

The final abstract table object is renderable to a wide variety of formats, and once again this is user customizable. See the trend here? Everything is overridable, so my opinions on statistics or summaries are not enforced upon the end user and are not a limiting factor of the library.
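To make the transformation step concrete, here is a toy sketch in base R of dispatching on data type over the cross product of formula terms. It is not tangram's implementation: `toy_type`, `toy_transforms`, and `toy_table` are hypothetical names invented for this illustration, and the real package parses formulas into an AST rather than calling `all.vars`.

```r
# Toy illustration only: none of these names come from tangram.

# Hypothetical data typing: classify a column as numerical or categorical.
toy_type <- function(x) if (is.numeric(x)) "Numerical" else "Categorical"

# Hypothetical transform bundle: a list of lists of functions, keyed first by
# the row variable's type and then by the column variable's type.
toy_transforms <- list(
  Numerical = list(
    Categorical = function(row, col) tapply(row, col, median)
  ),
  Categorical = list(
    Categorical = function(row, col) table(row, col)
  )
)

toy_table <- function(formula, data) {
  col_vars <- all.vars(formula[[2]])  # left-hand side terms become columns
  row_vars <- all.vars(formula[[3]])  # right-hand side terms become rows
  cells <- list()
  # Visit every row/column pair and apply the transform chosen by type.
  for (rv in row_vars) {
    for (cv in col_vars) {
      f <- toy_transforms[[toy_type(data[[rv]])]][[toy_type(data[[cv]])]]
      cells[[paste(rv, cv, sep = " x ")]] <- f(data[[rv]], data[[cv]])
    }
  }
  cells
}

cells <- toy_table(Species ~ Petal.Length + Petal.Width, iris)
names(cells)
```

Swapping a different function into `toy_transforms` changes what every cell of that type pair contains without touching the formula handling, which is the kind of separation the paragraph above describes.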

Working out the interfaces inside this package has been quite a journey of discovery. An early proof of concept has gone through several rounds of refactoring, with each pass deleting more code while increasing functionality. The interfaces between these layers have begun to stabilize, and refactors in one area no longer spill over into others in major ways. To this end, the internal representation is a modified version of Markdown. There’s still a lot of work to do to fully realize the vision, but what’s available now is useful, and useful is the most important property of any model or tool.

# Quick Summaries!

By default, tangram takes a data frame, and if it contains columns of class cell it renders them directly as a table. Otherwise, it generates a summary versus an intercept model.

    tangram(iris)

    =============================================================================
                    N                   All                     Test Statistic
                                        150
    -----------------------------------------------------------------------------
    Sepal.Length   150  4.30 5.10 *5.80* 6.40 7.90 5.84±0.07   V=11325, P=0.000
    Sepal.Width    150  2.00 2.80 *3.00* 3.31 4.40 3.06±0.04   V=11325, P=0.000
    Petal.Length   150  1.00 1.59 *4.35* 5.10 6.90 3.76±0.14   V=11325, P=0.000
    Petal.Width    150  0.10 0.30 *1.30* 1.80 2.50 1.20±0.06   V=11325, P=0.000
    Species        150                                        X^2_2=0.00, P=1.000
       setosa                      0.333   50/150
       versicolor                  0.333   50/150
       virginica                   0.333   50/150
    =============================================================================

It’s clear that a better breakdown model is possible since Species is a factor. One can then switch to a formula interface.

    iris_descrip <- tangram(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width, iris)
    iris_descrip

    ==================================================================================================
                   N        setosa          versicolor        virginica           Test Statistic
                              50                50                50
    --------------------------------------------------------------------------------------------------
    Petal.Length  150  1.40 *1.50* 1.60  4.00 *4.35* 4.60  5.10 *5.55* 5.90  F_{2,147}=515.64, P=0.000
    Petal.Width   150  0.20 *0.20* 0.30  1.20 *1.30* 1.50  1.80 *2.00* 2.30  F_{2,147}=541.25, P=0.000
    Sepal.Length  150  4.80 *5.00* 5.20  5.60 *5.90* 6.30  6.20 *6.50* 6.92  F_{2,147}=136.85, P=0.000
    Sepal.Width   150  3.19 *3.40* 3.70  2.50 *2.80* 3.00  2.80 *3.00* 3.20  F_{2,147}=54.69, P=0.000
    ==================================================================================================
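As a sanity check on one row of this table, the per-species quartiles and a Kruskal-Wallis test can be approximated with base R alone. This is a sketch, not how tangram computes its cells: base R's default quantile definition may differ from the one behind the table, so the outer quartiles can be slightly off, and `kruskal.test` reports a chi-squared statistic rather than the F form shown above. The names `petal_quartiles` and `petal_kw` are just for this illustration.

```r
# Cross-check of the Petal.Length row using base R only (no tangram calls).

# Quartiles of Petal.Length within each species (default quantile type).
petal_quartiles <- lapply(
  split(iris$Petal.Length, iris$Species),
  quantile, probs = c(0.25, 0.50, 0.75)
)

# Kruskal-Wallis rank sum test of Petal.Length across the three species.
petal_kw <- kruskal.test(Petal.Length ~ Species, data = iris)

petal_quartiles$versicolor
petal_kw$p.value
```

The medians should line up with the starred values in the table, and the very small p-value is consistent with the `P=0.000` shown in the Test Statistic column.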

Maybe now, with that set of descriptive statistics in hand, one is ready to include it in an Rmd for collaborator consumption. It’s the same object; only a different renderer is called.

    html5(iris_descrip,
          fragment=TRUE, inline="nejm.css", caption = "Iris Stats", id="tbl_iris")
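Iris Stats

|              | N   | setosa (N = 50) | versicolor (N = 50) | virginica (N = 50) | Test Statistic                 |
|--------------|-----|-----------------|---------------------|--------------------|--------------------------------|
| Petal.Length | 150 | 1.40 1.50 1.60  | 4.00 4.35 4.60      | 5.10 5.55 5.90     | F_{2,147} = 515.64, P = 0.000¹ |
| Petal.Width  | 150 | 0.20 0.20 0.30  | 1.20 1.30 1.50      | 1.80 2.00 2.30     | F_{2,147} = 541.25, P = 0.000¹ |
| Sepal.Length | 150 | 4.80 5.00 5.20  | 5.60 5.90 6.30      | 6.20 6.50 6.92     | F_{2,147} = 136.85, P = 0.000¹ |
| Sepal.Width  | 150 | 3.19 3.40 3.70  | 2.50 2.80 3.00      | 2.80 3.00 3.20     | F_{2,147} = 54.69, P = 0.000¹  |

N is the number of non-missing values. ¹ Kruskal-Wallis test. ² Pearson test.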