Furniture

2017-07-07

Using furniture

We will first make a fictitious data set:

df <- data.frame(a = rnorm(1000, 1.5, 2), 
                 b = seq(1, 1000, 1), 
                 c = c(rep("control", 400), rep("Other", 70), rep("treatment", 500), rep("None", 30)),
                 d = c(sample(1:1000, 900, replace=TRUE), rep(-99, 100)))

There are two functions that we’ll demonstrate here:

  1. washer
  2. table1

Washer

washer is a great function for quick data cleaning. It helps in situations where there are placeholders, extra levels in a factor, or several values that need to be changed to another.

library(furniture)
library(tidyverse)

df <- df %>%
  mutate(d = washer(d, -99),  ## changes the placeholder -99 to NA
         c = washer(c, "Other", "None", value = "control")) ## changes "Other" and "None" to "control"
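
To make the replacement rule concrete, here is a minimal base R sketch of the idea behind the two calls above. The helper wash_values is hypothetical and is not the package's implementation; it only covers the case where the values to replace are listed explicitly, and it assumes a numeric or character vector rather than a factor.

```r
# Hypothetical helper (not part of furniture): replace every element of `x`
# that matches one of the values given in `...` with `value` (NA by default).
wash_values <- function(x, ..., value = NA) {
  targets <- c(...)
  x[x %in% targets] <- value
  x
}

# Made-up vectors that mimic the placeholder and extra-level problems
d_raw <- c(12, -99, 40, -99)
c_raw <- c("control", "Other", "treatment", "None")

d_clean <- wash_values(d_raw, -99)
c_clean <- wash_values(c_raw, "Other", "None", value = "control")

print(d_clean)
print(c_clean)
```

The default of NA handles placeholders, while passing value collapses several labels into one.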

Table1

Now that the data is “washed”, we can start exploring and reporting.

table1(df, a, b, factor(c), d)
## 
## |==============================|
##               Mean/Count (SD/%)
##  Observations 1000             
##  a                             
##               1.6 (2.0)        
##  b                             
##               500.5 (288.8)    
##  factor(c)                     
##     control   500 (50%)        
##     treatment 500 (50%)        
##  d                             
##               488.3 (286.8)    
## |==============================|

The variables must be numeric or factor. Because table1 uses non-standard evaluation, we can transform variables inside the call (e.g., factor(c)). The same mechanism lets us create a whole new variable inside the call as well.

table1(df, a, b, d, ifelse(a > 1, 1, 0))
## 
## |=====================================|
##                      Mean/Count (SD/%)
##  Observations        1000             
##  a                                    
##                      1.6 (2.0)        
##  b                                    
##                      500.5 (288.8)    
##  d                                    
##                      488.3 (286.8)    
##  ifelse(a > 1, 1, 0)                  
##                      0.6 (0.5)        
## |=====================================|
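
For readers unfamiliar with non-standard evaluation, the following base R sketch shows one way a function can accept expressions such as ifelse(a > 1, 1, 0) and evaluate them inside a data frame. The helper summarise_exprs and the data frame toy are hypothetical names used only for illustration; this is not how table1 is necessarily implemented.

```r
# Hypothetical helper (not part of furniture): capture the unevaluated
# expressions passed in `...` and evaluate each one with the columns of
# `data` in scope.
summarise_exprs <- function(data, ...) {
  env <- parent.frame()                   # where non-column names are looked up
  exprs <- eval(substitute(alist(...)))   # list of unevaluated expressions
  labels <- vapply(exprs,
                   function(e) paste(deparse(e), collapse = ""),
                   character(1))
  means <- vapply(exprs,
                  function(e) mean(eval(e, data, env), na.rm = TRUE),
                  numeric(1))
  setNames(means, labels)
}

# Made-up data for the illustration
toy <- data.frame(a = c(0.5, 1.5, 2.5, 3.5))
res <- summarise_exprs(toy, a, ifelse(a > 1, 1, 0))
print(res)
```

Because the expression is captured before it is evaluated, its text can double as the row label, which is why the table above shows ifelse(a > 1, 1, 0) as a variable name.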

This is just the beginning, though. Two powerful features, stratifying by a grouping variable and testing for group differences, are shown below:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE)
## 
## |=======================================================|
##                                  factor(c) 
##                      control       treatment     P-Value
##  Observations        500           500                  
##  a                                               0.175  
##                      1.5 (2.0)     1.7 (1.9)            
##  b                                               <.001  
##                      280.5 (221.7) 720.5 (144.5)        
##  d                                               0.384  
##                      496.2 (286.9) 479.5 (286.7)        
##  ifelse(a > 1, 1, 0)                             0.513  
##                      0.6 (0.5)     0.6 (0.5)            
## |=======================================================|

The splitby = ~factor(c) argument stratifies the means and counts by a factor variable (in this case, either control or treatment). When we use it, we can also automatically compute tests of significance with test = TRUE.
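
As a rough picture of what a stratified row contains, the base R sketch below computes group-wise means and standard deviations plus a p-value by hand. The data frame toy is made up, and the two-sample t-test is an assumption standing in for whatever test table1 applies; this document does not state which test is used.

```r
# Made-up data, used only for this illustration
toy <- data.frame(
  score = c(1, 2, 3, 4, 5, 6, 7, 8),
  group = factor(rep(c("control", "treatment"), each = 4))
)

# Group-wise mean and SD, analogous to the stratified columns
group_means <- tapply(toy$score, toy$group, mean)
group_sds   <- tapply(toy$score, toy$group, sd)

# Assumption: a two-sample t-test stands in for the P-Value column
p_value <- t.test(score ~ group, data = toy)$p.value

print(group_means)
print(group_sds)
print(p_value)
```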

Finally, you can polish the table with a few other options. For example, var_names supplies display names for the rows, and type selects a simplified, condensed layout:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       type = c("simple", "condensed"))
## 
## |================================================|
##                           factor(c) 
##               control       treatment     P-Value
##  Observations 500           500                  
##  A            1.5 (2.0)     1.7 (1.9)     0.175  
##  B            280.5 (221.7) 720.5 (144.5) <.001  
##  D            496.2 (286.9) 479.5 (286.7) 0.384  
##  New Var      0.6 (0.5)     0.6 (0.5)     0.513  
## |================================================|

You can also format the numbers (adding a comma to large numbers, e.g., 20,000 instead of 20000):

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       format_number = TRUE)
## 
## |================================================|
##                           factor(c) 
##               control       treatment     P-Value
##  Observations 500           500                  
##  A                                        0.175  
##               1.5 (2.0)     1.7 (1.9)            
##  B                                        <.001  
##               280.5 (221.7) 720.5 (144.5)        
##  D                                        0.384  
##               496.2 (286.9) 479.5 (286.7)        
##  New Var                                  0.513  
##               0.6 (0.5)     0.6 (0.5)            
## |================================================|
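
None of the values in this example is large enough to show a comma. The kind of formatting meant here can be seen with base R's format() and its big.mark argument; whether table1 uses this function internally is not stated in this document, so treat the sketch as an illustration of the effect only.

```r
# Base R: insert a comma as the thousands separator
plain     <- format(20000)
formatted <- format(20000, big.mark = ",")

print(plain)
print(formatted)
```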

The table can be exported directly to a folder called “Table1” in the working directory. The export argument takes a string that becomes the name of the CSV file containing the formatted table.

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       format_number = TRUE,
       export = "example_table1")
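
The effect of the export step can be approximated by hand in base R, as sketched below. The object summary_tbl is a made-up stand-in for the formatted table, the file is written under tempdir() rather than the working directory so the sketch leaves no files behind, and the exact file name with a .csv extension is an assumption.

```r
# Made-up stand-in for the formatted table (values copied from the output above)
summary_tbl <- data.frame(
  Variable  = c("A", "B"),
  control   = c("1.5 (2.0)", "280.5 (221.7)"),
  treatment = c("1.7 (1.9)", "720.5 (144.5)")
)

# Create a "Table1" folder under a temporary directory and write the CSV there
out_dir <- file.path(tempdir(), "Table1")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "example_table1.csv")
write.csv(summary_tbl, out_file, row.names = FALSE)

# Read the file back to confirm the table survived the round trip
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
print(round_trip)
```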

The table can also be output as a LaTeX, markdown, or pandoc table (matching all the output types of knitr::kable). The following call produces a LaTeX table:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       output = "latex")

The last item to show regarding table1() is that the table can be printed in a simplified and condensed form. Instead of reporting counts and percentages for categorical variables, it reports only percentages, and the table has much less white space.

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       type = c("simple", "condensed"))
## 
## |================================================|
##                           factor(c) 
##               control       treatment     P-Value
##  Observations 500           500                  
##  A            1.5 (2.0)     1.7 (1.9)     0.175  
##  B            280.5 (221.7) 720.5 (144.5) <.001  
##  D            496.2 (286.9) 479.5 (286.7) 0.384  
##  New Var      0.6 (0.5)     0.6 (0.5)     0.513  
## |================================================|

Conclusion

The two functions, table1 and washer, add simplicity to cleaning up and understanding your data. Use these pieces of furniture to make your quantitative life a bit easier.