While broom is useful for summarizing the result of a single analysis in a consistent format, it is really designed for high-throughput applications, where you must combine results from multiple analyses. These could be subgroups of data, analyses using different models, bootstrap replicates, permutations, and so on. In particular, it plays well with the group_by and do functions in dplyr.
Let's try this on a simple dataset, the built-in Orange data frame.
library(broom)
library(dplyr)
data(Orange)
dim(Orange)
## [1] 35 3
head(Orange)
## Grouped Data: circumference ~ age | Tree
## Tree age circumference
## 1 1 118 30
## 2 1 484 58
## 3 1 664 87
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
This contains 35 observations of three variables: Tree, age, and circumference. Tree is a factor with five levels describing five trees. As might be expected, age and circumference are correlated:
cor(Orange$age, Orange$circumference)
## [1] 0.9135189
library(ggplot2)
ggplot(Orange, aes(age, circumference, color = Tree)) + geom_line()
Suppose you want to test for correlations individually within each tree. You can do this with dplyr's group_by:
Orange %>% group_by(Tree) %>% summarize(correlation = cor(age, circumference))
## # A tibble: 5 × 2
## Tree correlation
## <ord> <dbl>
## 1 3 0.9881766
## 2 1 0.9854675
## 3 5 0.9877376
## 4 2 0.9873624
## 5 4 0.9844610
(Note that the within-tree correlations are much higher than the aggregated one, and furthermore we can now see that they are similar across trees.)
Suppose that instead of simply estimating a correlation, we want to perform a hypothesis test with cor.test:
cor.test(Orange$age, Orange$circumference)
##
## Pearson's product-moment correlation
##
## data: Orange$age and Orange$circumference
## t = 12.9, df = 33, p-value = 1.931e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8342364 0.9557955
## sample estimates:
## cor
## 0.9135189
This contains multiple values we could want in our output. Some are vectors of length 1, such as the p-value and the estimate, and some are longer, such as the confidence interval. broom's tidy S3 method, combined with dplyr's do, makes it easy to summarize the information about each test:
Orange %>% group_by(Tree) %>% do(tidy(cor.test(.$age, .$circumference)))
## Source: local data frame [5 x 9]
## Groups: Tree [5]
##
## Tree estimate statistic p.value parameter conf.low conf.high
## <ord> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 3 0.9881766 14.41188 2.901046e-05 5 0.9189858 0.9983260
## 2 1 0.9854675 12.97258 4.851902e-05 5 0.9012111 0.9979400
## 3 5 0.9877376 14.14686 3.177093e-05 5 0.9160865 0.9982635
## 4 2 0.9873624 13.93129 3.425041e-05 5 0.9136142 0.9982101
## 5 4 0.9844610 12.53575 5.733090e-05 5 0.8946782 0.9977964
## # ... with 2 more variables: method <fctr>, alternative <fctr>
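Because each per-tree test is now a row with ordinary columns (estimate, conf.low, conf.high), plotting the results takes only standard ggplot2 code. Here is a minimal sketch, not part of the original example:

# Plot each tree's correlation estimate with its confidence interval
Orange %>%
  group_by(Tree) %>%
  do(tidy(cor.test(.$age, .$circumference))) %>%
  ggplot(aes(Tree, estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2)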
This becomes even more useful when applied to regressions, which give more than one row of output within each model:
Orange %>% group_by(Tree) %>% do(tidy(lm(age ~ circumference, data=.)))
## Source: local data frame [10 x 6]
## Groups: Tree [5]
##
## Tree term estimate std.error statistic p.value
## <ord> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 3 (Intercept) -209.512321 85.2682904 -2.4570954 5.743323e-02
## 2 3 circumference 12.038885 0.8353445 14.4118806 2.901046e-05
## 3 1 (Intercept) -264.673437 98.6205569 -2.6837553 4.362351e-02
## 4 1 circumference 11.919245 0.9188029 12.9725813 4.851902e-05
## 5 5 (Intercept) -54.484097 76.8862788 -0.7086323 5.102144e-01
## 6 5 circumference 8.787132 0.6211365 14.1468610 3.177093e-05
## 7 2 (Intercept) -132.439725 83.1314146 -1.5931369 1.720092e-01
## 8 2 circumference 7.795225 0.5595479 13.9312907 3.425041e-05
## 9 4 (Intercept) -76.513671 88.2943757 -0.8665747 4.257969e-01
## 10 4 circumference 7.169842 0.5719516 12.5357484 5.733090e-05
You can just as easily perform multiple regressions within each group, as shown here on the mtcars dataset. We group the data into automatic and manual cars (the am column), then perform the regression within each group.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars %>% group_by(am) %>% do(tidy(lm(wt ~ mpg + qsec + gear, .)))
## Source: local data frame [8 x 6]
## Groups: am [2]
##
## am term estimate std.error statistic p.value
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0 (Intercept) 4.91754623 1.39665675 3.52094116 0.0030879519
## 2 0 mpg -0.19188914 0.04428329 -4.33321746 0.0005909953
## 3 0 qsec 0.09191361 0.09832067 0.93483509 0.3646797728
## 4 0 gear 0.14653754 0.36819363 0.39799041 0.6962441554
## 5 1 (Intercept) 4.28307028 3.45859958 1.23838281 0.2469014834
## 6 1 mpg -0.10098320 0.02943409 -3.43082488 0.0074984578
## 7 1 qsec 0.03983165 0.15112135 0.26357393 0.7980436972
## 8 1 gear -0.02288330 0.34878226 -0.06560912 0.9491232955
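Since each coefficient occupies its own row, a natural follow-up (a sketch, not from the original text) is to filter on the term column and compare a single predictor across the two transmission types:

# Compare the mpg coefficient between automatic (am = 0) and manual (am = 1) cars
mtcars %>%
  group_by(am) %>%
  do(tidy(lm(wt ~ mpg + qsec + gear, .))) %>%
  filter(term == "mpg")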
What if you want not just the tidy output, but the augment and glance outputs as well, while still performing each regression only once? First, save the modeling result into a column called fit.
regressions <- mtcars %>% group_by(cyl) %>%
  do(fit = lm(wt ~ mpg + qsec + gear, .))
regressions
## Source: local data frame [3 x 2]
## Groups: <by row>
##
## # A tibble: 3 × 2
## cyl fit
## * <dbl> <list>
## 1 4 <S3: lm>
## 2 6 <S3: lm>
## 3 8 <S3: lm>
This creates a rowwise data frame. Tidying methods are designed to work seamlessly with rowwise data frames, grouping them and performing tidying on each row:
regressions %>% tidy(fit)
## Source: local data frame [12 x 6]
## Groups: cyl [3]
##
## cyl term estimate std.error statistic p.value
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4 (Intercept) -0.772662398 2.22788026 -0.346815047 0.738922522
## 2 4 mpg -0.081831569 0.02381787 -3.435721470 0.010900258
## 3 4 qsec 0.216651715 0.07590773 2.854145649 0.024542352
## 4 4 gear 0.267469618 0.24454173 1.093758587 0.310264464
## 5 6 (Intercept) -7.785808290 3.35484932 -2.320762438 0.103012430
## 6 6 mpg 0.043283282 0.05196724 0.832895538 0.466010439
## 7 6 qsec 0.421998315 0.09136817 4.618657640 0.019102466
## 8 6 gear 0.638320019 0.20524143 3.110093347 0.052877671
## 9 8 (Intercept) 0.005971578 4.27460511 0.001396989 0.998912840
## 10 8 mpg -0.176924561 0.05570852 -3.175897889 0.009888094
## 11 8 qsec 0.369405813 0.19309553 1.913072859 0.084773637
## 12 8 gear 0.142762416 0.31664127 0.450864838 0.661705480
regressions %>% augment(fit)
## Source: local data frame [32 x 12]
## Groups: cyl [3]
##
## cyl wt mpg qsec gear .fitted .se.fit .resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 2.320 22.8 18.61 4 2.463345 0.1418571 -0.1433447361
## 2 4 3.190 24.4 20.00 4 2.633560 0.1199552 0.5564398892
## 3 4 3.150 22.8 22.90 4 3.392781 0.2989223 -0.2427805952
## 4 4 2.200 32.4 19.47 4 1.864082 0.1738468 0.3359178471
## 5 4 1.615 30.4 18.52 4 1.821926 0.1475401 -0.2069261605
## 6 4 1.835 33.9 19.90 4 1.834495 0.2098610 0.0005049623
## 7 4 2.465 21.5 20.01 3 2.605569 0.2493871 -0.1405685584
## 8 4 1.935 27.3 18.90 4 2.157932 0.1045713 -0.2229316749
## 9 4 2.140 26.0 16.70 5 2.055149 0.2176428 0.0848514414
## 10 4 1.513 30.4 16.90 5 1.738420 0.2009845 -0.2254199999
## # ... with 22 more rows, and 4 more variables: .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>, .std.resid <dbl>
regressions %>% glance(fit)
## Source: local data frame [3 x 12]
## Groups: cyl [3]
##
## cyl r.squared adj.r.squared sigma statistic p.value df
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 4 0.7799131 0.6855901 0.31936726 8.268539 0.010597767 4
## 2 6 0.9699947 0.9399893 0.08729425 32.327394 0.008743763 4
## 3 8 0.6521278 0.5477661 0.51068708 6.248728 0.011614605 4
## # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>
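Because glance produces exactly one row per model, ordinary dplyr verbs apply directly. As a quick sketch (not in the original), you could rank the per-cylinder fits:

# Order the models by adjusted R-squared
regressions %>%
  glance(fit) %>%
  arrange(desc(adj.r.squared)) %>%
  select(cyl, adj.r.squared, sigma, p.value)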
By combining the estimates and p-values across all groups into the same tidy data frame (instead of, for example, a list of output model objects), a new class of analyses and visualizations becomes straightforward. This includes:

- sorting by p-value or estimate to find the most significant terms across all tests,
- p-value histograms, and
- volcano plots comparing p-values to effect size estimates.

In each of these cases, we can easily filter, facet, or distinguish based on the term column. In short, this makes the tools of tidy data analysis available for the results of data analysis and models, not just the inputs.
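For example, a p-value histogram faceted by term can be drawn directly from the combined output. The sketch below reuses the per-cylinder regressions from above and is illustrative rather than part of the original text:

# Histogram of p-values across all groups, one panel per coefficient
regressions %>%
  tidy(fit) %>%
  ggplot(aes(p.value)) +
  geom_histogram(binwidth = 0.05) +
  facet_wrap(~ term)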