The corrgram package is an implementation of correlograms. This vignette reproduces most of the figures in Friendly (2002).
library("knitr")
opts_chunk$set(fig.align="center", fig.width=6, fig.height=6)
options(width=90)
The data are 11 measures of performance and salary for 263 baseball players in the 1986 baseball season in the United States. The data were used in the 1988 Data Expo at the Joint Statistical Meetings.
The first 6 rows of the data and the upper-left corner of the correlation matrix are given below.
library("corrgram")
head(baseball)
##            Name League Team Position Atbat Hits Homer Runs RBI Walks Years Atbatc Hitsc
## 1 Andy Allanson      A  CLE        C   293   66     1   30  29    14     1    293    66
## 2    Alan Ashby      N  HOU        C   315   81     7   24  38    39    14   3449   835
## 3   Alvin Davis      A  SEA       1B   479  130    18   66  72    76     3   1624   457
## 4  Andre Dawson      N  MON       OF   496  141    20   65  78    37    11   5628  1575
## 5   A Galarraga      N  MON       1B   321   87    10   39  42    30     2    396   101
## 6     A Griffin      A  OAK       SS   594  169     4   74  51    35    11   4408  1133
##   Homerc Runsc RBIc Walksc Putouts Assists Errors Salary   logSal
## 1      1    30   29     14     446      33     20     NA       NA
## 2     69   321  414    375     632      43     10    475 2.676694
## 3     63   224  266    263     880      82     14    480 2.681241
## 4    225   828  838    354     200      11      3    500 2.698970
## 5     12    48   46     33     805      40      4     92 1.963788
## 6     19   501  336    194     282     421     25    750 2.875061
round(cor(baseball[, 5:14], use="pair"),2)
##        Atbat Hits Homer Runs  RBI Walks Years Atbatc Hitsc Homerc
## Atbat   1.00 0.97  0.59 0.91 0.82  0.67  0.05   0.24  0.25   0.24
## Hits    0.97 1.00  0.56 0.92 0.81  0.64  0.04   0.23  0.26   0.20
## Homer   0.59 0.56  1.00 0.65 0.86  0.48  0.12   0.22  0.22   0.49
## Runs    0.91 0.92  0.65 1.00 0.80  0.73  0.00   0.19  0.20   0.23
## RBI     0.82 0.81  0.86 0.80 1.00  0.62  0.15   0.29  0.31   0.44
## Walks   0.67 0.64  0.48 0.73 0.62  1.00  0.14   0.28  0.28   0.33
## Years   0.05 0.04  0.12 0.00 0.15  0.14  1.00   0.92  0.90   0.73
## Atbatc  0.24 0.23  0.22 0.19 0.29  0.28  0.92   1.00  1.00   0.80
## Hitsc   0.25 0.26  0.22 0.20 0.31  0.28  0.90   1.00  1.00   0.78
## Homerc  0.24 0.20  0.49 0.23 0.44  0.33  0.73   0.80  0.78   1.00
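The use="pair" argument in the cor() call is partially matched to "pairwise.complete.obs", so each correlation is computed from the rows in which both variables are present. This matters for the baseball data because Salary and logSal contain NA. A minimal sketch of the difference, using a small made-up data frame (the toy values are assumptions chosen only for illustration):

```r
# Toy data with one missing value; the numbers are made up for illustration.
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 4, 5, 4, 5),
                  c = c(1, NA, 2, 5, 4))

# "pair" is partially matched to "pairwise.complete.obs".
r_pair <- cor(toy, use = "pair")
r_full <- cor(toy, use = "pairwise.complete.obs")

# a and b have no missing values, so their pairwise correlation uses all
# five rows, even though row 2 is incomplete for column c.
r_ab_all_rows <- cor(toy$a, toy$b)

# With use = "complete.obs", row 2 is dropped for every pair of variables.
r_complete <- cor(toy, use = "complete.obs")
```

With pairwise deletion, different entries of the matrix can be based on different numbers of rows, which is worth keeping in mind when comparing them.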
Figure 2 shows two ways to graphically display the correlation matrix using the panel.shade() and panel.pie() functions.
vars2 <- c("Assists","Atbat","Errors","Hits","Homer","logSal",
           "Putouts","RBI","Runs","Walks","Years")
corrgram(baseball[,vars2], order=TRUE,
         main="Baseball data PC2/PC1 order",
         lower.panel=panel.shade, upper.panel=panel.pie,
         diag.panel=panel.minmax, text.panel=panel.txt)
Figure 3 shows an eigenvector plot of the correlation matrix. This forms the basis of the orderings of the variables in the corrgram in the next section.
baseball.cor <- cor(baseball[,vars2], use='pair')
baseball.eig <- eigen(baseball.cor)$vectors[,1:2]
e1 <- baseball.eig[,1]
e2 <- baseball.eig[,2]
plot(e1,e2,col='white', xlim=range(e1,e2), ylim=range(e1,e2))
text(e1,e2, rownames(baseball.cor), cex=1)
title("Eigenvector plot of baseball data")
arrows(0, 0, e1, e2, cex=0.5, col="red", length=0.1)
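Friendly (2002) orders the variables by the angle of each point in an eigenvector plot like this one. The sketch below computes such an angular ordering from the first two eigenvectors. It is an illustration under the assumption that the PC ordering follows this scheme; the helper name angle_order is hypothetical and not part of the corrgram package. It uses the mtcars data from base R so that it runs on its own.

```r
# Hypothetical helper: order variables by the angle of (e1, e2), their
# coordinates on the first two eigenvectors of the correlation matrix.
angle_order <- function(cmat) {
  ev <- eigen(cmat)$vectors[, 1:2]
  e1 <- ev[, 1]
  e2 <- ev[, 2]
  # atan(e2/e1) covers the right half-plane; add pi for points with e1 <= 0
  alpha <- ifelse(e1 > 0, atan(e2 / e1), atan(e2 / e1) + pi)
  rownames(cmat)[order(alpha)]
}

cmat <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "drat")])
ord <- angle_order(cmat)
ord
```

Variables whose arrows point in similar directions end up adjacent in the ordering, which is what makes blocks of related variables visible in the corrgram.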
In figure 4a the variables appear in the order given in the data (here, alphabetical). In figure 4b, the variables are sorted according to a principal component ordering, which places similar variables next to each other and makes possible clusters easier to see. As expected, more times at bat is strongly correlated with more hits and more runs.
corrgram(baseball[,vars2], main="Baseball data (alphabetic order)")
corrgram(baseball[,vars2], order=TRUE,
         main="Baseball data (PC order)",
         panel=panel.shade, text.panel=panel.txt)
Figure 5 shows a corrgram for all numeric variables in the dataframe. Non-numeric columns in the data are ignored.
corrgram(baseball, order=TRUE, main="Baseball data (PC order)")
Figure 6 shows a corrgram of automotive data on 74 different models of cars from 1979. There are two obvious groups of variables.
Note that the arrangement is slightly different from Friendly's.
corrgram(auto, order=TRUE, main="Auto data (PC order)")
The inverse of the correlation matrix expresses conditional dependence and independence of the variables.
The variables are sorted in the same order as in figure 4. One example interpretation is: controlling for all other variables, there is still a large correlation between Years and log Salary.
rinv <- function(r){
  # r is a correlation matrix
  # calculate r inverse and scale to correlation matrix
  # Derived from Michael Friendly's SAS code
  ri <- solve(r)
  s <- diag(ri)
  s <- diag(sqrt(1/s))
  ri <- s %*% ri %*% s
  n <- nrow(ri)
  # 2*rep(1,n) is a vector of 2s that R recycles across the n x n matrix
  # of 1s, so this multiplier is all 1s and leaves ri unchanged; the sign
  # is flipped by the caller via -rinv()
  ri <- ri * (2*rep(1,n) - matrix(1, n, n))
  diag(ri) <- 1 # Should already be 1, but could be 1 + epsilon
  colnames(ri) <- rownames(ri) <- rownames(r)
  return(ri)
}
vars7 <- c("Years", "logSal", "Homer", "Putouts", "RBI", "Walks",
           "Runs", "Hits", "Atbat", "Errors", "Assists")
cb <- cor(baseball[,vars7], use="pair")
corrgram(-rinv(cb), main=expression(paste("Baseball data ", R^-1)))
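For reference, the partial correlation between two variables given all the others is the negated, rescaled off-diagonal element of the inverse correlation matrix, -p_ij / sqrt(p_ii * p_jj). The sketch below checks this identity against a regression-residual calculation on the mtcars data from base R. The helper name partial_cor is hypothetical, and the code is an illustration of the sign convention, not a replacement for rinv().

```r
# Hypothetical helper: partial correlations from the inverse of a
# correlation matrix r (each pair conditioned on all other variables).
partial_cor <- function(r) {
  p <- solve(r)
  pc <- -cov2cor(p)   # rescale to unit diagonal, then flip the sign
  diag(pc) <- 1       # by convention, the diagonal is set to 1
  pc
}

vars <- c("mpg", "disp", "hp", "wt")
pc <- partial_cor(cor(mtcars[, vars]))

# Cross-check one entry: correlate the residuals of mpg and wt after
# regressing each on the remaining variables (disp and hp).
res_mpg <- resid(lm(mpg ~ disp + hp, data = mtcars))
res_wt  <- resid(lm(wt ~ disp + hp, data = mtcars))
check <- cor(res_mpg, res_wt)
```

The entry pc["mpg", "wt"] agrees with check up to floating-point error, which confirms that the off-diagonal sign must be flipped exactly once to read the scaled inverse as partial correlations.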