CRAN Task View: Statistics for the Social Sciences

Maintainer: John Fox
Contact: jfox at mcmaster.ca
Version: 2014-12-18

Social scientists use a wide range of statistical methods. To make the burden carried by this task view lighter, I have suppressed detail in some areas that are well covered by related task views (e.g., the Spatial task view for spatial statistics), and have pointed to those task views instead.

Most statistical data analysis in the social sciences is covered by the facilities in the base and recommended packages, which are part of the standard R distribution. In the package descriptions below, I identify base and recommended packages on first mention; packages that are not specifically identified as "R-base" or "recommended" are contributed packages.

One area of central interest to social scientists that I do not cover here is statistical graphics, even though this is one of the great strengths of R: Basic R graphics, trellis graphics (in the recommended lattice package), dynamic 3D graphs (via the rgl package), and the many packages that include facilities for various statistical graphs are just too extensive to detail here. Fortunately, a Graphics task view is currently in preparation.

If I have omitted something of importance, or if a new package or function should be mentioned here, please let me know.

Linear and Generalized Linear Models:

Univariate and multivariate linear models are fit by the lm function, generalized linear models by the glm function, both in the R-base stats package. Beyond summary and plot methods for lm and glm objects, there is a wide array of functions that support these objects:

• The generic anova function in the stats package constructs sequential analysis of variance and analysis of deviance tables, and can compute F and likelihood-ratio tests for nested models. (It is typical for other classes of statistical models in R to have anova methods as well.) The generic Anova function in the car package (associated with Fox, An R and S-PLUS Companion to Applied Regression, Sage, 2002) constructs so-called "Type-II" and "Type-III" tests for linear and generalized linear models.
• F and Wald tests for a variety of hypotheses are available from the coeftest and waldtest functions in the lmtest package, and the linear.hypothesis function in the car package. All of these functions permit the use of heteroscedasticity and heteroscedasticity/autocorrelation-consistent covariance matrices, as computed, e.g., by functions in the sandwich and car packages. Also see the glh.test function in the gmodels package. Nonlinear functions of parameters can be tested via the delta.method function in the alr3 package (associated with Weisberg, Applied Linear Regression, 3rd Ed., Wiley, 2005). The multcomp package includes functions for multiple comparisons. The vuong function in the pscl package tests non-nested hypotheses for generalized linear and some other models. Also see the rms package for tests on linear and generalized linear models.
• A basic R installation has excellent facilities for linear and generalized linear model "diagnostics," including, for example, hat-values and deletion statistics such as studentized residuals and Cook's distances (hatvalues, rstudent, and cooks.distance, all in the stats package). These are augmented by other packages: several functions in the car package, which emphasizes graphical methods, e.g., cr.plots for component-plus-residual plots and av.plots for added-variable plots, in addition to numerical diagnostics, such as vif for (generalized) variance-inflation factors; the dr package for dimension reduction in regression, including SIR, SAVE, and pHd; and the lmtest package, which implements a wide variety of tests (e.g., for heteroscedasticity, nonlinearity, and autocorrelation). More diagnostic methods, e.g., for inverse-response plots, may be found in the alr3 package. The forward package implements diagnostics based on a "forward search" (Atkinson and Riani, Robust Diagnostic Regression Analysis, Springer, 2000). Other collinearity diagnostics are in the perturb package. Diagnostics may also be found in the rms package.
• Several packages contain functions that are useful for interpreting linear and generalized linear models that have been fit to data: The qvcalc package computes "quasi variances" for factors in linear and generalized linear models (and more generally). The effects package constructs effect displays, including, e.g., "adjusted means," for linear and generalized linear models. The Zelig package (see under "Collections") creates displays for many kinds of statistical models.
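
As a rough illustration of the base facilities described above, the following sketch fits two nested linear models and a logistic regression, and then extracts a few standard diagnostics. The mtcars data set and the particular predictors are assumptions made here purely for illustration; only functions in the stats package are used.

```r
# Illustrative sketch: the data set (mtcars) and predictors are assumptions.

# Two nested linear models
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

# Incremental F test comparing the nested models
ftest <- anova(fit1, fit2)
print(ftest)

# Logistic regression with a sequential analysis-of-deviance table
# (likelihood-ratio chi-square tests)
gfit <- glm(am ~ wt, family = binomial, data = mtcars)
dev_table <- anova(gfit, test = "Chisq")
print(dev_table)

# Numerical diagnostics for the larger linear model
h  <- hatvalues(fit2)        # leverages
rs <- rstudent(fit2)         # studentized residuals
cd <- cooks.distance(fit2)   # Cook's distances

# The three cases with the largest Cook's distances
print(head(sort(cd, decreasing = TRUE), 3))
```

The hat-values sum to the number of estimated coefficients (here three), which provides a quick sanity check on the fit.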

Analysis of Categorical and Count Data:

Binomial logit and probit models, as well as Poisson-regression and loglinear models for contingency tables (including models for "over-dispersed" binomial and Poisson data), can be fit with the glm function in the stats package. For over-dispersed data, see also the aod package and the glm.nb function in the recommended MASS package (associated with Venables and Ripley, Modern Applied Statistics with S, Fourth Ed., Springer, 2002), which fits negative-binomial GLMs. The multinomial logit model is fit by the multinom function in the recommended nnet package, and ordered logit and probit models by the polr function in the MASS package. Also see the MNP package for the multinomial probit model, and multinomRob for the analysis of overdispersed multinomial data.
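
The following sketch shows how the model-fitting functions just mentioned might be called. The data sets (mtcars from the datasets package, quine and housing from MASS) and the model formulas are assumptions chosen for illustration, not recommendations.

```r
library(MASS)   # recommended package: glm.nb, polr, and the quine and housing data
library(nnet)   # recommended package: multinom

# Binomial logit model for a 0/1 response (illustrative choice of data)
logit_fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Over-dispersed counts: quasi-Poisson GLM with an estimated dispersion parameter
qp_fit <- glm(Days ~ Eth + Sex + Age + Lrn, family = quasipoisson, data = quine)
dispersion <- summary(qp_fit)$dispersion
print(dispersion)   # values well above 1 indicate over-dispersion

# Negative-binomial GLM for the same counts
nb_fit <- glm.nb(Days ~ Eth + Sex + Age + Lrn, data = quine)
print(nb_fit$theta)

# Ordered logit model for a three-level ordered response (grouped data with weights)
ord_fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

# Multinomial logit model for the same response, ignoring the ordering
mn_fit <- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                   data = housing, trace = FALSE)
print(coef(mn_fit))
```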

There are other noteworthy facilities for analyzing categorical and count data:

• The table function in the R-base base package and the xtabs and ftable functions in the stats package construct contingency tables.
• The chisq.test and fisher.test functions in the stats package may be used to test for independence in two-way contingency tables.
• The loglin function in the stats package fits hierarchical loglinear models to contingency tables by iterative proportional fitting; the loglm function in the MASS package provides a formula-based front end to loglin.
• Also see the brglm package for bias-reduction in binomial-response GLMs (useful, e.g., in cases of complete separation); the exactLoglinTest package for exact tests of loglinear models; the clogit function in the survival package for conditional logistic regression; and the vcd package for graphical displays of categorical data.
• The gnm package estimates generalized nonlinear models, and can be used, e.g., to fit certain specialized models to mobility tables.
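
A minimal sketch of the table-building and testing functions listed above follows. It uses the UCBAdmissions table from the datasets package, which is an assumption made for illustration; the loglinear model shown (all two-way associations, no three-way interaction) is likewise just one possible choice.

```r
# Illustrative sketch: UCBAdmissions is a 2 x 2 x 6 table (Admit, Gender, Dept).

# Convert to a data frame of counts and build a two-way table with xtabs
adm <- as.data.frame(UCBAdmissions)
two_way <- xtabs(Freq ~ Gender + Admit, data = adm)
print(two_way)

# A "flat" display of the full three-way table
print(ftable(UCBAdmissions, row.vars = c("Dept", "Gender")))

# Tests of independence in the two-way table
chi <- chisq.test(two_way)
fis <- fisher.test(two_way)
print(chi)
print(fis)

# Loglinear model with all two-way associations, fit by
# iterative proportional fitting
ll <- loglin(UCBAdmissions,
             margin = list(c(1, 2), c(1, 3), c(2, 3)),
             fit = TRUE, print = FALSE)
print(c(G2 = ll$lrt, df = ll$df))
```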

Other Regression Models:

It is possible to fit a very wide variety of regression models with the facilities provided by the base and recommended packages, and a much wider variety of models with contributed packages:

• Nonlinear regression: The nls function in the stats package fits nonlinear models by least-squares.
• Generalized least-squares regression and time-series regression: The gls function in the recommended nlme package fits models by generalized least squares. The lm function can also fit weighted least-squares regressions. Also see the dynlm package, which allows lm to handle time-series data structures, and the dyn package, which extends this capability to glm and other regression functions that are sufficiently similar to lm in their internal structure.
• Mixed-effects models: The recommended nlme package, associated with Pinheiro and Bates, Mixed-Effects Models in S and S-PLUS (Springer, 2000), fits linear and nonlinear mixed-effects models, commonly used in the social sciences for hierarchical and longitudinal data. Generalized linear mixed-effects models may be fit by the glmmPQL function in the MASS package, and by the glmer function in the lme4 package (which largely supersedes nlme for linear mixed models). Also see the lmeSplines and lmm packages.
• Generalized estimating equations: The gee and geepack packages fit marginal models by generalized estimating equations.
• Nonparametric regression analysis: This is one of the conspicuous strengths of R. A standard R installation includes several functions for smoothing scatterplots, including loess.smooth and smooth.spline, both in the stats package. The loess function in the stats package fits simple and multiple-regression models by local polynomial regression. Generalized additive models are covered by several packages, including the recommended mgcv package, and the gam package, the latter associated with Hastie and Tibshirani, Generalized Additive Models (Chapman and Hall, 1990). Some other noteworthy contributed packages in this area are gss, which fits spline regressions, locfit, for local-polynomial regression (and also density estimation) (Loader, Local Regression and Likelihood, Springer, 1999), sm, for a variety of smoothing techniques, including for regression (Bowman and Azzalini, Applied Smoothing Techniques for Data Analysis, Oxford, 1997), and acepack for ACE (alternating conditional expectations) and AVAS (additivity and variance stabilization) nonparametric transformation of the response and explanatory variables in regression.
• Robust regression: The rlm function fits linear models by M-estimation and lqs computes bounded-influence estimators; both are in the MASS package. (The cov.rob function in the same package computes a robust covariance-matrix estimator.) Also see the quantreg package, which computes linear, nonlinear, and nonparametric quantile regressions; lmrob in robustbase and lmRob in robust for MM estimation.
• Structural-equation models: The sem package fits general (i.e., latent-variable) SEMs by FIML, and structural equations in observed-variable models by 2SLS. Categorical variables in SEMs can be accommodated via the polycor package. The systemfit package implements a wider variety of estimators for observed-variables models, including nonlinear simultaneous-equations models. See also the pls package, for partial least-squares estimation, and the gR task view for graphical models.
• Selection bias and censored regression: Censored regression models, such as the tobit model, can be fit by the survreg function in the recommended survival package. The rq function in the quantreg package can estimate censored quantile-regression models. The hurdle and zeroinfl functions in the pscl package fit hurdle and zero-inflated Poisson and negative-binomial models to count data. The heckit function in the micEcon package implements two-step Heckman estimators to correct for sample-selection bias. Also see under Survival Analysis below.
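
To make several of the entries above concrete, here is a brief sketch that uses only base and recommended packages. The data sets (Puromycin, cars, and stackloss from datasets; Ovary and Orthodont from nlme; tobin from survival), the formulas, and the starting values are assumptions made for illustration.

```r
library(nlme)      # recommended: gls, lme, and the Ovary and Orthodont data
library(MASS)      # recommended: rlm
library(survival)  # recommended: survreg and the tobin data

# Nonlinear least squares: Michaelis-Menten model (starting values are assumed)
nls_fit <- nls(rate ~ Vm * conc / (K + conc),
               data = Puromycin, subset = state == "treated",
               start = c(Vm = 200, K = 0.05))

# Generalized least squares with AR(1) errors within each mare
gls_fit <- gls(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time),
               data = Ovary,
               correlation = corAR1(form = ~ 1 | Mare))

# Linear mixed-effects model with a random intercept for each subject
lme_fit <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# Local polynomial regression
lo_fit <- loess(dist ~ speed, data = cars)

# Robust regression by M-estimation
rlm_fit <- rlm(stack.loss ~ ., data = stackloss)

# Tobit model: a Gaussian regression with the response left-censored at zero
tobit_fit <- survreg(Surv(durable, durable > 0, type = "left") ~ age + quant,
                     data = tobin, dist = "gaussian")

print(coef(nls_fit))
print(fixef(lme_fit))
print(coef(tobit_fit))
```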

Other Statistical Methods:

Here is a brief survey of implementations in R of other statistical methods commonly used by social scientists:

• Survival (Event-History) Analysis: There is an extensive implementation of methods of survival analysis in the recommended survival package, which is associated with Therneau and Grambsch, Modeling Survival Data (Springer, 2000). Also see the eha, survrec, frailtypack, and rms packages.
• "Dimensional" Analysis: Exploratory maximum-likelihood factor analysis is implemented in the factanal function in the stats package, which also provides for varimax and promax factor rotation. (Confirmatory factor-analysis models can be fit with the sem package.) Additional rotations are available through functions in the GPArotation package. The prcomp and princomp functions in the stats package perform principal-components analysis. The cmdscale function in the stats package performs metric multidimensional scaling, while the isoMDS and sammon functions in the MASS package perform non-metric multidimensional scaling. For methods of cluster analysis and mixtures see the Cluster task view. The BradleyTerry2 package fits the Bradley-Terry model for paired comparisons. The ltm package fits Rasch and other item-response models to binary items. The irr package contains functions for assessing inter-rater reliability; also see the psy package.
• Other Multivariate Statistics: See the Multivariate task view, which includes information on graphs for visualizing multivariate data.
• Missing Data: A variety of packages implement methods for handling missing data by multiple imputation, including the mix and pan packages, associated with Schafer, Analysis of Incomplete Multivariate Data (Chapman and Hall, 1997), and the mice and mitools packages (the latter for drawing inferences from multiply imputed data sets). There are also some facilities for missing-data imputation in the general Hmisc package, which is described below, under "Collections".
• Bootstrapping and Other Resampling Methods: The recommended package boot, associated with Davison and Hinkley, Bootstrap Methods and Their Application (Cambridge, 1997), has excellent facilities for bootstrapping and some related methods. Also notable is the bootstrap package, associated with Efron and Tibshirani, An Introduction to the Bootstrap (Chapman and Hall, 1993), which has functions for bootstrapping and jackknifing.
• Model Selection: The step function in the stats package and the more broadly applicable stepAIC function in the MASS package perform forward, backward, and forward-backward stepwise selection for a variety of statistical models. The regsubsets function in the leaps package performs all-subsets regression. The BMA package performs Bayesian model averaging. Beyond these, see the MachineLearning task view.
• Social Network Analysis: There are several packages useful for social network analysis, including sna for sociometric analysis of networks (e.g., blockmodeling), network for manipulating and displaying network objects, and latentnet for latent position and cluster models for networks.
• Bayesian Statistical Methods: Because of its easy programmability, R is a natural environment within which to implement and use Bayesian methods, and there are many packages that provide such methods, including interfaces to external Bayesian software, such as BUGS. For details, see the Bayesian task view.
• Spatial Statistics: In addition to the recommended spatial package, see the Spatial task view for an extensive list of functions and packages for spatial data analysis.
• Time-Series Analysis: Beyond time-series regression (see generalized least-squares regression, above), R has very extensive facilities for time-series analysis, both in the standard R distribution and in contributed packages; for details, see the Econometrics and Finance task views.
• Surveys: The sampling package includes functions for selecting survey samples; the survey package includes functions for the analysis of data from complex sample surveys, among them functions for fitting linear and generalized linear models.
• Meta Analysis: See the meta and rmeta packages.
• Propensity Scores and Matching: See the Matching, MatchIt, and optmatch packages.
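
The sketch below illustrates a few of the base and recommended facilities surveyed above: a Cox regression, factor analysis, principal components, metric multidimensional scaling, a simple bootstrap, and stepwise selection. All data sets (lung from survival; ability.cov, USArrests, eurodist, and mtcars from datasets), the bootstrapped statistic, and the candidate models are assumptions chosen for illustration.

```r
library(survival)  # recommended: coxph and the lung data
library(boot)      # recommended: boot, boot.ci

# Survival analysis: Cox proportional-hazards regression
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Maximum-likelihood factor analysis from a covariance matrix, promax rotation
fa <- factanal(factors = 2, covmat = ability.cov, rotation = "promax")

# Principal components on standardized variables
pca <- prcomp(USArrests, scale. = TRUE)

# Metric multidimensional scaling of road distances between cities
mds <- cmdscale(eurodist, k = 2)

# Nonparametric bootstrap of a regression slope (case resampling)
slope <- function(d, i) coef(lm(mpg ~ wt, data = d[i, ]))[["wt"]]
set.seed(123)  # arbitrary seed, for reproducibility only
b  <- boot(mtcars, statistic = slope, R = 200)
ci <- boot.ci(b, type = "perc")

# Backward stepwise selection by AIC
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
sel  <- step(full, direction = "backward", trace = 0)

print(coef(cox_fit))
print(loadings(fa))
print(summary(pca))
print(ci)
print(formula(sel))
```

A larger number of bootstrap replications than the R = 200 used here would usually be preferable for confidence intervals; the small value simply keeps the example fast.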

Collections of Functions:

There are some packages that are so heterogeneous that they are difficult to classify, yet contain functions (typically in multiple domains) that are potentially of interest to social scientists:

• I have already made several references to the recommended MASS package, which is associated with Venables and Ripley's Modern Applied Statistics with S. Other recommended packages associated with this book are nnet, for fitting neural networks (but also, as mentioned, multinomial logistic-regression models); spatial for spatial statistics; and class, which contains functions for classification.
• The Hmisc and rms packages (both mentioned above), associated with Harrell, Regression Modeling Strategies (Springer, 2001), provide functions for data manipulation, linear models, logistic-regression models, and survival analysis, many of them "front ends" to or modifications of other facilities in R.
• The Zelig package integrates a wide array of statistical models of interest to social scientists (see the Zelig web site for details).