Missing data are very frequently found in datasets. Base R provides a few options to handle them using computations that involve only observed data (
na.rm = TRUE
in functions
mean,
var, ... or
use = "complete.obs" | "na.or.complete" | "pairwise.complete.obs"
in functions
cov,
cor, ...). The base package stats also contains the generic function
na.action
that extracts information on the
NA
action used to create an object.
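As a minimal sketch of the base R options just described, the example below uses a small made-up data frame (the object names `x`, `df` and `fit` are illustrative only). It shows `na.rm`, two values of `use`, and `na.action` applied to a fitted linear model.

```r
# Toy data with missing values (illustrative values only)
x <- c(2, 4, NA, 8)
df <- data.frame(
  a = c(1, 2, 3, 4, 5, NA),
  b = c(2.1, 3.9, 6.2, NA, 9.8, 12.1),
  z = c(5, 3, NA, 1, 0, 2)
)

# Without na.rm the result is NA; with na.rm = TRUE only observed values are used
m_all <- mean(x)
m_obs <- mean(x, na.rm = TRUE)

# "complete.obs" keeps only rows with no NA in any column;
# "pairwise.complete.obs" uses, for each pair, the rows observed for that pair
cc <- cor(df, use = "complete.obs")
pw <- cor(df, use = "pairwise.complete.obs")

# na.action() reports which rows were dropped when the model was fitted
fit <- lm(b ~ a, data = df, na.action = na.omit)
dropped <- na.action(fit)
```

Pairwise deletion uses more of the data than complete-case deletion, but the resulting matrix is assembled from different subsets of rows and is not guaranteed to be positive semi-definite.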
These basic options are complemented by many packages on CRAN, which we structure into main topics:
In addition to the present task view, this
reference website on missing data
might also be helpful.
If you think that we missed some important packages in this list, please contact the maintainer.
Exploration of missing data
-
Manipulation of missing data
is implemented in the packages
sjmisc
and
sjlabelled.
memisc
also provides definable missing values, along with infrastructure for the management of survey data and variable labels.
-
Missing data patterns
can be identified and explored using the packages
mi,
dlookr,
wrangle,
DescTools, and
naniar.
-
Graphics that describe distributions and patterns of missing data
are implemented in
VIM
(which has a Graphical User Interface, VIMGUI, currently archived on CRAN) and
naniar
(which abides by
tidyverse
principles).
-
Tests of the MAR assumption (versus the MCAR assumption)
are implemented in
MissMech
(a nonparametric test).
RBtest
proposes a regression-based approach to test the missing data mechanism.
-
Evaluation
with simulations can be performed using the function
ampute
of
mice.
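The exploration steps listed above can be approximated in base R before turning to a dedicated package. The sketch below is an assumption-laden illustration, not the API of any package named above: it removes values completely at random from simulated data (a simple MCAR mechanism), then counts missing values per variable and tabulates the row-wise missingness patterns.

```r
set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n), y = rnorm(n), z = rnorm(n))

# Simple MCAR amputation: delete values at positions chosen uniformly at random
dat$y[sample(n, 20)] <- NA
dat$z[sample(n, 10)] <- NA

# Number and proportion of missing values per variable
n_miss <- colSums(is.na(dat))
prop_miss <- colMeans(is.na(dat))

# One label per row describing which variables are observed or missing
pattern <- apply(is.na(dat), 1, function(r) {
  paste(ifelse(r, "NA", "obs"), collapse = " ")
})
pattern_counts <- sort(table(pattern), decreasing = TRUE)
```

The packages listed above go further, for example by generating MAR or MNAR mechanisms and by plotting the patterns.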
Likelihood based approaches
-
Methods based on the Expectation Maximization (EM) algorithm
are implemented in
norm
(using the function
em.norm
for multivariate Gaussian data), in
cat
(function
em.cat
for multivariate categorical data), and in
mix
(function
em.mix
for multivariate mixed categorical and continuous data). These packages also implement
Bayesian approaches
(with Imputation and Posterior steps) for the same models (functions
da.XXX for
norm,
cat
and
mix) and can be used to obtain imputed complete datasets or multiple imputations (functions
imp.XXX for
norm,
cat
and
mix), once the model parameters have been estimated.
imputeR
is a Multivariate Expectation-Maximization (EM) based imputation framework that offers several different algorithms, including Lasso, tree-based models or PCA. In addition,
TestDataImputation
implements imputation based on EM estimation (and other simpler imputation methods) that are well suited for dichotomous and polytomous tests with item responses.
-
Full Information Maximum Likelihood
(also known as "direct maximum likelihood" or "raw maximum likelihood") is available in
lavaan,
OpenMx
and
rsem, for handling missing data in structural equation modeling.
-
Bayesian approaches
for handling missing values in model-based clustering with variable selection are available in
VarSelLCM. The package also provides imputation using the posterior mean.
-
Missing values in mixed-effect models and generalized linear models
are supported in the packages
mdmb,
icdGLM
and
JointAI, the last one being based on a Bayesian approach.
brlrmr
also handles MNAR values in the response variable for logistic regression using an EM approach.
ui
implements uncertainty intervals for linear and probit regressions when the outcome is missing not at random.
-
Missing data in item response models
are handled in
TAM,
mirt
and
ltm.
-
Robust
covariance estimation is implemented in the package
GSE. Robust location and scatter estimation and robust multivariate analysis with missing data are implemented in
rrcovNA.
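To make the EM idea from the list above concrete, here is a minimal base R sketch of EM estimation of the mean and covariance of multivariate Gaussian data with missing entries. It is a textbook version written for illustration, not the implementation used in norm or any other package above. The function name `em_mvnorm` and its arguments are hypothetical, and the sketch assumes every row has at least one observed value.

```r
# EM for the mean vector and covariance matrix of a multivariate normal
# sample with missing entries (assumes each row has >= 1 observed value)
em_mvnorm <- function(X, n_iter = 100, tol = 1e-8) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  # Start from observed-data means and a diagonal covariance matrix
  mu <- colMeans(X, na.rm = TRUE)
  Sigma <- diag(apply(X, 2, var, na.rm = TRUE), p)
  for (iter in seq_len(n_iter)) {
    Xhat <- X                 # data with missing cells replaced by expectations
    C <- matrix(0, p, p)      # accumulated conditional covariances
    # E-step: conditional mean and covariance of the missing part of each row
    for (i in seq_len(n)) {
      m <- is.na(X[i, ])
      if (!any(m)) next
      o <- !m
      B <- Sigma[m, o, drop = FALSE] %*% solve(Sigma[o, o, drop = FALSE])
      Xhat[i, m] <- mu[m] + B %*% (X[i, o] - mu[o])
      C[m, m] <- C[m, m] + Sigma[m, m, drop = FALSE] -
        B %*% Sigma[o, m, drop = FALSE]
    }
    # M-step: update the parameters from the expected sufficient statistics
    mu_new <- colMeans(Xhat)
    R <- sweep(Xhat, 2, mu_new)
    Sigma_new <- (crossprod(R) + C) / n
    delta <- max(abs(mu_new - mu), abs(Sigma_new - Sigma))
    mu <- mu_new
    Sigma <- Sigma_new
    if (delta < tol) break
  }
  list(mu = mu, Sigma = Sigma, iterations = iter)
}

# Usage on simulated data; the first column is left fully observed
set.seed(123)
n <- 200
Z <- matrix(rnorm(n * 3), n, 3)
X <- Z %*% matrix(c(1, 0.5, 0.2, 0, 1, 0.3, 0, 0, 1), 3, 3)
X_mis <- X
X_mis[sample(n, 40), 2] <- NA
X_mis[sample(n, 30), 3] <- NA
fit <- em_mvnorm(X_mis)
fit_full <- em_mvnorm(X)
```

The covariance update divides by n, so on complete data the result is the maximum likelihood estimate rather than the unbiased estimate returned by `cov`.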
Single imputation
-
The simplest method for missing data imputation is
imputation by mean
(or median, mode, ...). This approach is available in many packages among which
ForImp,
Hmisc, and
dlookr
that offer various ways to impute all missing instances of a variable with the same value.
-
k-nearest neighbors
is a popular method for missing data imputation that is available in many packages including
DMwR,
impute,
VIM,
GenForImp
and
yaImpute
(with many different methods for kNN imputation, including a CCA based imputation).
wNNSel
implements a kNN based method for imputation in large dimensional datasets.
isotree
uses a similar approach that is based on similarities between samples to impute missing data with isolation forests.
-
hot-deck
imputation is implemented in
hot.deck,
FHDI
and
VIM
(function
hotdeck).
StatMatch
uses hot-deck imputation to impute surveys from an external dataset.
impimp
also uses the notion of "donor" to impute a set of possible values, termed "imprecise imputation".
-
Other regression based imputations
are implemented in
VIM
(linear regression based imputation in the function
regressionImp). In addition,
simputation
is a general package for imputation by any prediction method that can be combined with various regression methods, and works well with the tidyverse.
WaverR
imputes data using a weighted average of several regressions.
iai
tunes optimal imputation based on kNN, trees or SVM.
-
Imputation based on random forests
is implemented in
missForest.
-
Imputation based on copulas
is implemented in
CoImp
and in
sbgcop
(semi-parametric Bayesian copula imputation). The latter supports multiple imputation.
-
PCA/Singular Value Decomposition/matrix completion
is implemented in the package
missMDA
for numerical, categorical and mixed data. Heterogeneous missingness in a high-dimensional PCA is also addressed in
primePCA.
softImpute
contains several methods for iterative matrix completion, as well as
filling,
rsparse
and
denoiseR
for numerical variables,
mimi
that uses low rank assumptions to impute mixed datasets, and
ECLRMC
that performs ensemble correlation-based low-rank matrix completion accounting for correlation among samples. The package
pcaMethods
offers some Bayesian implementation of PCA with missing data.
NIPALS
(based on SVD computation) is implemented in the packages
mixOmics
(for PCA and PLS),
ade4,
nipals
and
plsRglm
(for generalized model PLS). As a generalization,
tensorBF
implements imputation in 3-way tensor data.
ddsPLS
implements a multi-block imputation method based on PLS in a supervised framework.
ROptSpace
proposes a matrix completion method under low-rank assumption and collective matrix factorization for imputation using Bayesian matrix completion for groups of variables (binary, quantitative, Poisson). Imputation for groups is also available in
missMDA
in the function
imputeMFA.
-
Imputation of clustered data
using k-means is implemented in
ClustImpute.
-
Imputation for non-parametric regression by wavelet shrinkage
is implemented in
CVThresh
using solely maximization of the h-likelihood.
-
mi
and
VIM
also provide diagnostic plots to
evaluate the quality of imputation.
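The simplest method in the list above, imputation by mean, can be sketched in a few lines of base R. The helper name `impute_mean` is hypothetical and the code is only an illustration of the idea, not the interface of any package mentioned above.

```r
# Replace missing values in every numeric column by that column's observed mean;
# non-numeric columns are returned unchanged
impute_mean <- function(df) {
  for (j in names(df)) {
    if (is.numeric(df[[j]])) {
      miss <- is.na(df[[j]])
      df[[j]][miss] <- mean(df[[j]], na.rm = TRUE)
    }
  }
  df
}

toy <- data.frame(a = c(1, NA, 3), b = c("u", NA, "w"))
toy_imputed <- impute_mean(toy)
```

Because every missing cell of a variable receives the same value, this approach shrinks the variance of that variable and weakens its correlations with other variables.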
Multiple imputation
Some of the above mentioned packages can also handle multiple imputations.
-
Amelia
implements bootstrap multiple imputation using EM to estimate the parameters; for quantitative data, it imputes assuming a multivariate Gaussian distribution. In addition, AmeliaView is a GUI for
Amelia, available from the
Amelia web page.
NPBayesImputeCat
also implements multiple imputation by joint modelling for categorical variables with a Bayesian approach.
-
mi,
mice
and
smcfcs
implement multiple imputation by Chained Equations.
smcfcs
extends the models covered by the two previous packages.
miceFast
provides an alternative implementation of mice imputation methods using object-oriented programming and C++.
miceMNAR
imputes MNAR responses under a Heckman selection model for use with
mice.
bootImpute
performs bootstrap based imputations and analyses of these imputations to use with
mice
or
smcfcs.
miceRanger
performs multiple imputation by chained equations using random forests.
-
missMDA
implements multiple imputation based on SVD methods.
-
hot.deck
implements hot-deck based multiple imputation.
-
Multilevel imputation
: Multilevel multiple imputation is implemented in
hmi,
jomo,
mice,
miceadds,
micemd,
mitml, and
pan.
-
Qtools
and
miWQS
implement multiple imputation based on quantile regression.
-
lodi
implements the imputation of observed values below the limit of detection (LOD) via censored likelihood multiple imputation (CLMI).
-
BaBooN
implements a Bayesian bootstrap approach for discrete data imputation that is based on Predictive Mean Matching (PMM).
-
accelmissing
provides multiple imputation with the zero-inflated Poisson lognormal model for missing count values in accelerometer data.
In addition,
mitools
provides a generic approach to handle multiple imputation in combination with any imputation method.
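Whatever imputation method is used, the analyses of the m imputed datasets must be combined into a single result. A common way to do this is Rubin's rules, sketched below in base R for a scalar estimate. The function name `pool_rubin` is hypothetical, the sketch omits the degrees-of-freedom correction, and it is not the interface of mitools or any other package above.

```r
# Pool m point estimates and their standard errors with Rubin's rules
pool_rubin <- function(est, se) {
  m <- length(est)
  qbar <- mean(est)                  # pooled point estimate
  W <- mean(se^2)                    # within-imputation variance
  B <- var(est)                      # between-imputation variance
  total <- W + (1 + 1 / m) * B       # total variance
  list(estimate = qbar, se = sqrt(total), within = W, between = B)
}

# Illustrative inputs: estimates and standard errors from three imputed datasets
pooled <- pool_rubin(est = c(1, 2, 3), se = c(1, 1, 1))
```

The between-imputation term is what distinguishes multiple from single imputation: it carries the uncertainty due to the missing values into the final standard error.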
Weighting methods
-
Computation of weights
for observed data to account for unobserved data by
Inverse Probability Weighting (IPW)
is implemented in
ipw. IPW is also used for quantile estimations and boxplots in
IPWboxplot.
-
Doubly Robust Inverse Probability Weighted Augmented GEE Estimator with missing outcome
is implemented in
CRTgeeDR.
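The weighting idea above can be illustrated with base R alone. In this sketch on simulated data, the probability that the outcome is observed is estimated by logistic regression, and each observed case is weighted by the inverse of that probability. All variable names are illustrative, the missingness model is assumed to be correctly specified, and the code does not reproduce the interface of ipw or the other packages above.

```r
set.seed(42)
n <- 2000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# MAR mechanism: the chance of observing y depends only on the observed x
p_obs <- plogis(0.5 + x)
r <- rbinom(n, 1, p_obs)            # r = 1 if y is observed
y_obs <- ifelse(r == 1, y, NA)

# Estimate the observation probabilities and invert them to get weights
fit <- glm(r ~ x, family = binomial)
w <- 1 / fitted(fit)

# Weighted mean over observed cases versus the unweighted complete-case mean
ipw_mean <- weighted.mean(y_obs[r == 1], w[r == 1])
naive_mean <- mean(y_obs, na.rm = TRUE)
```

Here large values of x are over-represented among the observed cases, so the complete-case mean is biased upwards; the weights compensate by giving more influence to observed cases that resemble the unobserved ones.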
Specific types of data
-
Longitudinal data / time series and censored data
: Imputation for time series is implemented in
imputeTS
and
imputePSF. Other packages, such as
forecast,
spacetime,
timeSeries,
xts,
prophet,
stlplus
or
zoo, are dedicated to time series but also contain some (often basic) methods to handle missing data (see also
the TimeSeries task view). To help fill down missing values for time series, the
padr
and
tsibble
packages provide methods for imputing implicit missing values. Imputation of time series based on Dynamic Time Warping is implemented in
DTWBI
for univariate time series and in
DTWUMI
or in
FSMUMI
for multivariate ones.
naniar
also imputes data below the range for exploratory graphical analysis with the function
impute_below.
TAR
implements an estimation of the autoregressive threshold models with Gaussian noise and of positive-valued time series with a Bayesian approach in the presence of missing data.
swgee
implements a probability weighted generalized estimating equations method for longitudinal data with missing observations and measurement error in covariates based on SIMEX.
icenReg
performs imputation for censored responses for interval data.
imputeTestbench
proposes tools to benchmark missing data imputation in univariate time series. On a related topic,
imputeFin
handles imputation of missing values in financial time series using AR models or random walk.
-
Spatial data
: Imputation for spatial data is implemented in
phylin
using interpolation with spatial distance weights or kriging.
gapfill
is dedicated to satellite data. Geostatistical interpolation of data with irregular spatial support is implemented in
rtop
and in
areal
that estimates values for overlapping but incongruent polygon features. Estimation and prediction for spatio-temporal data with missing values is implemented in
StempCens
with a SAEM approach that approximates EM when the E-step does not have an analytic form.
-
Spatio-temporal data
: Imputation for spatio-temporal data is implemented in the package
cutoffR
using different methods such as kNN and SVD and in
CircSpaceTime
for circular data using kriging. Similarly,
reddPrec
imputes missing values in daily precipitation time series across different locations.
-
Graphs/networks
: Imputation for graphs/networks is implemented in the package
dils
to impute missing edges.
PST
provides a framework for analyzing Probabilistic Suffix Trees, including functions for learning and optimizing VLMC (variable length Markov chains) models from sets of individual sequences possibly containing missing values.
missSBM
imputes missing edges in Stochastic Block models and
cassandRa
predicts possible missing links with different stochastic network models.
-
Imputation for contingency tables
is implemented in
lori
that can also be used for the analysis of contingency tables with missing data.
-
Imputation for compositional data (CODA)
is implemented in
robCompositions
(based on kNN or EM approaches) and in
zCompositions
(various imputation methods for zeros, left-censored and missing data).
-
Imputation for diffusion processes
is implemented in
DiffusionRimp
by imputing missing sample paths with Brownian bridges.
-
Imputation for meta-analyses
of binary outcomes is provided in
metasens.
-
experiment
handles missing values in experimental designs, such as randomized experiments with missing covariate and outcome data and matched-pairs designs with missing outcomes.
-
cdparcoord
provides tools to handle missing values in parallel coordinates settings.
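For the time series case in the list above, two basic strategies can be sketched in base R: carrying the last observation forward and linear interpolation between observed points. The helper names `locf` and `interp_linear` are hypothetical and the code is only an illustration, not the interface of imputeTS, zoo or any other package above.

```r
# Last observation carried forward; leading NAs stay missing
locf <- function(x) {
  # Index of the most recent observed value at each position (0 if none yet)
  idx <- cummax(seq_along(x) * (!is.na(x)))
  out <- x
  ok <- idx > 0
  out[ok] <- x[idx[ok]]
  out
}

# Linear interpolation between observed points; values outside the
# observed range stay missing (default behaviour of approx)
interp_linear <- function(x) {
  t <- seq_along(x)
  obs <- !is.na(x)
  approx(t[obs], x[obs], xout = t)$y
}

series <- c(NA, 1, NA, NA, 4, NA)
series_locf <- locf(series)
series_lin <- interp_linear(series)
```

Neither strategy uses seasonality or trend, which is where the dedicated packages listed above add value.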
Specific application fields
-
Genomics
: Imputation for dropout events (i.e., under-sampling of mRNA molecules) in single-cell RNA-Sequencing data is implemented in
DrImpute
and
Rmagic.
RNAseqNet
uses hot-deck imputation to improve RNA-seq network inference with an auxiliary dataset.
-
Epidemiology
:
powerlmm
implements power calculation for time x treatment effects in the presence of
dropouts
and missing data in mixed linear models and
pseval
evaluates principal surrogates in a single clinical trial in the presence of missing counterfactual surrogate responses.
idem
provides missing data imputation with a sensitivity analysis strategy to handle the unobserved functional outcomes not due to death.
sievePH
implements continuous, possibly multivariate, mark-specific hazard ratio with missing values in multivariate marks using an IPW approach.
-
Causal inference
: Causal inference with interactive fixed-effect models is available in
gsynth
with missing values handled by matrix completion.
MatchThem
matches multiply imputed datasets using several matching methods, and provides users with the tools to estimate causal effects in each imputed dataset.
-
Scoring
: Basic methods (mean, median, mode, ...) for imputing missing data in scoring datasets are proposed in
scorecardModelUtils.
-
Preference models
: Missing data in preference models are handled with a
Composite Link
approach that allows for MCAR and MNAR patterns to be taken into account in
prefmod.
-
Health economy
:
missingHE
implements models for health economic evaluations with missing outcome data.
-
Administrative records / Surveys
:
fastLink
provides a Fellegi-Sunter probabilistic record linkage that allows for missing data and the inclusion of auxiliary information.
EditImputeCont
provides imputation methods for continuous microdata under linear constraints with a Bayesian approach.
-
Regression and classification
:
eigenmodel
handles missing values in regression models for symmetric relational data.
randomForest
and
StratifiedRF
handle missing values in predictors for random forest.
mipred
handles prediction in generalized linear models and Cox prediction models with multiple imputation of predictors and
misaem
handles missing data in logistic regression.
psfmi
provides a framework for model selection for various linear models in multiply imputed datasets.
naivebayes
provides an efficient implementation of the naive Bayes classifier in the presence of missing data.
plsRbeta
implements PLS for beta regression models with missing data in the predictors.
-
Clustering
:
biclustermd
handles missing data in biclustering.
RMixtComp
fits various mixture models in the presence of missing data.
-
Tests
for two-sample paired missing data are implemented in
robustrank.
-
robustrao
computes the Rao-Stirling diversity index (a well-established bibliometric indicator to measure the interdisciplinarity of scientific publications) with data containing uncategorized references.