As a journal editor, I often receive studies in which the investigators fail to describe, analyse, or even acknowledge missing data. This is frustrating, as missing data are often of the utmost importance. Conclusions may (and do) change when missing data are accounted for. A few do not even seem to appreciate that in conventional regression, only rows with complete data are included.
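To make that last point concrete, here is a minimal sketch using a toy data frame (`d` and its values are invented for illustration). With R's default `na.action`, `lm()` drops any row that has a missing value in a model variable, and it does so without a warning.

```r
# Toy data, invented for illustration: 6 rows, 2 of which have a missing value
d = data.frame(
  y = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8),
  x = c(1, 2, NA, 4, 5, 6),
  z = c(0, 1, 0, NA, 1, 0)
)

fit = lm(y ~ x + z, data = d)

nrow(d)                 # rows in the data
#> [1] 6
nobs(fit)               # rows actually used to fit the model
#> [1] 4
sum(complete.cases(d))  # rows with no missing values
#> [1] 4
```

Comparing `nobs()` with `nrow()` is a quick way to see how many patients a model has quietly excluded.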
These are the five steps to ensuring missing data are correctly identified and appropriately dealt with:
finalfit includes a number of functions to help with this.
But first there are some terms which are easy to mix up. These are important as they describe the mechanism of missingness and this determines how you can handle the missing data.
Missing completely at random (MCAR): as the name says, values are randomly missing from your dataset. Missing data values do not relate to any other data in the dataset and there is no pattern to the actual values of the missing data themselves.
For instance, when smoking status is not recorded in a random subset of patients.
This is easy to handle, but unfortunately, data are almost never missing completely at random.
Missing at random (MAR): this term is confusing and would be better stated as missing conditionally at random. Here, missing data do have a relationship with other variables in the dataset. However, the actual values that are missing are random.
For example, smoking status is not documented in female patients because the doctor was too shy to ask. Yes ok, not that realistic!
Missing not at random (MNAR): the pattern of missingness is related to other variables in the dataset, but in addition, the values of the missing data are not random.
For example, when smoking status is not recorded in patients admitted as an emergency, who are also more likely to have worse outcomes from surgery.
Missing not at random data are important, can alter your conclusions, and are the most difficult to diagnose and handle. They can only be detected by collecting and examining some of the missing data. This is often difficult or impossible to do.
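A small simulation may help show why the mechanism matters. This is only a sketch under invented assumptions: `x` is a hypothetical continuous measurement, the 30% MCAR rate and the logistic missingness model are arbitrary choices, and the comparison is only possible because the true values are known in a simulation.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n = 10000

# Hypothetical continuous variable, e.g. a lab value
x = rnorm(n, mean = 50, sd = 10)

# MCAR: every value has the same 30% chance of being missing
x_mcar = ifelse(runif(n) < 0.3, NA, x)

# MNAR: the chance of being missing rises with the value itself
p_miss = plogis((x - 50) / 5)
x_mnar = ifelse(runif(n) < p_miss, NA, x)

mean(x)                     # true mean
mean(x_mcar, na.rm = TRUE)  # close to the true mean
mean(x_mnar, na.rm = TRUE)  # pulled downwards, as high values are lost
```

In real data only the last two lines are available, which is why MNAR cannot be diagnosed from the observed values alone.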
How you deal with missing data is dependent on the type of missingness. Once you know this, then you can sort it.
ff_glimpse
While clearly obvious, this step is often ignored in the rush to get results. The first step in any analysis is robust data cleaning and coding. Lots of packages have a glimpse-type function and finalfit is no different. This function has three specific goals:
Using the colon_s cancer dataset that comes with finalfit, we are interested in exploring the association between a cancer obstructing the bowel and 5-year survival, accounting for other patient and disease characteristics.
For demonstration purposes, we will add random MCAR and MAR smoking variables to the dataset.
library(finalfit)

# Create some extra missing data
## Smoking missing completely at random
set.seed(1)
colon_s$smoking_mcar =
  sample(c("Smoker", "Non-smoker", NA),
    dim(colon_s)[1], replace=TRUE,
    prob = c(0.2, 0.7, 0.1)) %>%
  factor()
Hmisc::label(colon_s$smoking_mcar) = "Smoking (MCAR)"

## Smoking missing conditional on patient sex
colon_s$smoking_mar[colon_s$sex.factor == "Female"] =
  sample(c("Smoker", "Non-smoker", NA),
    sum(colon_s$sex.factor == "Female"),
    replace = TRUE,
    prob = c(0.1, 0.5, 0.4))
colon_s$smoking_mar[colon_s$sex.factor == "Male"] =
  sample(c("Smoker", "Non-smoker", NA),
    sum(colon_s$sex.factor == "Male"),
    replace=TRUE, prob = c(0.15, 0.75, 0.1))
colon_s$smoking_mar = factor(colon_s$smoking_mar)
Hmisc::label(colon_s$smoking_mar) = "Smoking (MAR)"

# Examine with ff_glimpse
explanatory = c("age", "sex.factor",
  "nodes", "obstruct.factor",
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"
colon_s %>%
  ff_glimpse(dependent, explanatory)
#> Continuous
#> label var_type n missing_n missing_percent mean sd
#> age Age (years) <S3: labelled> 929 0 0.0 59.8 11.9
#> nodes nodes <dbl> 911 18 1.9 3.7 3.6
#> min quartile_25 median quartile_75 max
#> age 18.0 53.0 61.0 69.0 85.0
#> nodes 0.0 1.0 2.0 5.0 33.0
#>
#> Categorical
#> label var_type n missing_n missing_percent
#> sex.factor Sex <fct> 929 0 0.0
#> obstruct.factor Obstruction <fct> 908 21 2.3
#> mort_5yr Mortality 5 year <fct> 915 14 1.5
#> smoking_mcar Smoking (MCAR) <fct> 828 101 10.9
#> smoking_mar Smoking (MAR) <fct> 719 210 22.6
#> levels_n levels levels_count
#> sex.factor 2 "Female", "Male" 445, 484
#> obstruct.factor 2 "No", "Yes", "(Missing)" 732, 176, 21
#> mort_5yr 2 "Alive", "Died", "(Missing)" 511, 404, 14
#> smoking_mcar 2 "Non-smoker", "Smoker", "(Missing)" 645, 183, 101
#> smoking_mar 2 "Non-smoker", "Smoker", "(Missing)" 591, 128, 210
#> levels_percent
#> sex.factor 48, 52
#> obstruct.factor 78.8, 18.9, 2.3
#> mort_5yr 55.0, 43.5, 1.5
#> smoking_mcar 69, 20, 11
#> smoking_mar 64, 14, 23
The function summarises a data frame or tibble by numeric (continuous) variables and factor (discrete) variables. The dependent and explanatory are for convenience. Pass either or neither e.g. to summarise data frame or tibble:
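As a sketch of the two calling styles just described: the first call summarises every column, the second only a chosen subset. The particular variables in the second call are my own choice for illustration.

```r
library(finalfit)

# Summarise every variable in the data frame
colon_s %>%
  ff_glimpse()

# Or pass only a set of explanatory variables
colon_s %>%
  ff_glimpse(explanatory = c("age", "sex.factor", "nodes"))
```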
It doesn’t present well if you have factors with lots of levels, so you may want to remove these.
Use this to check that the variables are all assigned and behaving as expected. The proportion of missing data can be seen, e.g. smoking_mar has 23% missing data.
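If you want to cross-check the missing counts by hand, a base R sketch such as the following should agree with the missing_n and missing_percent columns reported by ff_glimpse. It uses only columns that ship with colon_s; the object names `vars`, `missing_n` and `missing_percent` are just illustrative.

```r
library(finalfit)

# Columns of colon_s that the ff_glimpse output above reports as incomplete
vars = c("nodes", "obstruct.factor", "mort_5yr")

# Count and percentage of missing values per variable
missing_n = colSums(is.na(colon_s[, vars]))
missing_percent = round(100 * missing_n / nrow(colon_s), 1)

data.frame(missing_n, missing_percent)
```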
missing_plot
In detecting patterns of missingness, this plot is useful. Row number is on the x-axis and all included variables are on the y-axis. Associations between missingness and observations can be easily seen, as can relationships of missingness between variables.
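A minimal sketch of how the plot might be produced, assuming the same dependent and explanatory convention used by the other finalfit functions in this post. The choice of variables here is illustrative.

```r
library(finalfit)

explanatory = c("age", "sex.factor",
  "nodes", "obstruct.factor")
dependent = "mort_5yr"

# Plot missingness for the selected variables only
colon_s %>%
  missing_plot(dependent, explanatory)

# Or for every variable in the data frame
colon_s %>%
  missing_plot()
```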
It was only when writing this post that I discovered the amazing package, naniar. This package is recommended and provides lots of great visualisations for missing data.
missing_pattern
missing_pattern simply wraps mice::md.pattern using finalfit grammar. This produces a table and a plot showing the pattern of missingness between variables.
explanatory = c("age", "sex.factor",
  "obstruct.factor",
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"
colon_s %>%
  missing_pattern(dependent, explanatory)
#> age sex.factor mort_5yr obstruct.factor smoking_mcar smoking_mar
#> 617 1 1 1 1 1 1 0
#> 181 1 1 1 1 1 0 1
#> 74 1 1 1 1 0 1 1
#> 22 1 1 1 1 0 0 2
#> 16 1 1 1 0 1 1 1
#> 2 1 1 1 0 1 0 2
#> 2 1 1 1 0 0 1 2
#> 1 1 1 1 0 0 0 3
#> 8 1 1 0 1 1 1 1
#> 4 1 1 0 1 1 0 2
#> 2 1 1 0 1 0 1 2
#> 0 0 14 21 101 210 346
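This allows us to look for patterns of missingness between variables. Each row of the table is one pattern (1 = observed, 0 = missing): the first number is how many rows of data show that pattern and the last number is how many variables are missing in it. The bottom row gives the number of missing values per variable. There are 11 patterns in these data. The number and pattern of missingness help us to determine the likelihood of it being random rather than systematic.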
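To see what the pattern table is counting, here is a base R sketch on a toy data frame (`d` is invented for illustration, and this is not how mice computes the table internally). Each row is reduced to a string of 1s and 0s, and the distinct strings are the patterns.

```r
# Toy data, invented for illustration
d = data.frame(
  a = c(1, 2, NA, 4, NA, 6),
  b = c("x", NA, "y", "z", NA, "x"),
  c = c(1, 2, 3, 4, 5, 6)
)

# One string per row: 1 = observed, 0 = missing
pattern = apply(!is.na(d), 1,
  function(r) paste(as.integer(r), collapse = ""))

# Number of rows showing each pattern
table(pattern)

# Number of distinct patterns
length(unique(pattern))
#> [1] 4

# Missing values per variable
colSums(is.na(d))
```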
Table 1 in a healthcare study is often a demographics table of an “explanatory variable of interest” against other explanatory variables/confounders. Do not silently drop missing values in this table. It is easy to do this correctly with summary_factorlist. This function provides a useful summary of a dependent variable against explanatory variables. Despite its name, continuous variables are handled nicely.
na_include=TRUE ensures missing data from the explanatory variables (but not dependent) are included. Note that any p-values are generated across missing groups as well, so run a second time with na_include=FALSE if you wish a hypothesis test only over observed data.
# Explanatory or confounding variables
explanatory = c("age", "sex.factor",
  "nodes",
  "smoking_mcar", "smoking_mar")

# Explanatory variable of interest
dependent = "obstruct.factor" # Bowel obstruction
colon_s %>%
  summary_factorlist(dependent, explanatory,
    na_include=TRUE, p=TRUE)
#> label levels No Yes p
#> 1 Age (years) Mean (SD) 60.2 (11.5) 57.3 (13.3) 0.014
#> 3 Sex Female 346 (79.2) 91 (20.8) 0.290
#> 4 Male 386 (82.0) 85 (18.0)
#> 2 nodes Mean (SD) 3.7 (3.7) 3.5 (3.2) 0.774
#> 8 Smoking (MCAR) Non-smoker 500 (79.4) 130 (20.6) 0.173
#> 9 Smoker 154 (85.6) 26 (14.4)
#> 10 Missing 78 (79.6) 20 (20.4)
#> 5 Smoking (MAR) Non-smoker 467 (80.9) 110 (19.1) 0.056
#> 6 Smoker 91 (73.4) 33 (26.6)
#> 7 Missing 174 (84.1) 33 (15.9)
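The two runs suggested earlier could look like the sketch below. To keep it self-contained it uses differ.factor, a column of colon_s that I am assuming has some missing values, rather than the simulated smoking variables. How p-values treat the missing group may depend on the installed finalfit version, so it is worth checking ?summary_factorlist.

```r
library(finalfit)

explanatory = c("age", "sex.factor", "differ.factor")
dependent = "obstruct.factor"

# Missing levels shown as their own row
colon_s %>%
  summary_factorlist(dependent, explanatory,
    na_include = TRUE, p = TRUE)

# Observed data only
colon_s %>%
  summary_factorlist(dependent, explanatory,
    na_include = FALSE, p = TRUE)
```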
missing_pairs | missing_compare
In deciding whether data are MCAR or MAR, one approach is to explore patterns of missingness between levels of included variables. This is particularly important (I would say absolutely required) for a primary outcome measure / dependent variable.
Take for example “death”. When that outcome is missing it is often for a particular reason. For example, perhaps patients undergoing emergency surgery were less likely to have complete records compared with those undergoing planned surgery. And of course, death is more likely after emergency surgery.
missing_pairs uses functions from the excellent GGally package. It produces pairs plots to show relationships between missing values and observed values in all variables.
explanatory = c("age", "sex.factor",
  "nodes", "obstruct.factor",
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"
colon_s %>%
  missing_pairs(dependent, explanatory)
For continuous variables (age and nodes), the distributions of observed and missing data can be visually compared. Is there a difference between age and mortality above?
For discrete data, counts are presented by default. It is often easier to compare proportions:
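A sketch of the proportions version, using the position argument of missing_pairs. To stay self-contained it leaves out the simulated smoking variables; add them back in if you have created them as above.

```r
library(finalfit)

explanatory = c("age", "sex.factor",
  "nodes", "obstruct.factor")
dependent = "mort_5yr"

# position = "fill" scales each bar to 100%, so proportions can be compared
colon_s %>%
  missing_pairs(dependent, explanatory, position = "fill")
```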