Exploring Data Sets

Daniel Lüdecke


Tidying up, transforming and exploring data is an important part of data analysis, and you can manage many common tasks in this process with the tidyverse or related packages. The sjmisc-package fits into this workflow, especially when you work with labelled data, because it offers functions for data transformation and labelled data utility functions. This vignette describes typical steps when beginning with data exploration.

The examples are based on data from the EUROFAMCARE project, a survey on the situation of family carers of older people in Europe. The sample data set efc is part of this package. Let us see how the family carer’s gender and subjective perception of negative impact of care as well as the cared-for person’s dependency are associated with the family carer’s quality of life.
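The code in the following sections relies on a few attached packages. A minimal setup sketch, assuming sjmisc and dplyr are installed (tidyr and purrr are only called with the `::` prefix later on, so they need to be installed but not attached):

```r
library(sjmisc)  # rec(), frq(), find_var(), flat_table(), spread_coef()
library(dplyr)   # %>%, select(), group_by(), arrange(), mutate()

# load the sample data set shipped with sjmisc
data(efc)

# quick look at the size of the data set
dim(efc)
```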


Find variables in a data frame

Next, let’s look at the distribution of gender by the cared-for person’s dependency. To compute cross tables, you can use flat_table(). It requires the data as first argument, followed by any number of variable names.

But first, we need to know the name of the dependency-variable. This is where find_var() comes into play. It searches for variables in a data frame by

  1. variable names,
  2. variable labels,
  3. value labels
  4. or any combination of these.

By default, it searches variable names and variable labels. The function also supports regex-patterns. find_var() returns the column-indices by default, but you can also print a small “summary” with the out-argument.

# find all variables with "dependency" in name or label
find_var(efc, "dependency", out = "table")
#>   col.nr var.name          var.label
#> 1      5   e42dep elder's dependency

Variable in column 5, named e42dep, is what we are looking for.
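With the variable name at hand, the cross table mentioned above could be computed along these lines. This is a sketch that only uses the variable names found so far; the `margin` value "row" is assumed here to request row percentages instead of counts (see ?flat_table).

```r
library(sjmisc)
data(efc)

# cross table of dependency (rows) by the carer's gender (columns)
flat_table(efc, e42dep, c161sex)

# the same table with row percentages instead of counts
flat_table(efc, e42dep, c161sex, margin = "row")
```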

Recoding variables

Next, we need the negative impact of care (neg_c_7) and want to create three groups: low, moderate and high negative impact. We can easily recode and label vectors with rec(). This function not only recodes vectors, it also allows direct labelling of categories inside the recode-syntax (this is optional; you can also use the val.labels-argument). We now recode neg_c_7 into a new variable burden. The cut-points are somewhat arbitrary and chosen for the sake of demonstration.

efc$burden <- rec(
  efc$neg_c_7,
  rec = c("min:9=1 [low]; 10:12=2 [moderate]; 13:max=3 [high]; else=NA"),
  var.label = "Subjective burden",
  as.num = FALSE # we want a factor
)
# print frequencies
frq(efc$burden)
#> Subjective burden (x) <categorical>
#> # total N=908  valid N=892  mean=2.03  sd=0.81
#>  val    label frq raw.prc valid.prc cum.prc
#>    1      low 280   30.84     31.39   31.39
#>    2 moderate 301   33.15     33.74   65.13
#>    3     high 311   34.25     34.87  100.00
#>   NA       NA  16    1.76        NA      NA

You can see that the variable burden has a variable label (“Subjective burden”), which was set inside rec(), as well as three values with labels (“low”, “moderate” and “high”). Values from the lowest value in neg_c_7 up to 9 were recoded into 1, values 10 to 12 into 2, and values from 13 to the highest value in neg_c_7 into 3. All remaining values are set to missing (else=NA; for details on the recode-syntax, see ?rec).
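The value labels do not have to be written into the recode-syntax. A sketch of the alternative mentioned above, passing the labels through the val.labels-argument instead; the name burden_alt is only illustrative and not part of the original example:

```r
library(sjmisc)
data(efc)

# same cut-points as before, but labels are supplied separately
burden_alt <- rec(
  efc$neg_c_7,
  rec = "min:9=1; 10:12=2; 13:max=3; else=NA",
  var.label = "Subjective burden",
  val.labels = c("low", "moderate", "high"),
  as.num = FALSE # return a factor
)

# print frequencies of the recoded variable
frq(burden_alt)
```

Keeping the labels in a separate vector can be easier to read when the recode-string becomes long.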

Grouped data frames

How is burden distributed by gender? We can group the data and print frequencies using frq() for this as well, as this function also accepts grouped data frames. Frequencies for grouped data frames first print the group-details (variable name and category), followed by the frequency table. Thanks to labelled data, the output is easy to understand.

efc %>% 
  select(burden, c161sex) %>% 
  group_by(c161sex) %>% 
  frq(burden)
#> Subjective burden (burden) <categorical>
#> # grouped by: Male
#> # total N=215  valid N=212  mean=1.91  sd=0.81
#>  val    label frq raw.prc valid.prc cum.prc
#>    1      low  80   37.21     37.74   37.74
#>    2 moderate  72   33.49     33.96   71.70
#>    3     high  60   27.91     28.30  100.00
#>   NA       NA   3    1.40        NA      NA
#> Subjective burden (burden) <categorical>
#> # grouped by: Female
#> # total N=686  valid N=679  mean=2.08  sd=0.81
#>  val    label frq raw.prc valid.prc cum.prc
#>    1      low 199   29.01     29.31   29.31
#>    2 moderate 229   33.38     33.73   63.03
#>    3     high 251   36.59     36.97  100.00
#>   NA       NA   7    1.02        NA      NA
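As a cross-check without labelled data, the valid percentages per gender can be approximated with base R. This sketch rebuilds the burden groups with cut(), which assumes that neg_c_7 holds integer scores, and it assumes that c161sex codes 1 as male and 2 as female, as the group order above suggests.

```r
library(sjmisc)
data(efc)

# rebuild the three burden groups with base R (assumes integer scores)
burden <- cut(
  efc$neg_c_7,
  breaks = c(-Inf, 9, 12, Inf),
  labels = c("low", "moderate", "high")
)

# table() ignores value labels, so the gender labels are added by hand
# (assumption: 1 = Male, 2 = Female)
gender <- factor(efc$c161sex, levels = c(1, 2), labels = c("Male", "Female"))

# counts and row percentages; cases with missing values are dropped
tab <- table(gender, burden)
row_pct <- round(100 * prop.table(tab, margin = 1), 2)
row_pct
```

The extra step of labelling gender by hand is exactly what frq() spares you when the data are labelled.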

Nested data frames

Let’s investigate the association between quality of life and burden across the different dependency categories, by fitting linear models for each category of e42dep. We can do this using nested data frames. nest() from the tidyr-package can create subsets of a data frame, based on grouping criteria, and create a new list-variable, where each element itself is a data frame (so it’s nested, because we have data frames inside a data frame).

In the following example, we group the data by e42dep, and “nest” the groups. Now we get a data frame with two columns: first, the grouping variable (e42dep), and second, the datasets (subsets) for each dependency category as data frames, stored in the list-variable data. The data frames in the subsets (in data) all contain the selected variables burden, c161sex and quol_5 (quality of life).

# convert variable to labelled factor, because we then 
# have the labels as factor levels in the output
efc$e42dep <- to_label(efc$e42dep, drop.levels = TRUE)
efc %>%
  select(e42dep, burden, c161sex, quol_5) %>%
  group_by(e42dep) %>%
  tidyr::nest()
#> # A tibble: 5 x 2
#>   e42dep               data              
#>   <fct>                <list>            
#> 1 moderately dependent <tibble [306 x 3]>
#> 2 severely dependent   <tibble [304 x 3]>
#> 3 independent          <tibble [66 x 3]> 
#> 4 slightly dependent   <tibble [225 x 3]>
#> 5 <NA>                 <tibble [7 x 3]>
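To make the idea of data frames inside a data frame more tangible, the nested subsets can be inspected like any other list. The following sketch uses the original neg_c_7 score instead of burden so that it runs on the unmodified efc data; the object names nested and sizes are illustrative.

```r
library(sjmisc)
library(dplyr)
data(efc)

nested <- efc %>%
  select(e42dep, neg_c_7, c161sex, quol_5) %>%
  group_by(e42dep) %>%
  tidyr::nest()

# number of rows in each nested subset
sizes <- purrr::map_int(nested$data, nrow)
sizes

# each element of the list-variable is an ordinary data frame
head(nested$data[[1]])
```

Taken together, the subsets contain every row of the original data exactly once, including the group with missing dependency.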

Get coefficients of nested models

Using map() from the purrr-package, we can iterate over this list and apply any function to each data frame in the list-variable “data”. We want to apply the lm()-function to the list-variable, to run linear models for all “dependency-datasets”. The results of these linear regressions are stored in another list-variable, models (created with mutate()). To quickly access and look at the coefficients, we can use spread_coef().

efc %>%
  select(e42dep, burden, c161sex, quol_5) %>%
  group_by(e42dep) %>%
  tidyr::nest() %>% 
  na.omit() %>%       # remove nested group for NA
  arrange(e42dep) %>% # arrange by order of levels
  mutate(models = purrr::map(
    data, ~ 
    lm(quol_5 ~ burden + c161sex, data = .))
  ) %>%
  spread_coef(models)
#> # A tibble: 4 x 7
#>   e42dep          data         models `(Intercept)` burden2 burden3 c161sex
#>   <fct>           <list>       <list>         <dbl>   <dbl>   <dbl>   <dbl>
#> 1 independent     <tibble [66~ <lm>            18.8   -3.16   -4.94  -0.709
#> 2 slightly depen~ <tibble [22~ <lm>            19.8   -2.20   -2.48  -1.14 
#> 3 moderately dep~ <tibble [30~ <lm>            17.9   -1.82   -5.29  -0.637
#> 4 severely depen~ <tibble [30~ <lm>            19.1   -3.66   -7.92  -0.746

We see that higher burden is associated with lower quality of life, for all dependency-groups. The se and p.val-arguments add standard errors and p-values to the output. model.term returns the statistics only for a specific term. If you specify a model.term, arguments se and p.val automatically default to TRUE.

efc %>%
  select(e42dep, burden, c161sex, quol_5) %>%
  group_by(e42dep) %>%
  tidyr::nest() %>% 
  na.omit() %>%       # remove nested group for NA
  arrange(e42dep) %>% # arrange by order of levels
  mutate(models = purrr::map(
    data, ~ 
    lm(quol_5 ~ burden + c161sex, data = .))
  ) %>%
  spread_coef(models, burden3)
#> # A tibble: 4 x 6
#>   e42dep               data               models burden3 std.error  p.value
#>   <fct>                <list>             <list>   <dbl>     <dbl>    <dbl>
#> 1 independent          <tibble [66 x 3]>  <lm>     -4.94     2.20  2.84e- 2
#> 2 slightly dependent   <tibble [225 x 3]> <lm>     -2.48     0.694 4.25e- 4
#> 3 moderately dependent <tibble [306 x 3]> <lm>     -5.29     0.669 5.22e-14
#> 4 severely dependent   <tibble [304 x 3]> <lm>     -7.92     0.875 2.10e-17
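To see what kind of information spread_coef() collects, the coefficients can also be pulled out of the model list by hand. This is only a rough sketch and not how spread_coef() is implemented; it uses the numeric neg_c_7 score instead of burden so that it runs on the unmodified efc data, and the names fits and coef_mat are illustrative.

```r
library(sjmisc)
library(dplyr)
data(efc)

fits <- efc %>%
  select(e42dep, neg_c_7, c161sex, quol_5) %>%
  filter(!is.na(e42dep)) %>%   # drop the group with missing dependency
  group_by(e42dep) %>%
  tidyr::nest() %>%
  arrange(e42dep) %>%
  mutate(models = purrr::map(
    data, ~ lm(quol_5 ~ neg_c_7 + c161sex, data = .x)
  ))

# one row per model, one column per coefficient
coef_mat <- t(sapply(fits$models, coef))
rownames(coef_mat) <- as.character(fits$e42dep)
coef_mat
```

Unlike this matrix, spread_coef() keeps the coefficients next to the nested data and the models in the same data frame, which is convenient for further piping.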