The purpose of the labelled package is to provide functions to manipulate metadata such as variable labels, value labels and defined missing values, using the haven_labelled and haven_labelled_spss classes introduced in the haven package.
A variable label can be specified for any vector using var_label.
It’s possible to add a variable label to several columns of a data frame using a named list.
To get the variable label, simply call var_label.
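The calls described above can be sketched as follows. This is a minimal illustration on the built-in iris data, assuming the labelled package is installed; the label texts are simply the ones that appear in the output.

```r
library(labelled)

# var_label<- works on a local copy of the built-in iris data frame
var_label(iris$Sepal.Length) <- "Length of sepal"

# A named list labels several columns at once
var_label(iris) <- list(
  Petal.Length = "Length of petal",
  Petal.Width = "Width of Petal"
)

var_label(iris$Petal.Width)  # label of a single column
var_label(iris)              # named list of labels (NULL where unset)
```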
## [1] "Width of Petal"
## $Sepal.Length
## [1] "Length of sepal"
##
## $Sepal.Width
## NULL
##
## $Petal.Length
## [1] "Length of petal"
##
## $Petal.Width
## [1] "Width of Petal"
##
## $Species
## NULL
To remove a variable label, use NULL.
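As a short sketch, assigning NULL drops a label that was set earlier (the column and label text are reused from the example data):

```r
library(labelled)

var_label(iris$Sepal.Length) <- "Length of sepal"
var_label(iris$Sepal.Length) <- NULL  # remove the label again
var_label(iris$Sepal.Length)
```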
In RStudio, variable labels are displayed in the data viewer.
You can display and search through variable names and labels with look_for():
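A possible way to call it is sketched here, assuming two columns of iris have been labelled first. The search term "pet" is only an illustration, and the exact columns of the result may differ between package versions.

```r
library(labelled)

var_label(iris) <- list(
  Petal.Length = "Length of petal",
  Petal.Width = "Width of Petal"
)

look_for(iris)                # list every variable with its label
res <- look_for(iris, "pet")  # keep variables whose name or label matches "pet"
res
```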
## variable label
## 1 Sepal.Length <NA>
## 2 Sepal.Width <NA>
## 3 Petal.Length Length of petal
## 4 Petal.Width Width of Petal
## 5 Species <NA>
## variable label
## 3 Petal.Length Length of petal
## 4 Petal.Width Width of Petal
## variable label class type levels
## 1 Sepal.Length <NA> numeric double
## 2 Sepal.Width <NA> numeric double
## 3 Petal.Length Length of petal numeric double
## 4 Petal.Width Width of Petal numeric double
## 5 Species <NA> factor integer setosa; versicolor; virginica
## value_labels unique_values n_na na_values na_range
## 1 35 0
## 2 23 0
## 3 43 0
## 4 22 0
## 5 3 0
The first way to create a labelled vector is to use the labelled function. It is not mandatory to provide a label for each value observed in your vector. You can also provide a label for values that are not observed.
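A minimal sketch of such a call, using the values and labels shown in the output. Note that 2 has no label and 8 is labelled but never observed.

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, "don't know" = 8, refused = 9)
)
v
```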
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 no
## 8 don't know
## 9 refused
Use val_labels to get all value labels and val_label to get the value label associated with a specific value.
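For instance, with the vector from the previous example recreated here so the snippet stands alone:

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, "don't know" = 8, refused = 9)
)

val_labels(v)    # named vector of all value labels
val_label(v, 8)  # label attached to the value 8
```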
## yes no don't know refused
## 1 3 8 9
## [1] "don't know"
val_labels can also be used to modify all the value labels attached to a vector, while val_label will update only one specific value label.
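The two assignment forms might be used like this. It is a sketch in which the misspelt "nno" label is deliberate, so that the second call has something to correct.

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, "don't know" = 8, refused = 9)
)

# Replace the whole set of value labels
val_labels(v) <- c(yes = 1, nno = 3, bug = 5)
v

# Update only the label attached to the value 3
val_label(v, 3) <- "no"
v
```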
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 nno
## 5 bug
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 no
## 5 bug
With val_label, you can also add or remove specific value labels.
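A sketch of both operations, starting from a vector that carries the labels yes, no and bug:

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, bug = 5)
)

val_label(v, 2) <- "maybe"  # add a label for the value 2
val_label(v, 5) <- NULL     # remove the label attached to the value 5
v
```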
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 no
## 2 maybe
To remove all value labels, use val_labels and NULL. The labelled class will also be removed.
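For example (a minimal sketch):

```r
library(labelled)

v <- labelled(c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA), c(yes = 1, no = 3))

val_labels(v) <- NULL  # drop every value label; v becomes a plain vector
v
```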
## [1] 1 2 2 2 3 9 1 3 2 NA
Adding a value label to a non-labelled vector will apply the labelled class to it.
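A short sketch starting from a plain numeric vector:

```r
library(labelled)

v <- c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA)  # plain numeric vector
val_label(v, 1) <- "yes"               # v is now a labelled vector
v
```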
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
Note that applying val_labels
to a factor will have no effect!
## [1] 1 2 3
## Levels: 1 2 3
## [1] 1 2 3
## Levels: 1 2 3
You could also apply value labels to several columns of a data frame.
df <- data.frame(v1 = 1:3, v2 = c(2, 3, 1), v3 = 3:1)
val_label(df, 1) <- "yes"
val_label(df[, c("v1", "v3")], 2) <- "maybe"
val_label(df[, c("v2", "v3")], 3) <- "no"
val_labels(df)
## $v1
## yes maybe
## 1 2
##
## $v2
## yes no
## 1 3
##
## $v3
## yes maybe no
## 1 2 3
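The three outputs that follow can plausibly be reproduced along these lines. This is a sketch that rebuilds the same data frame; the exact calls are an assumption inferred from the printed results.

```r
library(labelled)

df <- data.frame(v1 = 1:3, v2 = c(2, 3, 1), v3 = 3:1)
val_label(df, 1) <- "yes"
val_label(df[, c("v2", "v3")], 3) <- "no"

# Replace all value labels of two columns at once
val_labels(df[, c("v1", "v3")]) <- c(YES = 1, MAYBE = 2, NO = 3)
val_labels(df)

# Remove every value label in the data frame
val_labels(df) <- NULL
val_labels(df)

# A named list sets different labels per column; v3 is left untouched
val_labels(df) <- list(v1 = c(yes = 1, no = 3), v2 = c(a = 1, b = 2, c = 3))
val_labels(df)
```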
## $v1
## YES MAYBE NO
## 1 2 3
##
## $v2
## yes no
## 1 3
##
## $v3
## YES MAYBE NO
## 1 2 3
## $v1
## NULL
##
## $v2
## NULL
##
## $v3
## NULL
## $v1
## yes no
## 1 3
##
## $v2
## a b c
## 1 2 3
##
## $v3
## NULL
By default, value labels are kept in the order in which they were created.
v <- c(1,2,2,2,3,9,1,3,2,NA)
val_label(v, 1) <- "yes"
val_label(v, 3) <- "no"
val_label(v, 9) <- "refused"
val_label(v, 2) <- "maybe"
val_label(v, 8) <- "don't know"
v
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 no
## 9 refused
## 2 maybe
## 8 don't know
It could be useful to reorder the value labels according to their attached values.
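One way to do this is sort_val_labels(). The sketch below recreates the vector with labelled() so that it runs on its own.

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, refused = 9, maybe = 2, "don't know" = 8)
)

asc <- sort_val_labels(v)                      # ascending order of values
desc <- sort_val_labels(v, decreasing = TRUE)  # descending order of values
asc
desc
```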
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 2 maybe
## 3 no
## 8 don't know
## 9 refused
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 9 refused
## 8 don't know
## 3 no
## 2 maybe
## 1 yes
If you prefer, you can also sort them according to the labels.
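A sketch using the according_to argument of sort_val_labels(), with the same vector as above:

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, refused = 9, maybe = 2, "don't know" = 8)
)

by_label <- sort_val_labels(v, according_to = "labels")
by_label
```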
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 8 don't know
## 2 maybe
## 3 no
## 9 refused
## 1 yes
haven (>= 2.0.0) introduced an additional haven_labelled_spss class to deal with user-defined missing values. In such cases, additional attributes are used to indicate which values should be considered as missing, but these values are not stored as internal NA values. Note that most R functions will not take this information into account. Therefore, you will have to convert missing values into NA if required before analysis. These defined missing values can co-exist with internal NA values.
It is possible to manipulate these missing values with na_values() and na_range(). Note that is.na() will return TRUE for user-defined missing values as well.
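A minimal sketch of both setters, assuming the value 9 stands for "don't know" as in the output below:

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, "don't know" = 9)
)

# Declare 9 as a user-defined missing value
na_values(v) <- 9
na_values(v)
flagged <- is.na(v)  # TRUE for the value 9 and for the internal NA
flagged

# Replace the discrete definition with a range of missing values
na_values(v) <- NULL
na_range(v) <- c(5, Inf)
na_range(v)
```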
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 no
## 9 don't know
## [1] 9
## <Labelled SPSS double>
## [1] 1 2 2 2 3 9 1 3 2 NA
## Missing values: 9
##
## Labels:
## value label
## 1 yes
## 3 no
## 9 don't know
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 no
## 9 don't know
## [1] 5 Inf
## <Labelled SPSS double>
## [1] 1 2 2 2 3 9 1 3 2 NA
## Missing range: [5, Inf]
##
## Labels:
## value label
## 1 yes
## 3 no
## 9 don't know
Since version 2.1.0, it is not mandatory to define at least one value label before defining missing values.
To convert user-defined missing values into NA, simply use user_na_to_na().
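For example, with a vector built by labelled_spss() (a sketch using the values shown in the output):

```r
library(labelled)

v <- labelled_spss(1:10, c(Good = 1, Bad = 8), na_values = c(9, 10))
v

v2 <- user_na_to_na(v)  # 9 and 10 become regular NA
v2
```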
## <Labelled SPSS integer>
## [1] 1 2 3 4 5 6 7 8 9 10
## Missing values: 9, 10
##
## Labels:
## value label
## 1 Good
## 8 Bad
## <Labelled integer>
## [1] 1 2 3 4 5 6 7 8 NA NA
##
## Labels:
## value label
## 1 Good
## 8 Bad
You can also remove the definition of user-defined missing values without converting these values to NA.
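Two ways of doing so are sketched here. Which of them produced each of the outputs shown is an assumption, and the accompanying message may vary between package versions.

```r
library(labelled)

v <- labelled_spss(1:10, c(Good = 1, Bad = 8), na_values = c(9, 10))

# Option 1: clear the definition through the setter
v2 <- v
na_values(v2) <- NULL
v2

# Option 2: use the dedicated helper (values are kept, not converted to NA)
v3 <- remove_user_na(v)
v3
```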
## <Labelled SPSS integer>
## [1] 1 2 3 4 5 6 7 8 9 10
## Missing values: 9, 10
##
## Labels:
## value label
## 1 Good
## 8 Bad
## Some user defined missing values have been removed but not converted to NA.
## <Labelled integer>
## [1] 1 2 3 4 5 6 7 8 9 10
##
## Labels:
## value label
## 1 Good
## 8 Bad
or
## <Labelled SPSS integer>
## [1] 1 2 3 4 5 6 7 8 9 10
## Missing values: 9, 10
##
## Labels:
## value label
## 1 Good
## 8 Bad
## <Labelled integer>
## [1] 1 2 3 4 5 6 7 8 9 10
##
## Labels:
## value label
## 1 Good
## 8 Bad
In some cases, values that do not have an attached value label can be considered as missing. nolabel_to_na will convert them to NA.
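A small sketch in which the value 9 carries no label and is therefore turned into NA:

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, maybe = 2, no = 3)
)

v2 <- nolabel_to_na(v)
v2
```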
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 2 maybe
## 3 no
## <Labelled double>
## [1] 1 2 2 2 3 NA 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 2 maybe
## 3 no
In other cases, a value label is attached only to specific values that correspond to a missing value. For example:
## <Labelled double>
## [1] 1.88 1.62 1.78 99.00 1.91
##
## Labels:
## value label
## 99 not measured
In such cases, val_labels_to_na could be appropriate.
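A sketch with a hypothetical vector named size, where 99 is the code for "not measured":

```r
library(labelled)

size <- labelled(c(1.88, 1.62, 1.78, 99, 1.91), c("not measured" = 99))

cleaned <- val_labels_to_na(size)  # every labelled value becomes NA
cleaned
```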
## [1] 1.88 1.62 1.78 NA 1.91
These two functions can also be applied to a whole data frame. Only labelled vectors will be affected.
A labelled vector can easily be converted to a factor with to_factor.
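A minimal sketch of the conversion with default arguments:

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, "don't know" = 8, refused = 9)
)
v

f <- to_factor(v)  # values without a label keep their value as level
f
```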
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 no
## 8 don't know
## 9 refused
## [1] yes 2 2 2 no refused yes no 2
## [10] <NA>
## Levels: yes 2 no don't know refused
The levels argument lets you specify what should be used as the factor levels, i.e. the labels (default), the values, or the labels prefixed with values.
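The two non-default choices can be sketched as:

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, "don't know" = 8, refused = 9)
)

f_values <- to_factor(v, levels = "values")      # levels are the raw values
f_prefixed <- to_factor(v, levels = "prefixed")  # levels look like "[1] yes"
f_values
f_prefixed
```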
## [1] 1 2 2 2 3 9 1 3 2 <NA>
## Levels: 1 2 3 8 9
## [1] [1] yes [2] 2 [2] 2 [2] 2 [3] no [9] refused
## [7] [1] yes [3] no [2] 2 <NA>
## Levels: [1] yes [2] 2 [3] no [8] don't know [9] refused
The ordered argument will create an ordinal factor.
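For instance:

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, "don't know" = 8, refused = 9)
)

f_ord <- to_factor(v, ordered = TRUE)
f_ord
```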
## [1] yes 2 2 2 no refused yes no 2
## [10] <NA>
## Levels: yes < 2 < no < don't know < refused
The argument nolabel_to_na specifies whether the corresponding function should be applied before converting to a factor. Therefore, the following two commands are equivalent.
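The two commands in question would look like this (a sketch on the same vector as before):

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, "don't know" = 8, refused = 9)
)

a <- to_factor(v, nolabel_to_na = TRUE)
b <- to_factor(nolabel_to_na(v))
a
b
```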
## [1] yes <NA> <NA> <NA> no refused yes no <NA>
## [10] <NA>
## Levels: yes no don't know refused
## [1] yes <NA> <NA> <NA> no refused yes no <NA>
## [10] <NA>
## Levels: yes no don't know refused
sort_levels specifies how the levels should be sorted: "none" to keep the order in which value labels have been defined, "values" to order the levels according to the values and "labels" according to the labels. "auto" (default) will be equivalent to "none" except if some values with no attached labels are found and are not dropped. In that case, "values" will be used.
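A sketch of the three explicit settings, applied to the vector used above:

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, "don't know" = 8, refused = 9)
)

f_none <- to_factor(v, sort_levels = "none")      # definition order, unlabelled values last
f_values <- to_factor(v, sort_levels = "values")  # ordered by value
f_labels <- to_factor(v, sort_levels = "labels")  # ordered by label text
f_none
f_values
f_labels
```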
## [1] yes 2 2 2 no refused yes no 2
## [10] <NA>
## Levels: yes no don't know refused 2
## [1] yes 2 2 2 no refused yes no 2
## [10] <NA>
## Levels: yes 2 no don't know refused
## [1] yes 2 2 2 no refused yes no 2
## [10] <NA>
## Levels: 2 don't know no refused yes
The function to_labelled can be used to turn a factor into a labelled numeric vector.
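A minimal sketch with a three-level factor:

```r
library(labelled)

f <- factor(1:3, labels = c("a", "b", "c"))
x <- to_labelled(f)  # level positions become values, level names become labels
x
```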
## <Labelled double>
## [1] 1 2 3
##
## Labels:
## value label
## 1 a
## 2 b
## 3 c
Note that to_labelled(to_factor(v)) will not be equal to v due to the way factors are stored internally by R.
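The round trip can be sketched as follows. Because factor levels are stored as consecutive integers, the original codes 8 and 9 are not preserved.

```r
library(labelled)

v <- labelled(
  c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA),
  c(yes = 1, no = 3, "don't know" = 8, refused = 9)
)

v2 <- to_labelled(to_factor(v))
v
v2
```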
## <Labelled double>
## [1] 1 2 2 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 no
## 8 don't know
## 9 refused
## <Labelled double>
## [1] 1 2 2 2 3 5 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 2 2
## 3 no
## 4 don't know
## 5 refused
In the haven package, read_spss, read_stata and read_sas natively import data using the labelled class and the label attribute for variable labels.
Functions from the foreign package can also import some metadata from SPSS and Stata files. to_labelled can convert data imported with foreign into a labelled data frame. However, there are some limitations compared to using haven:
- For SPSS files, set use.value.labels = FALSE, to.data.frame = FALSE and use.missings = FALSE when calling read.spss. If use.value.labels = TRUE, variables with value labels will be converted into factors by read.spss (and kept as factors by foreign_to_labelled). If to.data.frame = TRUE, metadata describing the missing values will not be imported. If use.missings = TRUE, missing values will have been converted to NA by read.spss.
- For Stata files, set convert.factors = FALSE when calling read.dta to avoid conversion of variables with value labels into factors. So far, missing values defined in Stata are always imported as NA by read.dta and cannot be retrieved by foreign_to_labelled.

The memisc package provides functions to import variable metadata and store them in a specific object of class data.set. The to_labelled method can convert a data.set into a labelled data frame.
# from foreign
library(foreign)
# SPSS file: keep raw codes and missing-value metadata
df <- to_labelled(read.spss(
  "file.sav",
  to.data.frame = FALSE,
  use.value.labels = FALSE,
  use.missings = FALSE
))
# Stata file: do not convert labelled variables into factors
df <- to_labelled(read.dta(
  "file.dta",
  convert.factors = FALSE
))
# from memisc
library(memisc)
nes1948.por <- UnZip("anes/NES1948.ZIP", "NES1948.POR", package = "memisc")
nes1948 <- spss.portable.file(nes1948.por)
df <- to_labelled(nes1948)
# the same conversion also works on a data.set object
ds <- as.data.set(nes1948)
df <- to_labelled(ds)
If you are using the %>% operator, you can use the functions set_variable_labels, set_value_labels, add_value_labels and remove_value_labels.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- tibble(s1 = c("M", "M", "F"), s2 = c(1, 1, 2)) %>%
  set_variable_labels(s1 = "Sex", s2 = "Question") %>%
  set_value_labels(s1 = c(Male = "M", Female = "F"), s2 = c(Yes = 1, No = 2))
df$s2
## <Labelled double>: Question
## [1] 1 1 2
##
## Labels:
## value label
## 1 Yes
## 2 No
set_value_labels will replace the list of value labels while add_value_labels will update it.
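The difference can be sketched as follows, assuming dplyr is loaded for tibble() and %>%. The labels are those shown in the output.

```r
library(dplyr)
library(labelled)

df <- tibble(s1 = c("M", "M", "F"), s2 = c(1, 1, 2)) %>%
  set_variable_labels(s2 = "Question") %>%
  set_value_labels(s2 = c(Yes = 1, No = 2))

# set_value_labels() replaces the whole set: the label "No" is dropped
df <- df %>%
  set_value_labels(s2 = c(Yes = 1, "Don't know" = 8, Unknown = 9))
df$s2

# add_value_labels() keeps the existing labels and adds "No" back
df <- df %>%
  add_value_labels(s2 = c(No = 2))
df$s2
```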
## <Labelled double>: Question
## [1] 1 1 2
##
## Labels:
## value label
## 1 Yes
## 8 Don't know
## 9 Unknown
## <Labelled double>: Question
## [1] 1 1 2
##
## Labels:
## value label
## 1 Yes
## 8 Don't know
## 9 Unknown
## 2 No
You can also remove some variable and/or value labels.
# removing a variable label
df <- df %>%
  set_variable_labels(s1 = NULL)
# removing one value label
df <- df %>%
  remove_value_labels(s2 = 2)
df$s2
## <Labelled double>: Question
## [1] 1 1 2
##
## Labels:
## value label
## 1 Yes
## 8 Don't know
## 9 Unknown
## <Labelled double>: Question
## [1] 1 1 2
##
## Labels:
## value label
## 1 Yes
## [1] 1 1 2
## attr(,"label")
## [1] "Question"
To convert variables, you can use functions such as mutate_if or mutate_at. See the example below.
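Before the output on the fertility survey data, here is a self-contained sketch of the same idea on a small hypothetical tibble (the column names id, s1 and s2 are illustrative only):

```r
library(dplyr)
library(labelled)

df <- tibble(
  id = 1:3,
  s1 = labelled(c("M", "M", "F"), c(Male = "M", Female = "F")),
  s2 = labelled(c(1, 1, 2), c(Yes = 1, No = 2))
)

# Convert every labelled column to a factor
all_fct <- df %>% mutate_if(is.labelled, to_factor)
glimpse(all_fct)

# Convert only selected columns
only_s2 <- df %>% mutate_at(vars(s2), to_factor)
glimpse(only_s2)
```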
##
## Attaching package: 'questionr'
## The following object is masked from 'package:labelled':
##
## lookfor
## Observations: 2,000
## Variables: 17
## $ id_woman <dbl> 391, 1643, 85, 881, 1981, 1072, 1978, 1607, 738, 16…
## $ id_household <dbl> 381, 1515, 85, 844, 1797, 1015, 1794, 1486, 711, 15…
## $ weight <dbl> 1.803150, 1.803150, 1.803150, 1.803150, 1.803150, 0…
## $ interview_date <date> 2012-05-05, 2012-01-23, 2012-01-21, 2012-01-06, 20…
## $ date_of_birth <date> 1997-03-07, 1982-01-06, 1979-01-01, 1968-03-29, 19…
## $ age <dbl> 15, 30, 33, 43, 25, 18, 45, 23, 49, 31, 26, 45, 25,…
## $ residency <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ region <dbl+lbl> 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,…
## $ instruction <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0,…
## $ employed <dbl+lbl> 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ matri <dbl+lbl> 0, 2, 2, 2, 1, 0, 1, 1, 2, 5, 2, 3, 0, 2, 1, 2,…
## $ religion <dbl+lbl> 1, 3, 2, 3, 2, 2, 3, 1, 3, 3, 2, 3, 2, 2, 2, 2,…
## $ newspaper <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ radio <dbl+lbl> 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0,…
## $ tv <dbl+lbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,…
## $ ideal_nb_children <dbl+lbl> 4, 4, 4, 4, 4, 5, 10, 5, 4, 5, 6, 10, 2, 6, 6, …
## $ test <dbl+lbl> 0, 9, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0,…
## Observations: 2,000
## Variables: 17
## $ id_woman <dbl> 391, 1643, 85, 881, 1981, 1072, 1978, 1607, 738, 16…
## $ id_household <dbl> 381, 1515, 85, 844, 1797, 1015, 1794, 1486, 711, 15…
## $ weight <dbl> 1.803150, 1.803150, 1.803150, 1.803150, 1.803150, 0…
## $ interview_date <date> 2012-05-05, 2012-01-23, 2012-01-21, 2012-01-06, 20…
## $ date_of_birth <date> 1997-03-07, 1982-01-06, 1979-01-01, 1968-03-29, 19…
## $ age <dbl> 15, 30, 33, 43, 25, 18, 45, 23, 49, 31, 26, 45, 25,…
## $ residency <fct> rural, rural, rural, rural, rural, rural, rural, ru…
## $ region <fct> West, West, West, West, West, South, South, South, …
## $ instruction <fct> none, none, none, none, primary, none, none, none, …
## $ employed <fct> yes, yes, no, yes, yes, no, yes, no, yes, yes, yes,…
## $ matri <fct> single, living together, living together, living to…
## $ religion <fct> Muslim, Protestant, Christian, Protestant, Christia…
## $ newspaper <fct> no, no, no, no, no, no, no, no, no, no, no, no, no,…
## $ radio <fct> no, yes, yes, no, no, yes, yes, no, no, no, yes, ye…
## $ tv <fct> no, no, no, no, no, yes, no, no, no, no, yes, yes, …
## $ ideal_nb_children <fct> 4, 4, 4, 4, 4, 5, 10, 5, 4, 5, 6, 10, 2, 6, 6, 6, 4…
## $ test <fct> no, missing, no, no, yes, no, no, no, no, yes, yes,…
## Observations: 2,000
## Variables: 17
## $ id_woman <dbl> 391, 1643, 85, 881, 1981, 1072, 1978, 1607, 738, 16…
## $ id_household <dbl> 381, 1515, 85, 844, 1797, 1015, 1794, 1486, 711, 15…
## $ weight <dbl> 1.803150, 1.803150, 1.803150, 1.803150, 1.803150, 0…
## $ interview_date <date> 2012-05-05, 2012-01-23, 2012-01-21, 2012-01-06, 20…
## $ date_of_birth <date> 1997-03-07, 1982-01-06, 1979-01-01, 1968-03-29, 19…
## $ age <dbl> 15, 30, 33, 43, 25, 18, 45, 23, 49, 31, 26, 45, 25,…
## $ residency <fct> rural, rural, rural, rural, rural, rural, rural, ru…
## $ region <fct> West, West, West, West, West, South, South, South, …
## $ instruction <fct> none, none, none, none, primary, none, none, none, …
## $ employed <fct> yes, yes, no, yes, yes, no, yes, no, yes, yes, yes,…
## $ matri <fct> single, living together, living together, living to…
## $ religion <fct> Muslim, Protestant, Christian, Protestant, Christia…
## $ newspaper <fct> no, no, no, no, no, no, no, no, no, no, no, no, no,…
## $ radio <fct> no, yes, yes, no, no, yes, yes, no, no, no, yes, ye…
## $ tv <fct> no, no, no, no, no, yes, no, no, no, no, yes, yes, …
## $ ideal_nb_children <fct> 4, 4, 4, 4, 4, 5, 10, 5, 4, 5, 6, 10, 2, 6, 6, 6, 4…
## $ test <fct> no, missing, no, no, yes, no, no, no, no, yes, yes,…
## Observations: 2,000
## Variables: 17
## $ id_woman <dbl> 391, 1643, 85, 881, 1981, 1072, 1978, 1607, 738, 16…
## $ id_household <dbl> 381, 1515, 85, 844, 1797, 1015, 1794, 1486, 711, 15…
## $ weight <dbl> 1.803150, 1.803150, 1.803150, 1.803150, 1.803150, 0…
## $ interview_date <date> 2012-05-05, 2012-01-23, 2012-01-21, 2012-01-06, 20…
## $ date_of_birth <date> 1997-03-07, 1982-01-06, 1979-01-01, 1968-03-29, 19…
## $ age <dbl> 15, 30, 33, 43, 25, 18, 45, 23, 49, 31, 26, 45, 25,…
## $ residency <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ region <dbl+lbl> 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,…
## $ instruction <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0,…
## $ employed <fct> yes, yes, no, yes, yes, no, yes, no, yes, yes, yes,…
## $ matri <fct> single, living together, living together, living to…
## $ religion <fct> Muslim, Protestant, Christian, Protestant, Christia…
## $ newspaper <fct> no, no, no, no, no, no, no, no, no, no, no, no, no,…
## $ radio <fct> no, yes, yes, no, no, yes, yes, no, no, no, yes, ye…
## $ tv <fct> no, no, no, no, no, yes, no, no, no, no, yes, yes, …
## $ ideal_nb_children <dbl+lbl> 4, 4, 4, 4, 4, 5, 10, 5, 4, 5, 6, 10, 2, 6, 6, …
## $ test <dbl+lbl> 0, 9, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0,…