Generate a data dictionnary and search for variables with look_for()

Joseph Larmarange

Showing a summary of a data frame

Default printing of tibbles

It is a common need to easily get a description of all variables in a data frame.

When a data frame is converted into a tibble (e.g. with dplyr::as_tibble()), it as a nice printing showing the first rows of the data frame as well as the type of column.

library(dplyr)
iris %>% as_tibble()
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ... with 140 more rows

However, when you have too many variables, all of them cannot be printed and their are just listed.

data(fertility, package = "questionr")
women
## # A tibble: 2,000 x 17
##    id_woman id_household weight interview_date date_of_birth   age residency
##       <dbl>        <dbl>  <dbl> <date>         <date>        <dbl> <dbl+lbl>
##  1      391          381  1.80  2012-05-05     1997-03-07       15 2 [rural]
##  2     1643         1515  1.80  2012-01-23     1982-01-06       30 2 [rural]
##  3       85           85  1.80  2012-01-21     1979-01-01       33 2 [rural]
##  4      881          844  1.80  2012-01-06     1968-03-29       43 2 [rural]
##  5     1981         1797  1.80  2012-05-11     1986-05-25       25 2 [rural]
##  6     1072         1015  0.998 2012-02-20     1993-07-03       18 2 [rural]
##  7     1978         1794  0.998 2012-02-23     1967-01-28       45 2 [rural]
##  8     1607         1486  0.998 2012-02-20     1989-01-21       23 2 [rural]
##  9      738          711  0.192 2012-03-09     1962-07-24       49 2 [rural]
## 10     1656         1525  0.192 2012-03-15     1980-12-25       31 2 [rural]
## # ... with 1,990 more rows, and 10 more variables: region <dbl+lbl>,
## #   instruction <dbl+lbl>, employed <dbl+lbl>, matri <dbl+lbl>,
## #   religion <dbl+lbl>, newspaper <dbl+lbl>, radio <dbl+lbl>, tv <dbl+lbl>,
## #   ideal_nb_children <dbl+lbl>, test <dbl+lbl>

Note: in R console, value labels (if defined) are usually printed but they do not appear in a R markdown document like this vignette.

dplyr::glimpse()

The function dplyr::glimpse() allows you to have a quick look at all the variables in a data frame.

glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2,...
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa...
glimpse(women)
## Rows: 2,000
## Columns: 17
## $ id_woman          <dbl> 391, 1643, 85, 881, 1981, 1072, 1978, 1607, 738, ...
## $ id_household      <dbl> 381, 1515, 85, 844, 1797, 1015, 1794, 1486, 711, ...
## $ weight            <dbl> 1.803150, 1.803150, 1.803150, 1.803150, 1.803150,...
## $ interview_date    <date> 2012-05-05, 2012-01-23, 2012-01-21, 2012-01-06, ...
## $ date_of_birth     <date> 1997-03-07, 1982-01-06, 1979-01-01, 1968-03-29, ...
## $ age               <dbl> 15, 30, 33, 43, 25, 18, 45, 23, 49, 31, 26, 45, 2...
## $ residency         <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ region            <dbl+lbl> 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, ...
## $ instruction       <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, ...
## $ employed          <dbl+lbl> 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ matri             <dbl+lbl> 0, 2, 2, 2, 1, 0, 1, 1, 2, 5, 2, 3, 0, 2, 1, ...
## $ religion          <dbl+lbl> 1, 3, 2, 3, 2, 2, 3, 1, 3, 3, 2, 3, 2, 2, 2, ...
## $ newspaper         <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ radio             <dbl+lbl> 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, ...
## $ tv                <dbl+lbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, ...
## $ ideal_nb_children <dbl+lbl>  4,  4,  4,  4,  4,  5, 10,  5,  4,  5,  6, 1...
## $ test              <dbl+lbl> 0, 9, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, ...

It will show you the first values of each variable as well as the type of each variable. However, some important informations are not displayed:

labelled::look_for()

look_for() provided by the labelled package will print in the console a data dictionnary of all variables, showing variable labels when available, the type of variable and a list of values corresponding to:

library(labelled)
look_for(iris)
## pos   variable     label           col_type values          
## <chr> <chr>        <chr>           <chr>    <chr>           
## 1     Sepal.Length —               dbl      range: 4.3 - 7.9
## 2     Sepal.Width  —               dbl      range: 2 - 4.4  
## 3     Petal.Length Length of petal dbl      range: 1 - 6.9  
## 4     Petal.Width  Width of Petal  dbl      range: 0.1 - 2.5
## 5     Species      —               fct      setosa          
## <U+200B>      <U+200B>             <U+200B>                <U+200B>         versicolor      
## <U+200B>      <U+200B>             <U+200B>                <U+200B>         virginica
look_for(women)
## pos   variable        label                     col_type values                 
## <chr> <chr>           <chr>                     <chr>    <chr>                  
## 1     id_woman        Woman Id                  dbl      range: 1 - 2000        
## 2     id_household    Household Id              dbl      range: 1 - 1814        
## 3     weight          Sample weight             dbl      range: 0.044629 - 4.39~
## 4     interview_date  Interview date            date     range: 2011-12-01 - 20~
## 5     date_of_birth   Date of birth             date     range: 1962-02-07 - 19~
## 6     age             Age at last anniversary ~ dbl      range: 14 - 49         
## 7     residency       Urban / rural residency   dbl+lbl  [1] urban              
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [2] rural              
## 8     region          Region                    dbl+lbl  [1] North              
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [2] East               
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [3] South              
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [4] West               
## 9     instruction     Level of instruction      dbl+lbl  [0] none               
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [1] primary            
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [2] secondary          
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [3] higher             
## 10    employed        Employed?                 dbl+lbl  [0] no                 
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [1] yes                
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [9] missing            
## 11    matri           Matrimonial status        dbl+lbl  [0] single             
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [1] married            
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [2] living together    
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [3] windowed           
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [4] divorced           
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [5] separated          
## 12    religion        Religion                  dbl+lbl  [1] Muslim             
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [2] Christian          
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [3] Protestant         
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [4] no religion        
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [5] other              
## 13    newspaper       Read newspaper?           dbl+lbl  [0] no                 
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [1] yes                
## 14    radio           Listen to radio?          dbl+lbl  [0] no                 
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [1] yes                
## 15    tv              Watch TV?                 dbl+lbl  [0] no                 
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [1] yes                
## 16    ideal_nb_child~ Ideal number of children  dbl+lbl  [96] don't know        
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [99] missing           
## 17    test            Ever tested for HIV?      dbl+lbl  [0] no                 
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [1] yes                
## <U+200B>      <U+200B>                <U+200B>                          <U+200B>         [9] missing

Note that lookfor() and generate_dictionary() are synonyms of look_for() and works exactly in the same way.

Searching variables by key

When a data frame has dozens or even hundreds of variables, it could become difficult to find a specific variable. In such case, you can provide an optional list of keywords, which can be simple character strings or regular expression, to search for specific variables.

# Look for a single keyword.
look_for(iris, "petal")
## pos   variable     label           col_type values          
## <chr> <chr>        <chr>           <chr>    <chr>           
## 3     Petal.Length Length of petal dbl      range: 1 - 6.9  
## 4     Petal.Width  Width of Petal  dbl      range: 0.1 - 2.5
look_for(iris, "s")
## pos   variable     label col_type values          
## <chr> <chr>        <chr> <chr>    <chr>           
## 1     Sepal.Length —     dbl      range: 4.3 - 7.9
## 2     Sepal.Width  —     dbl      range: 2 - 4.4  
## 5     Species      —     fct      setosa          
## <U+200B>      <U+200B>             <U+200B>      <U+200B>         versicolor      
## <U+200B>      <U+200B>             <U+200B>      <U+200B>         virginica
# Look for with a regular expression
look_for(iris, "petal|species")
## pos   variable     label           col_type values          
## <chr> <chr>        <chr>           <chr>    <chr>           
## 3     Petal.Length Length of petal dbl      range: 1 - 6.9  
## 4     Petal.Width  Width of Petal  dbl      range: 0.1 - 2.5
## 5     Species      —               fct      setosa          
## <U+200B>      <U+200B>             <U+200B>                <U+200B>         versicolor      
## <U+200B>      <U+200B>             <U+200B>                <U+200B>         virginica
look_for(iris, "s$")
## pos   variable label col_type values    
## <chr> <chr>    <chr> <chr>    <chr>     
## 5     Species  —     fct      setosa    
## <U+200B>      <U+200B>         <U+200B>      <U+200B>         versicolor
## <U+200B>      <U+200B>         <U+200B>      <U+200B>         virginica
# Look for with several keywords
look_for(iris, "pet", "sp")
## pos   variable     label           col_type values          
## <chr> <chr>        <chr>           <chr>    <chr>           
## 3     Petal.Length Length of petal dbl      range: 1 - 6.9  
## 4     Petal.Width  Width of Petal  dbl      range: 0.1 - 2.5
## 5     Species      —               fct      setosa          
## <U+200B>      <U+200B>             <U+200B>                <U+200B>         versicolor      
## <U+200B>      <U+200B>             <U+200B>                <U+200B>         virginica
# Look_for will take variable labels into account
look_for(women, "read", "level")
## pos   variable    label                col_type values       
## <chr> <chr>       <chr>                <chr>    <chr>        
## 9     instruction Level of instruction dbl+lbl  [0] none     
## <U+200B>      <U+200B>            <U+200B>                     <U+200B>         [1] primary  
## <U+200B>      <U+200B>            <U+200B>                     <U+200B>         [2] secondary
## <U+200B>      <U+200B>            <U+200B>                     <U+200B>         [3] higher   
## 13    newspaper   Read newspaper?      dbl+lbl  [0] no       
## <U+200B>      <U+200B>            <U+200B>                     <U+200B>         [1] yes

By default, look_for() will look through both variable names and variables labels. Use labels = FALSE to look only through variable names.

look_for(women, "read")
## pos   variable  label           col_type values 
## <chr> <chr>     <chr>           <chr>    <chr>  
## 13    newspaper Read newspaper? dbl+lbl  [0] no 
## <U+200B>      <U+200B>          <U+200B>                <U+200B>         [1] yes
look_for(women, "read", labels = FALSE)
## Nothing found. Sorry.

Similarly, the search is by default case insensitive. To make the search case sensitive, use ignore.case = FALSE.

look_for(iris, "sepal")
## pos   variable     label col_type values          
## <chr> <chr>        <chr> <chr>    <chr>           
## 1     Sepal.Length —     dbl      range: 4.3 - 7.9
## 2     Sepal.Width  —     dbl      range: 2 - 4.4
look_for(iris, "sepal", ignore.case = FALSE)
## Nothing found. Sorry.

Advanced usages of look_for()

look_for() returns a detailed tibble which is summarized before printing. To deactivate default printing and see full results, simply use dplyr::as_tibble(), dplyr::glimpse() or even utils::View().

look_for(iris) %>% as_tibble()
## # A tibble: 5 x 13
##     pos variable label col_type class type  levels value_labels na_values
##   <int> <chr>    <chr> <chr>    <nam> <chr> <name> <named list> <named l>
## 1     1 Sepal.L~ <NA>  dbl      <chr~ doub~ <NULL> <NULL>       <NULL>   
## 2     2 Sepal.W~ <NA>  dbl      <chr~ doub~ <NULL> <NULL>       <NULL>   
## 3     3 Petal.L~ Leng~ dbl      <chr~ doub~ <NULL> <NULL>       <NULL>   
## 4     4 Petal.W~ Widt~ dbl      <chr~ doub~ <NULL> <NULL>       <NULL>   
## 5     5 Species  <NA>  fct      <chr~ inte~ <chr ~ <NULL>       <NULL>   
## # ... with 4 more variables: na_range <named list>, unique_values <int>,
## #   n_na <int>, range <named list>
glimpse(look_for(iris))
## Rows: 5
## Columns: 13
## $ pos           <int> 1, 2, 3, 4, 5
## $ variable      <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal...
## $ label         <chr> NA, NA, "Length of petal", "Width of Petal", NA
## $ col_type      <chr> "dbl", "dbl", "dbl", "dbl", "fct"
## $ class         <named list> ["numeric", "numeric", "numeric", "numeric", "...
## $ type          <chr> "double", "double", "double", "double", "integer"
## $ levels        <named list> [NULL, NULL, NULL, NULL, <"setosa", "versicolo...
## $ value_labels  <named list> [NULL, NULL, NULL, NULL, NULL]
## $ na_values     <named list> [NULL, NULL, NULL, NULL, NULL]
## $ na_range      <named list> [NULL, NULL, NULL, NULL, NULL]
## $ unique_values <int> 35, 23, 43, 22, 3
## $ n_na          <int> 0, 0, 0, 0, 0
## $ range         <named list> [<4.3, 7.9>, <2.0, 4.4>, <1.0, 6.9>, <0.1, 2.5...

The tibble returned by look_for() could be easily manipulated for advanced programming.

When a column has several values for one variable (e.g. levels or value_labels), results as stored with nested named list. You can convert named lists into simpler character vectors, you can use convert_list_columns_to_character().

look_for(iris) %>% convert_list_columns_to_character()
## # A tibble: 5 x 13
##     pos variable label col_type class type  levels value_labels na_values
##   <int> <chr>    <chr> <chr>    <chr> <chr> <chr>  <chr>        <chr>    
## 1     1 Sepal.L~ <NA>  dbl      nume~ doub~ ""     ""           ""       
## 2     2 Sepal.W~ <NA>  dbl      nume~ doub~ ""     ""           ""       
## 3     3 Petal.L~ Leng~ dbl      nume~ doub~ ""     ""           ""       
## 4     4 Petal.W~ Widt~ dbl      nume~ doub~ ""     ""           ""       
## 5     5 Species  <NA>  fct      fact~ inte~ "seto~ ""           ""       
## # ... with 4 more variables: na_range <chr>, unique_values <int>, n_na <int>,
## #   range <chr>

Alternatively, you can use lookfor_to_long_format() to transform results into a long format with one row per factor level and per value label.

look_for(iris) %>% lookfor_to_long_format()
## # A tibble: 7 x 13
##     pos variable label col_type class type  levels value_labels na_values
##   <int> <chr>    <chr> <chr>    <nam> <chr> <chr>  <chr>        <named l>
## 1     1 Sepal.L~ <NA>  dbl      <chr~ doub~ <NA>   <NA>         <NULL>   
## 2     2 Sepal.W~ <NA>  dbl      <chr~ doub~ <NA>   <NA>         <NULL>   
## 3     3 Petal.L~ Leng~ dbl      <chr~ doub~ <NA>   <NA>         <NULL>   
## 4     4 Petal.W~ Widt~ dbl      <chr~ doub~ <NA>   <NA>         <NULL>   
## 5     5 Species  <NA>  fct      <chr~ inte~ setosa <NA>         <NULL>   
## 6     5 Species  <NA>  fct      <chr~ inte~ versi~ <NA>         <NULL>   
## 7     5 Species  <NA>  fct      <chr~ inte~ virgi~ <NA>         <NULL>   
## # ... with 4 more variables: na_range <named list>, unique_values <int>,
## #   n_na <int>, range <named list>

Both can be combined:

look_for(women) %>%
  lookfor_to_long_format() %>%
  convert_list_columns_to_character()
## # A tibble: 41 x 13
##      pos variable label col_type class type  levels value_labels na_values
##    <int> <chr>    <chr> <chr>    <chr> <chr> <chr>  <chr>        <chr>    
##  1     1 id_woman Woma~ dbl      nume~ doub~ <NA>   <NA>         ""       
##  2     2 id_hous~ Hous~ dbl      nume~ doub~ <NA>   <NA>         ""       
##  3     3 weight   Samp~ dbl      nume~ doub~ <NA>   <NA>         ""       
##  4     4 intervi~ Inte~ date     Date  doub~ <NA>   <NA>         ""       
##  5     5 date_of~ Date~ date     Date  doub~ <NA>   <NA>         ""       
##  6     6 age      Age ~ dbl      nume~ doub~ <NA>   <NA>         ""       
##  7     7 residen~ Urba~ dbl+lbl  have~ doub~ <NA>   [1] urban    ""       
##  8     7 residen~ Urba~ dbl+lbl  have~ doub~ <NA>   [2] rural    ""       
##  9     8 region   Regi~ dbl+lbl  have~ doub~ <NA>   [1] North    ""       
## 10     8 region   Regi~ dbl+lbl  have~ doub~ <NA>   [2] East     ""       
## # ... with 31 more rows, and 4 more variables: na_range <chr>,
## #   unique_values <int>, n_na <int>, range <chr>

If you just want to use the search feature of look_for() without computing the details of each variable, simply indicate details = FALSE.

look_for(women, "id", details = FALSE)
##   pos variable          label                   
## <int> <chr>             <chr>                   
##     1 id_woman          Woman Id                
##     2 id_household      Household Id            
##     7 residency         Urban / rural residency 
##    16 ideal_nb_children Ideal number of children