look_for()
It is a common need to easily get a description of all variables in a data frame.
When a data frame is converted into a tibble (e.g. with dplyr::as_tibble()
), it as a nice printing showing the first rows of the data frame as well as the type of column.
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
However, when you have too many variables, all of them cannot be printed and their are just listed.
## # A tibble: 2,000 x 17
## id_woman id_household weight interview_date date_of_birth age residency
## <dbl> <dbl> <dbl> <date> <date> <dbl> <dbl+lbl>
## 1 391 381 1.80 2012-05-05 1997-03-07 15 2 [rural]
## 2 1643 1515 1.80 2012-01-23 1982-01-06 30 2 [rural]
## 3 85 85 1.80 2012-01-21 1979-01-01 33 2 [rural]
## 4 881 844 1.80 2012-01-06 1968-03-29 43 2 [rural]
## 5 1981 1797 1.80 2012-05-11 1986-05-25 25 2 [rural]
## 6 1072 1015 0.998 2012-02-20 1993-07-03 18 2 [rural]
## 7 1978 1794 0.998 2012-02-23 1967-01-28 45 2 [rural]
## 8 1607 1486 0.998 2012-02-20 1989-01-21 23 2 [rural]
## 9 738 711 0.192 2012-03-09 1962-07-24 49 2 [rural]
## 10 1656 1525 0.192 2012-03-15 1980-12-25 31 2 [rural]
## # ... with 1,990 more rows, and 10 more variables: region <dbl+lbl>,
## # instruction <dbl+lbl>, employed <dbl+lbl>, matri <dbl+lbl>,
## # religion <dbl+lbl>, newspaper <dbl+lbl>, radio <dbl+lbl>, tv <dbl+lbl>,
## # ideal_nb_children <dbl+lbl>, test <dbl+lbl>
Note: in R console, value labels (if defined) are usually printed but they do not appear in a R markdown document like this vignette.
dplyr::glimpse()
The function dplyr::glimpse()
allows you to have a quick look at all the variables in a data frame.
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4,...
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5,...
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2,...
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa...
## Rows: 2,000
## Columns: 17
## $ id_woman <dbl> 391, 1643, 85, 881, 1981, 1072, 1978, 1607, 738, ...
## $ id_household <dbl> 381, 1515, 85, 844, 1797, 1015, 1794, 1486, 711, ...
## $ weight <dbl> 1.803150, 1.803150, 1.803150, 1.803150, 1.803150,...
## $ interview_date <date> 2012-05-05, 2012-01-23, 2012-01-21, 2012-01-06, ...
## $ date_of_birth <date> 1997-03-07, 1982-01-06, 1979-01-01, 1968-03-29, ...
## $ age <dbl> 15, 30, 33, 43, 25, 18, 45, 23, 49, 31, 26, 45, 2...
## $ residency <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ region <dbl+lbl> 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, ...
## $ instruction <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, ...
## $ employed <dbl+lbl> 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ matri <dbl+lbl> 0, 2, 2, 2, 1, 0, 1, 1, 2, 5, 2, 3, 0, 2, 1, ...
## $ religion <dbl+lbl> 1, 3, 2, 3, 2, 2, 3, 1, 3, 3, 2, 3, 2, 2, 2, ...
## $ newspaper <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ radio <dbl+lbl> 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, ...
## $ tv <dbl+lbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, ...
## $ ideal_nb_children <dbl+lbl> 4, 4, 4, 4, 4, 5, 10, 5, 4, 5, 6, 1...
## $ test <dbl+lbl> 0, 9, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, ...
It will show you the first values of each variable as well as the type of each variable. However, some important informations are not displayed:
labelled::look_for()
look_for()
provided by the labelled
package will print in the console a data dictionnary of all variables, showing variable labels when available, the type of variable and a list of values corresponding to:
## pos variable label col_type values
## <chr> <chr> <chr> <chr> <chr>
## 1 Sepal.Length — dbl range: 4.3 - 7.9
## 2 Sepal.Width — dbl range: 2 - 4.4
## 3 Petal.Length Length of petal dbl range: 1 - 6.9
## 4 Petal.Width Width of Petal dbl range: 0.1 - 2.5
## 5 Species — fct setosa
## <U+200B> <U+200B> <U+200B> <U+200B> versicolor
## <U+200B> <U+200B> <U+200B> <U+200B> virginica
## pos variable label col_type values
## <chr> <chr> <chr> <chr> <chr>
## 1 id_woman Woman Id dbl range: 1 - 2000
## 2 id_household Household Id dbl range: 1 - 1814
## 3 weight Sample weight dbl range: 0.044629 - 4.39~
## 4 interview_date Interview date date range: 2011-12-01 - 20~
## 5 date_of_birth Date of birth date range: 1962-02-07 - 19~
## 6 age Age at last anniversary ~ dbl range: 14 - 49
## 7 residency Urban / rural residency dbl+lbl [1] urban
## <U+200B> <U+200B> <U+200B> <U+200B> [2] rural
## 8 region Region dbl+lbl [1] North
## <U+200B> <U+200B> <U+200B> <U+200B> [2] East
## <U+200B> <U+200B> <U+200B> <U+200B> [3] South
## <U+200B> <U+200B> <U+200B> <U+200B> [4] West
## 9 instruction Level of instruction dbl+lbl [0] none
## <U+200B> <U+200B> <U+200B> <U+200B> [1] primary
## <U+200B> <U+200B> <U+200B> <U+200B> [2] secondary
## <U+200B> <U+200B> <U+200B> <U+200B> [3] higher
## 10 employed Employed? dbl+lbl [0] no
## <U+200B> <U+200B> <U+200B> <U+200B> [1] yes
## <U+200B> <U+200B> <U+200B> <U+200B> [9] missing
## 11 matri Matrimonial status dbl+lbl [0] single
## <U+200B> <U+200B> <U+200B> <U+200B> [1] married
## <U+200B> <U+200B> <U+200B> <U+200B> [2] living together
## <U+200B> <U+200B> <U+200B> <U+200B> [3] windowed
## <U+200B> <U+200B> <U+200B> <U+200B> [4] divorced
## <U+200B> <U+200B> <U+200B> <U+200B> [5] separated
## 12 religion Religion dbl+lbl [1] Muslim
## <U+200B> <U+200B> <U+200B> <U+200B> [2] Christian
## <U+200B> <U+200B> <U+200B> <U+200B> [3] Protestant
## <U+200B> <U+200B> <U+200B> <U+200B> [4] no religion
## <U+200B> <U+200B> <U+200B> <U+200B> [5] other
## 13 newspaper Read newspaper? dbl+lbl [0] no
## <U+200B> <U+200B> <U+200B> <U+200B> [1] yes
## 14 radio Listen to radio? dbl+lbl [0] no
## <U+200B> <U+200B> <U+200B> <U+200B> [1] yes
## 15 tv Watch TV? dbl+lbl [0] no
## <U+200B> <U+200B> <U+200B> <U+200B> [1] yes
## 16 ideal_nb_child~ Ideal number of children dbl+lbl [96] don't know
## <U+200B> <U+200B> <U+200B> <U+200B> [99] missing
## 17 test Ever tested for HIV? dbl+lbl [0] no
## <U+200B> <U+200B> <U+200B> <U+200B> [1] yes
## <U+200B> <U+200B> <U+200B> <U+200B> [9] missing
Note that lookfor()
and generate_dictionary()
are synonyms of look_for()
and works exactly in the same way.
When a data frame has dozens or even hundreds of variables, it could become difficult to find a specific variable. In such case, you can provide an optional list of keywords, which can be simple character strings or regular expression, to search for specific variables.
## pos variable label col_type values
## <chr> <chr> <chr> <chr> <chr>
## 3 Petal.Length Length of petal dbl range: 1 - 6.9
## 4 Petal.Width Width of Petal dbl range: 0.1 - 2.5
## pos variable label col_type values
## <chr> <chr> <chr> <chr> <chr>
## 1 Sepal.Length — dbl range: 4.3 - 7.9
## 2 Sepal.Width — dbl range: 2 - 4.4
## 5 Species — fct setosa
## <U+200B> <U+200B> <U+200B> <U+200B> versicolor
## <U+200B> <U+200B> <U+200B> <U+200B> virginica
## pos variable label col_type values
## <chr> <chr> <chr> <chr> <chr>
## 3 Petal.Length Length of petal dbl range: 1 - 6.9
## 4 Petal.Width Width of Petal dbl range: 0.1 - 2.5
## 5 Species — fct setosa
## <U+200B> <U+200B> <U+200B> <U+200B> versicolor
## <U+200B> <U+200B> <U+200B> <U+200B> virginica
## pos variable label col_type values
## <chr> <chr> <chr> <chr> <chr>
## 5 Species — fct setosa
## <U+200B> <U+200B> <U+200B> <U+200B> versicolor
## <U+200B> <U+200B> <U+200B> <U+200B> virginica
## pos variable label col_type values
## <chr> <chr> <chr> <chr> <chr>
## 3 Petal.Length Length of petal dbl range: 1 - 6.9
## 4 Petal.Width Width of Petal dbl range: 0.1 - 2.5
## 5 Species — fct setosa
## <U+200B> <U+200B> <U+200B> <U+200B> versicolor
## <U+200B> <U+200B> <U+200B> <U+200B> virginica
## pos variable label col_type values
## <chr> <chr> <chr> <chr> <chr>
## 9 instruction Level of instruction dbl+lbl [0] none
## <U+200B> <U+200B> <U+200B> <U+200B> [1] primary
## <U+200B> <U+200B> <U+200B> <U+200B> [2] secondary
## <U+200B> <U+200B> <U+200B> <U+200B> [3] higher
## 13 newspaper Read newspaper? dbl+lbl [0] no
## <U+200B> <U+200B> <U+200B> <U+200B> [1] yes
By default, look_for()
will look through both variable names and variables labels. Use labels = FALSE
to look only through variable names.
## pos variable label col_type values
## <chr> <chr> <chr> <chr> <chr>
## 13 newspaper Read newspaper? dbl+lbl [0] no
## <U+200B> <U+200B> <U+200B> <U+200B> [1] yes
## Nothing found. Sorry.
Similarly, the search is by default case insensitive. To make the search case sensitive, use ignore.case = FALSE
.
## pos variable label col_type values
## <chr> <chr> <chr> <chr> <chr>
## 1 Sepal.Length — dbl range: 4.3 - 7.9
## 2 Sepal.Width — dbl range: 2 - 4.4
## Nothing found. Sorry.
look_for()
look_for()
returns a detailed tibble which is summarized before printing. To deactivate default printing and see full results, simply use dplyr::as_tibble()
, dplyr::glimpse()
or even utils::View()
.
## # A tibble: 5 x 13
## pos variable label col_type class type levels value_labels na_values
## <int> <chr> <chr> <chr> <nam> <chr> <name> <named list> <named l>
## 1 1 Sepal.L~ <NA> dbl <chr~ doub~ <NULL> <NULL> <NULL>
## 2 2 Sepal.W~ <NA> dbl <chr~ doub~ <NULL> <NULL> <NULL>
## 3 3 Petal.L~ Leng~ dbl <chr~ doub~ <NULL> <NULL> <NULL>
## 4 4 Petal.W~ Widt~ dbl <chr~ doub~ <NULL> <NULL> <NULL>
## 5 5 Species <NA> fct <chr~ inte~ <chr ~ <NULL> <NULL>
## # ... with 4 more variables: na_range <named list>, unique_values <int>,
## # n_na <int>, range <named list>
## Rows: 5
## Columns: 13
## $ pos <int> 1, 2, 3, 4, 5
## $ variable <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal...
## $ label <chr> NA, NA, "Length of petal", "Width of Petal", NA
## $ col_type <chr> "dbl", "dbl", "dbl", "dbl", "fct"
## $ class <named list> ["numeric", "numeric", "numeric", "numeric", "...
## $ type <chr> "double", "double", "double", "double", "integer"
## $ levels <named list> [NULL, NULL, NULL, NULL, <"setosa", "versicolo...
## $ value_labels <named list> [NULL, NULL, NULL, NULL, NULL]
## $ na_values <named list> [NULL, NULL, NULL, NULL, NULL]
## $ na_range <named list> [NULL, NULL, NULL, NULL, NULL]
## $ unique_values <int> 35, 23, 43, 22, 3
## $ n_na <int> 0, 0, 0, 0, 0
## $ range <named list> [<4.3, 7.9>, <2.0, 4.4>, <1.0, 6.9>, <0.1, 2.5...
The tibble returned by look_for()
could be easily manipulated for advanced programming.
When a column has several values for one variable (e.g. levels
or value_labels
), results as stored with nested named list. You can convert named lists into simpler character vectors, you can use convert_list_columns_to_character()
.
## # A tibble: 5 x 13
## pos variable label col_type class type levels value_labels na_values
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Sepal.L~ <NA> dbl nume~ doub~ "" "" ""
## 2 2 Sepal.W~ <NA> dbl nume~ doub~ "" "" ""
## 3 3 Petal.L~ Leng~ dbl nume~ doub~ "" "" ""
## 4 4 Petal.W~ Widt~ dbl nume~ doub~ "" "" ""
## 5 5 Species <NA> fct fact~ inte~ "seto~ "" ""
## # ... with 4 more variables: na_range <chr>, unique_values <int>, n_na <int>,
## # range <chr>
Alternatively, you can use lookfor_to_long_format()
to transform results into a long format with one row per factor level and per value label.
## # A tibble: 7 x 13
## pos variable label col_type class type levels value_labels na_values
## <int> <chr> <chr> <chr> <nam> <chr> <chr> <chr> <named l>
## 1 1 Sepal.L~ <NA> dbl <chr~ doub~ <NA> <NA> <NULL>
## 2 2 Sepal.W~ <NA> dbl <chr~ doub~ <NA> <NA> <NULL>
## 3 3 Petal.L~ Leng~ dbl <chr~ doub~ <NA> <NA> <NULL>
## 4 4 Petal.W~ Widt~ dbl <chr~ doub~ <NA> <NA> <NULL>
## 5 5 Species <NA> fct <chr~ inte~ setosa <NA> <NULL>
## 6 5 Species <NA> fct <chr~ inte~ versi~ <NA> <NULL>
## 7 5 Species <NA> fct <chr~ inte~ virgi~ <NA> <NULL>
## # ... with 4 more variables: na_range <named list>, unique_values <int>,
## # n_na <int>, range <named list>
Both can be combined:
## # A tibble: 41 x 13
## pos variable label col_type class type levels value_labels na_values
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 id_woman Woma~ dbl nume~ doub~ <NA> <NA> ""
## 2 2 id_hous~ Hous~ dbl nume~ doub~ <NA> <NA> ""
## 3 3 weight Samp~ dbl nume~ doub~ <NA> <NA> ""
## 4 4 intervi~ Inte~ date Date doub~ <NA> <NA> ""
## 5 5 date_of~ Date~ date Date doub~ <NA> <NA> ""
## 6 6 age Age ~ dbl nume~ doub~ <NA> <NA> ""
## 7 7 residen~ Urba~ dbl+lbl have~ doub~ <NA> [1] urban ""
## 8 7 residen~ Urba~ dbl+lbl have~ doub~ <NA> [2] rural ""
## 9 8 region Regi~ dbl+lbl have~ doub~ <NA> [1] North ""
## 10 8 region Regi~ dbl+lbl have~ doub~ <NA> [2] East ""
## # ... with 31 more rows, and 4 more variables: na_range <chr>,
## # unique_values <int>, n_na <int>, range <chr>
If you just want to use the search feature of look_for()
without computing the details of each variable, simply indicate details = FALSE
.
## pos variable label
## <int> <chr> <chr>
## 1 id_woman Woman Id
## 2 id_household Household Id
## 7 residency Urban / rural residency
## 16 ideal_nb_children Ideal number of children