# Small Multiples

#### 2018-01-20

This vignette for the unpivotr package demonstrates unpivoting multiple similar tables from a spreadsheet via the tidyxl package. It is best read with the spreadsheet open in a spreadsheet program, e.g. Excel, LibreOffice Calc or Gnumeric.

## Introduction

The spreadsheet is from the famous Enron subpoena, made available by Felienne Hermans, and has has previously been publicised by Jenny Bryan and David Robinson, in particular in Robinson’s article ‘Tidying an untidyable dataset’.

Here’s a screenshot:

knitr::include_graphics("enron-screenshot.png")

## Preparation

This vignette uses several common packages.

library(unpivotr)
library(tidyxl)
library(dplyr)
library(tidyr)
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:rvest':
##
##     pluck
library(stringr)

The spreadsheet is distributed with the unpivotr package, so can be loaded as a system file.

path <- system.file("extdata/enron.xlsx", package = "unpivotr")

## Main

### Importing the data

Spreadsheet cells are imported with the xlsx_cells() function, which returns a data frame of all the cells in all the requested sheets. By default, every sheet is imported, but we don’t have to worry about that in this case because there is only one sheet in the file.

cells <- xlsx_cells(path)

Cell formatting isn’t required for this vignette, but if it were, it would be imported via xlsx_formats(path).

formatting <- xlsx_formats(path)

### Importing one of the multiples

The small multiples each have exactly one ‘Fixed Price’ header, so begin by selecting one of those.

fixed_price <- filter(cells, character == "Fixed Price")[1, ]

From that single cell, selct the three rows of the column headers of the first small multiple. The split() function below separates each row from one another, wrapping them together in a list. They are separated so that they can individually be joined to the data cells later. Another way to do this is to select each row one by one, assigning them to different variables.

col_headers <-
fixed_price %>%
offset_N(cells, n = 1) %>%    # Offset up one row to "IF NWPL Rocky Mountains"
extend_E(cells, 3) %>%        # Extend to the right edge of the table
extend_S(cells, 2) %>%        # Extend down to the third row of the headers
filter(!is_blank) %>%   # Remove blanks
select(row, col, value = character) %>% # Prepare for joining to data cells
split(.$row) # Separate the row of headers into list elements col_headers ##$14
## # A tibble: 1 x 3
##     row   col value
##   <int> <int> <chr>
## 1    14    17 IF NWPL Rocky Mountains
##
## $15 ## # A tibble: 2 x 3 ## row col value ## <int> <int> <chr> ## 1 15 17 Fixed Price ## 2 15 19 Basis ## ##$16
## # A tibble: 4 x 3
##     row   col value
##   <int> <int> <chr>
## 1    16    17 BID
## 2    16    18 OFFER
## 3    16    19 BID
## 4    16    20 OFFER

Now select the data cells, starting from the ‘Fixed Price’ header again.

datacells <-
fixed_price %>%
offset_S(cells, n = 2) %>%
extend_E(cells, 4) %>%
extend_S(cells,                       # Extend down to a blank row
boundary = ~ is_blank, # The formula detects blank cells
edge = TRUE) %>%          # Require the whole row to be blank
filter(!is_blank) %>%           # Remove remaining blanks
mutate(value = as.double(numeric)) %>%# Convert the values to double
select(row, col, value)               # Prepare for joining to headers
print(datacells, n = Inf)
## # A tibble: 24 x 3
##      row   col  value
##    <int> <int>  <dbl>
##  1    18    17  2.06
##  2    18    18  2.08
##  3    19    17  2.40
##  4    19    18  2.42
##  5    19    19 -0.565
##  6    19    20 -0.545
##  7    20    17  2.59
##  8    20    18  2.61
##  9    20    19 -0.494
## 10    20    20 -0.474
## 11    21    17  2.58
## 12    21    18  2.60
## 13    21    19 -0.585
## 14    21    20 -0.565
## 15    22    17  3.36
## 16    22    18  3.38
## 17    22    19 -0.295
## 18    22    20 -0.275
## 19    23    17  2.63
## 20    23    18  2.65
## 21    23    19 -0.530
## 22    23    20 -0.510
## 23    17    18  1.91
## 24    17    17  1.89

Finally, bind the data cells to the column headers (this is the real magic). For more examples of how the compass directions work (the NNW() and N() functions below), see the vignette called Compass Directions.

datacells %>%
NNW(col_headers[[1]]) %>% # This header isn't in every column
NNW(col_headers[[2]]) %>%  # Nor is this header
N(col_headers[[3]])    # But this one is
## # A tibble: 24 x 6
##      row   col value.data  i.value                 i.value.1 value.header
##    <int> <int> <chr>       <chr>                       <dbl> <chr>
##  1    18    17 Fixed Price IF NWPL Rocky Mountains     2.06  BID
##  2    18    18 Fixed Price IF NWPL Rocky Mountains     2.08  OFFER
##  3    19    17 Fixed Price IF NWPL Rocky Mountains     2.40  BID
##  4    19    18 Fixed Price IF NWPL Rocky Mountains     2.42  OFFER
##  5    19    19 Basis       IF NWPL Rocky Mountains    -0.565 BID
##  6    19    20 Basis       IF NWPL Rocky Mountains    -0.545 OFFER
##  7    20    17 Fixed Price IF NWPL Rocky Mountains     2.59  BID
##  8    20    18 Fixed Price IF NWPL Rocky Mountains     2.61  OFFER
##  9    20    19 Basis       IF NWPL Rocky Mountains    -0.494 BID
## 10    20    20 Basis       IF NWPL Rocky Mountains    -0.474 OFFER
## # ... with 14 more rows

### Importing every small multiple at once

The code above, for a single multiple, can easily be adapted to import every one of the small multiples. Here this is done using the purrr() package to apply the code to each element of a list of ‘Fixed Price’ header cells.

Get all ten ‘Fixed Price’ headers, and separate each into its own list element. Since each cell has a unique combination of row and col, that combination can be used to separate the cells into list elements.

fixed_price <-
cells %>%
filter(character == "Fixed Price") %>%
split(paste(.$row, .$col))

Adapt the code for a single multiple a ‘tidy’ function to tidy a general small multiple, starting from the ‘Fixed Price’ header. Here this is done by substituting x for fixed_price, and wrapping the three sections of code in a function. Everything else is the same.

tidy <- function(x) {
col_headers <-
x %>%
offset_N(cells, n = 1) %>%
extend_E(cells, 3) %>%
extend_S(cells, 2) %>%
filter(!is_blank) %>%
select(row, col, value = character) %>%
split(.$row) datacells <- x %>% offset_S(cells, n = 2) %>% extend_E(cells, 4) %>% extend_S(cells, boundary = ~ is.na(numeric), edge = TRUE) %>% filter(!is.na(numeric)) %>% select(row, col, value = numeric) datacells %>% NNW(col_headers[[1]]) %>% NNW(col_headers[[2]]) %>% N(col_headers[[3]]) } Finally, map the ‘tidy’ function to each ‘Fixed Price’ header, and bind the results. into one data frame. The purrr() package is used here, but this could also be done with the apply() family of functions. map_df(fixed_price, tidy) %>% arrange(col, row) # See that, from row 39, the region changes, as it ought. ## # A tibble: 240 x 6 ## row col value.data i.value i.value.1 value.header ## <int> <int> <chr> <chr> <dbl> <chr> ## 1 28 7 Fixed Price IF CIG Rocky Mountains 1.94 BID ## 2 29 7 Fixed Price IF CIG Rocky Mountains 1.96 BID ## 3 30 7 Fixed Price IF CIG Rocky Mountains 2.35 BID ## 4 31 7 Fixed Price IF CIG Rocky Mountains 2.55 BID ## 5 32 7 Fixed Price IF CIG Rocky Mountains 2.47 BID ## 6 33 7 Fixed Price IF CIG Rocky Mountains 3.31 BID ## 7 34 7 Fixed Price IF CIG Rocky Mountains 2.55 BID ## 8 39 7 Fixed Price AECO / NIT 2.38 BID ## 9 40 7 Fixed Price AECO / NIT 2.40 BID ## 10 41 7 Fixed Price AECO / NIT 2.55 BID ## # ... with 230 more rows ### Joining the row headers So far, only the column headers have been joined, but there are also row headers on the left-hand side of the spreadsheet. The following code incorporates these into the final dataset. row_headers <- cells %>% filter(between(row, 17, 56), between(col, 2, 4), !is_blank) %>% # Concatenate rows like "Dec-01", "to", "Mar-02" mutate(character = ifelse(!is.na(character), character, format(date, origin="1899-12-30", "%b-%y"))) %>% select(row, col, character) %>% nest(-row) %>% mutate(data = map(data, ~ paste(.x$character, collapse = " "))) %>%
unnest() %>%
mutate(col = 2L) %>%
select(row, col, value = data)

Since the single column of row headers applies to every row of every small multiple (unlike the column headers), a global row_headers variable can be joined to each small multiple by using a simple W() join. This is incorporated into the definition of tidy() below (see the bottom line).

tidy <- function(x) {
col_headers <-
x %>%
offset_N(cells, n = 1) %>%
extend_E(cells, 3) %>%
extend_S(cells, 2) %>%
filter(!is_blank) %>%
select(row, col, value = character) %>%
split(.$row) datacells <- x %>% offset_S(cells, n = 2) %>% extend_E(cells, 4) %>% extend_S(cells, boundary = ~ is.na(numeric), edge = TRUE) %>% filter(!is_blank) %>% mutate(value = numeric) %>% select(row, col, value) datacells %>% NNW(col_headers[[1]]) %>% NNW(col_headers[[2]]) %>% N(col_headers[[3]]) %>% W(row_headers) # This is the only new line } map_df(fixed_price, tidy) %>% arrange(col, row) # See that, from row 39, the context loops, as it ought. ## # A tibble: 240 x 7 ## row col value.data i.value i.value.1 value.header value ## <int> <int> <chr> <chr> <dbl> <chr> <chr> ## 1 28 7 Fixed Price IF CIG Rocky … 1.94 BID Cash ## 2 29 7 Fixed Price IF CIG Rocky … 1.96 BID ROM ## 3 30 7 Fixed Price IF CIG Rocky … 2.35 BID Dec-01 ## 4 31 7 Fixed Price IF CIG Rocky … 2.55 BID Dec-01 t… ## 5 32 7 Fixed Price IF CIG Rocky … 2.47 BID Apr-02 t… ## 6 33 7 Fixed Price IF CIG Rocky … 3.31 BID Nov-02 t… ## 7 34 7 Fixed Price IF CIG Rocky … 2.55 BID One Year… ## 8 39 7 Fixed Price AECO / NIT 2.38 BID Cash ## 9 40 7 Fixed Price AECO / NIT 2.40 BID ROM ## 10 41 7 Fixed Price AECO / NIT 2.55 BID Dec-01 ## # ... with 230 more rows ## 50-line code listing library(unpivotr) library(dplyr) library(tidyr) library(purrr) library(stringr) path <- system.file("extdata/enron.xlsx", package = "unpivotr") cells <- xlsx_cells(path) fixed_price <- cells %>% filter(character == "Fixed Price") %>% split(paste(.$row, .$col)) row_headers <- cells %>% filter(between(row, 17, 56), between(col, 2, 4), !is_blank) %>% # Concatenate rows like "Dec-01", "to", "Mar-02" mutate(character = ifelse(!is.na(character), character, format(date, "%b-%y"))) %>% select(row, col, character) %>% nest(-row) %>% mutate(value = map(data, ~ paste(.x$character, collapse = " ")),
col = 2L) %>%
select(row, col, value)

tidy <- function(x) {
col_headers <-
x %>%
offset_N(cells, n = 1) %>%
extend_E(cells, 3) %>%
extend_S(cells, 2) %>%
filter(!is_blank) %>%
select(row, col, value = character) %>%
split(.\$row)
datacells <-
x %>%
offset_S(cells, n = 2) %>%
extend_E(cells, 4) %>%
extend_S(cells,
boundary = ~ is.na(numeric),
edge = TRUE) %>%
filter(!is_blank) %>%
mutate(value = numeric) %>%
select(row, col, value)
datacells %>%
NNW(col_headers[[1]]) %>%
NNW(col_headers[[2]]) %>%
N(col_headers[[3]]) %>%
W(row_headers) # This is the only new line
}

map_df(fixed_price, tidy)