Small Multiples

Duncan Garmonsway

2018-01-20

This vignette for the unpivotr package demonstrates unpivoting multiple similar tables from a spreadsheet via the tidyxl package. It is best read with the spreadsheet open in a spreadsheet program, e.g. Excel, LibreOffice Calc or Gnumeric.

Introduction

The spreadsheet is from the famous Enron subpoena, made available by Felienne Hermans, and has has previously been publicised by Jenny Bryan and David Robinson, in particular in Robinson’s article ‘Tidying an untidyable dataset’.

Here’s a screenshot:

knitr::include_graphics("enron-screenshot.png")

Preparation

This vignette uses several common packages.

library(unpivotr)
library(tidyxl)
library(dplyr)
library(tidyr)
library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:rvest':
## 
##     pluck
library(stringr)

The spreadsheet is distributed with the unpivotr package, so can be loaded as a system file.

path <- system.file("extdata/enron.xlsx", package = "unpivotr")

Main

Importing the data

Spreadsheet cells are imported with the xlsx_cells() function, which returns a data frame of all the cells in all the requested sheets. By default, every sheet is imported, but we don’t have to worry about that in this case because there is only one sheet in the file.

Cell formatting isn’t required for this vignette, but if it were, it would be imported via xlsx_formats(path).

Importing one of the multiples

The small multiples each have exactly one ‘Fixed Price’ header, so begin by selecting one of those.

From that single cell, selct the three rows of the column headers of the first small multiple. The split() function below separates each row from one another, wrapping them together in a list. They are separated so that they can individually be joined to the data cells later. Another way to do this is to select each row one by one, assigning them to different variables.

## $`14`
## # A tibble: 1 x 3
##     row   col value                  
##   <int> <int> <chr>                  
## 1    14    17 IF NWPL Rocky Mountains
## 
## $`15`
## # A tibble: 2 x 3
##     row   col value      
##   <int> <int> <chr>      
## 1    15    17 Fixed Price
## 2    15    19 Basis      
## 
## $`16`
## # A tibble: 4 x 3
##     row   col value
##   <int> <int> <chr>
## 1    16    17 BID  
## 2    16    18 OFFER
## 3    16    19 BID  
## 4    16    20 OFFER

Now select the data cells, starting from the ‘Fixed Price’ header again.

## # A tibble: 24 x 3
##      row   col  value
##    <int> <int>  <dbl>
##  1    18    17  2.06 
##  2    18    18  2.08 
##  3    19    17  2.40 
##  4    19    18  2.42 
##  5    19    19 -0.565
##  6    19    20 -0.545
##  7    20    17  2.59 
##  8    20    18  2.61 
##  9    20    19 -0.494
## 10    20    20 -0.474
## 11    21    17  2.58 
## 12    21    18  2.60 
## 13    21    19 -0.585
## 14    21    20 -0.565
## 15    22    17  3.36 
## 16    22    18  3.38 
## 17    22    19 -0.295
## 18    22    20 -0.275
## 19    23    17  2.63 
## 20    23    18  2.65 
## 21    23    19 -0.530
## 22    23    20 -0.510
## 23    17    18  1.91 
## 24    17    17  1.89

Finally, bind the data cells to the column headers (this is the real magic). For more examples of how the compass directions work (the NNW() and N() functions below), see the vignette called Compass Directions.

## # A tibble: 24 x 6
##      row   col value.data  i.value                 i.value.1 value.header
##    <int> <int> <chr>       <chr>                       <dbl> <chr>       
##  1    18    17 Fixed Price IF NWPL Rocky Mountains     2.06  BID         
##  2    18    18 Fixed Price IF NWPL Rocky Mountains     2.08  OFFER       
##  3    19    17 Fixed Price IF NWPL Rocky Mountains     2.40  BID         
##  4    19    18 Fixed Price IF NWPL Rocky Mountains     2.42  OFFER       
##  5    19    19 Basis       IF NWPL Rocky Mountains    -0.565 BID         
##  6    19    20 Basis       IF NWPL Rocky Mountains    -0.545 OFFER       
##  7    20    17 Fixed Price IF NWPL Rocky Mountains     2.59  BID         
##  8    20    18 Fixed Price IF NWPL Rocky Mountains     2.61  OFFER       
##  9    20    19 Basis       IF NWPL Rocky Mountains    -0.494 BID         
## 10    20    20 Basis       IF NWPL Rocky Mountains    -0.474 OFFER       
## # ... with 14 more rows

Importing every small multiple at once

The code above, for a single multiple, can easily be adapted to import every one of the small multiples. Here this is done using the purrr() package to apply the code to each element of a list of ‘Fixed Price’ header cells.

Get all ten ‘Fixed Price’ headers, and separate each into its own list element. Since each cell has a unique combination of row and col, that combination can be used to separate the cells into list elements.

Adapt the code for a single multiple a ‘tidy’ function to tidy a general small multiple, starting from the ‘Fixed Price’ header. Here this is done by substituting x for fixed_price, and wrapping the three sections of code in a function. Everything else is the same.

Finally, map the ‘tidy’ function to each ‘Fixed Price’ header, and bind the results. into one data frame. The purrr() package is used here, but this could also be done with the apply() family of functions.

## # A tibble: 240 x 6
##      row   col value.data  i.value                i.value.1 value.header
##    <int> <int> <chr>       <chr>                      <dbl> <chr>       
##  1    28     7 Fixed Price IF CIG Rocky Mountains      1.94 BID         
##  2    29     7 Fixed Price IF CIG Rocky Mountains      1.96 BID         
##  3    30     7 Fixed Price IF CIG Rocky Mountains      2.35 BID         
##  4    31     7 Fixed Price IF CIG Rocky Mountains      2.55 BID         
##  5    32     7 Fixed Price IF CIG Rocky Mountains      2.47 BID         
##  6    33     7 Fixed Price IF CIG Rocky Mountains      3.31 BID         
##  7    34     7 Fixed Price IF CIG Rocky Mountains      2.55 BID         
##  8    39     7 Fixed Price AECO / NIT                  2.38 BID         
##  9    40     7 Fixed Price AECO / NIT                  2.40 BID         
## 10    41     7 Fixed Price AECO / NIT                  2.55 BID         
## # ... with 230 more rows

Joining the row headers

So far, only the column headers have been joined, but there are also row headers on the left-hand side of the spreadsheet. The following code incorporates these into the final dataset.

Since the single column of row headers applies to every row of every small multiple (unlike the column headers), a global row_headers variable can be joined to each small multiple by using a simple W() join. This is incorporated into the definition of tidy() below (see the bottom line).

## # A tibble: 240 x 7
##      row   col value.data  i.value        i.value.1 value.header value    
##    <int> <int> <chr>       <chr>              <dbl> <chr>        <chr>    
##  1    28     7 Fixed Price IF CIG Rocky …      1.94 BID          Cash     
##  2    29     7 Fixed Price IF CIG Rocky …      1.96 BID          ROM      
##  3    30     7 Fixed Price IF CIG Rocky …      2.35 BID          Dec-01   
##  4    31     7 Fixed Price IF CIG Rocky …      2.55 BID          Dec-01 t…
##  5    32     7 Fixed Price IF CIG Rocky …      2.47 BID          Apr-02 t…
##  6    33     7 Fixed Price IF CIG Rocky …      3.31 BID          Nov-02 t…
##  7    34     7 Fixed Price IF CIG Rocky …      2.55 BID          One Year…
##  8    39     7 Fixed Price AECO / NIT          2.38 BID          Cash     
##  9    40     7 Fixed Price AECO / NIT          2.40 BID          ROM      
## 10    41     7 Fixed Price AECO / NIT          2.55 BID          Dec-01   
## # ... with 230 more rows

50-line code listing

library(unpivotr)
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)

path <- system.file("extdata/enron.xlsx",
                    package = "unpivotr")
cells <- xlsx_cells(path)

fixed_price <-
  cells %>%
  filter(character == "Fixed Price") %>%
  split(paste(.$row, .$col))

row_headers <-
  cells %>%
  filter(between(row, 17, 56), between(col, 2, 4), !is_blank) %>%
  # Concatenate rows like "Dec-01", "to", "Mar-02"
  mutate(character = ifelse(!is.na(character),
                            character,
                            format(date, "%b-%y"))) %>%
  select(row, col, character) %>%
  nest(-row) %>%
  mutate(value = map(data, ~ paste(.x$character, collapse = " ")),
         col = 2L) %>%
  select(row, col, value)

tidy <- function(x) {
  col_headers <-
    x %>%
    offset_N(cells, n = 1) %>%
    extend_E(cells, 3) %>%
    extend_S(cells, 2) %>%
    filter(!is_blank) %>%
    select(row, col, value = character) %>%
    split(.$row)
  datacells <-
    x %>%
    offset_S(cells, n = 2) %>%
    extend_E(cells, 4) %>%
    extend_S(cells,
             boundary = ~ is.na(numeric),
             edge = TRUE) %>%
    filter(!is_blank) %>%
    mutate(value = numeric) %>%
    select(row, col, value)
  datacells %>%
    NNW(col_headers[[1]]) %>%
    NNW(col_headers[[2]]) %>%
    N(col_headers[[3]]) %>%
    W(row_headers) # This is the only new line
}

map_df(fixed_price, tidy)