Introduction to dcmodify

Mark van der Loo and Edwin de Jonge


A first statement

In the iris dataset, set Sepal.Width to 4 wherever it exceeds 4.

library(dcmodify); library(magrittr)  # magrittr supplies %<>% and %>%
iris %<>% modify_so( if (Sepal.Width > 4) Sepal.Width <- 4 )

Why this package

Data cleaning workflows or scripts typically contain a lot of ‘if this do that’ type of statements. Such statements are typically condensed expert knowledge. With this package, such ‘data modifying rules’ are taken out of the code and instead become parameters to the workflow. This allows you to maintain, document and reason about data modification rules separately from the flow of your programme.

This means you, the expert, can focus on the content and let R do the work.

Basic workflow

The workflow of dcmodify is designed to take two concerns off your hands. The first concern is how to implement the many ideas and rules that define how and when to modify data. The second concern is how to apply such rules to your data. We therefore introduce two nouns and one verb that govern the basic workflow.

Here’s an example using the retailers data set from the validate package.

data("retailers", package="validate")
head(retailers, 3)[3:10]
##   staff turnover other.rev total.rev staff.costs total.costs profit vat
## 1    75       NA        NA      1130          NA       18915  20045  NA
## 2     9     1607        NA      1607         131        1544     63  NA
## 3    NA     6886       -33      6919         324        6493    426  NA

First we define a set of modifying rules, using modifier.

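m <- modifier(
  if (other.rev < 0) other.rev <- -1 * other.rev
  # the condition is missing in the source; is.na(staff.costs) and na.rm=TRUE are assumed here
  , if ( is.na(staff.costs) ) staff.costs <- mean(staff.costs, na.rm=TRUE)
)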

Next, the rules can be applied to our data.

ret1 <- modify(retailers,m)

Alternatively, if you’re a fan of the magrittr package, you can do this

ret2 <- retailers %>% modify(m)

or even

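retailers %<>% modify_so(
  if ( other.rev < 0) other.rev <- -1 * other.rev
  # the condition is missing in the source; is.na(staff.costs) and na.rm=TRUE are assumed here
  , if ( is.na(staff.costs) ) staff.costs <- mean(staff.costs, na.rm=TRUE)
)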

Here, the %<>% operator makes sure that the original dataset gets overwritten, and modify_so is a shortcut function for defining modification rules in-line.

Handling missing values

The rules you define in a modifier are executed on records where the condition yields TRUE. In R this raises the question of what to do when the condition evaluates to NA for a record. For example, the condition

other.rev < 0

in the first rule of m above evaluates to NA in the first record of the retailers dataset. Such cases are handled by treating the condition as if it evaluated to FALSE, so the record is left unchanged.
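The convention can be illustrated with a few lines of base R. This is only a sketch of the idea, not the package's internal code; the toy data frame d and the names cond and sel are made up for illustration.

```r
# Toy data: other.rev is missing in the first record (hypothetical values).
d <- data.frame(other.rev = c(NA, 12, -33))

# The condition is evaluated for all records at once and may contain NA.
cond <- d$other.rev < 0

# Treat NA as FALSE: only records where the condition is TRUE are selected.
sel <- !is.na(cond) & cond

# Apply the assignment to the selected records only.
d$other.rev[sel] <- -1 * d$other.rev[sel]
d$other.rev   # NA 12 33
```

The first record keeps its NA, the second is untouched because the condition is FALSE, and only the third is modified.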

Importing and exporting rules from file

Performance and a glimpse under the hood

You, the user, can assume that the rules are evaluated record-by-record. In reality, the package analyses the rules and makes sure they can be evaluated in a vectorized manner. This way explicit (and slow) R-loops are avoided as much as possible.

In short, when you call modify or modify_so, the following steps are performed.

  1. The rules are transformed to statements that can be executed in a vectorized manner by R.
  2. If any macros are present, they are inserted into the statements.
  3. For each assignment, the conditions under which it should be executed are collected.
  4. The conditions are evaluated and assignments are executed on a selection of the data.
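To make steps 1, 3 and 4 concrete, here is a minimal base R sketch of how a single rule of the form `if (condition) variable <- value` could be applied without an explicit loop. The helper apply_rule is hypothetical and is not part of dcmodify; it ignores macros and handles only this one rule shape, so it should be read as an illustration of the idea rather than as the package's implementation.

```r
# Hypothetical helper (not part of dcmodify): apply one rule of the form
# `if (condition) variable <- value` to a data frame in a vectorized way.
apply_rule <- function(dat, rule) {
  stopifnot(is.call(rule), identical(rule[[1]], as.name("if")))
  condition  <- rule[[2]]   # e.g. other.rev < 0
  assignment <- rule[[3]]   # e.g. other.rev <- -1 * other.rev
  stopifnot(is.call(assignment), identical(assignment[[1]], as.name("<-")))
  target <- as.character(assignment[[2]])
  value  <- assignment[[3]]

  # Evaluate the condition for all records at once; NA counts as FALSE.
  sel <- eval(condition, dat)
  sel <- rep_len(!is.na(sel) & sel, nrow(dat))

  # Evaluate the right-hand side on all records, assign on the selection.
  new <- rep_len(eval(value, dat), nrow(dat))
  dat[[target]][sel] <- new[sel]
  dat
}

d <- data.frame(other.rev = c(NA, 12, -33))
d <- apply_rule(d, quote(if (other.rev < 0) other.rev <- -1 * other.rev))
d$other.rev   # NA 12 33
```

The rule is passed as an unevaluated expression, so the condition and the assignment can be taken apart and evaluated on whole columns instead of one record at a time.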

Difference with dplyr::mutate

The functionality of this package resembles dplyr::mutate, since it also allows one to specify data mutations on data frames (or other tabular data objects). The dplyr package is especially useful for interactive use, and for use in programming through the ‘underscored’ functions such as mutate_.

The dcmodify package has been developed with a production line in mind where similar data sets are processed frequently. By taking the modifying rules out of the software, R programmers can build an application that allows users who are less knowledgeable about programming to specify their modification rules.
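The idea of rules living outside the software can be sketched in base R: rules arrive as plain text, for instance from a file that a subject-matter expert maintains, and the application parses them before doing anything else. The rule text and the names rule_text, rules and touched below are made-up examples, and the sketch does not use or describe dcmodify's own facilities for reading rules.

```r
# Rules supplied as plain text, e.g. obtained with readLines() from a file
# maintained by a user. The rule text is a made-up example.
rule_text <- c(
  "if (other.rev < 0) other.rev <- -1 * other.rev",
  "if (staff < 0) staff <- 0"
)

# Parse the text into unevaluated R expressions; nothing is executed yet.
rules <- parse(text = rule_text)
length(rules)   # 2

# Because rules are data, the application can inspect them before use,
# for example to list the first variable each rule refers to.
touched <- vapply(rules, function(r) all.vars(r)[1], character(1))
touched         # "other.rev" "staff"
```

Keeping the rules as data means the programme that applies them does not have to change when an expert adds or edits a rule.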

Logging changes

It can be interesting to study the effect of a certain set of data modifying rules. The lumberjack package is capable of tracking changes in data.

To start logging data you need to replace the magrittr pipe (%>%) with the lumberjack operator %>>% and insert some logging commands into the pipeline.

library(lumberjack)
# add primary key so cellwise changes can be traced
women$id <- letters[1:15]

out <- women %>>%
  start_log( cellwise$new(key="id") ) %>>%
  modify_so( if (height < mean(height)) height <- mean(height) ) %>>%
  dump_log()
## Dumped a log at cellwise.csv
# The log is written to file.
read.csv("cellwise.csv") %>>% head()
##   step                     time
## 1    1 2018-07-30 16:42:05 CEST
## 2    1 2018-07-30 16:42:05 CEST
## 3    1 2018-07-30 16:42:05 CEST
## 4    1 2018-07-30 16:42:05 CEST
## 5    1 2018-07-30 16:42:05 CEST
## 6    1 2018-07-30 16:42:05 CEST
##                                                     expression key
## 1 modify_so(if (height < mean(height)) height <- mean(height))   a
## 2 modify_so(if (height < mean(height)) height <- mean(height))   b
## 3 modify_so(if (height < mean(height)) height <- mean(height))   c
## 4 modify_so(if (height < mean(height)) height <- mean(height))   e
## 5 modify_so(if (height < mean(height)) height <- mean(height))   f
## 6 modify_so(if (height < mean(height)) height <- mean(height))   g
##   variable old new
## 1   height  58  61
## 2   height  59  61
## 3   height  60  61
## 4   height  62  61
## 5   height  63  61
## 6   height  64  61

Current limitations

Conditional statements including else are not supported yet. Rules containing if() else are ignored with a warning.