This package allows you to monitor changes in data as they get processed. It implements an easy-to-use and extensible logging framework, and comes with a few data loggers implemented.

This vignette will show you how to get started and what the default loggers do. The extending lumberjack vignette explains how to build your own loggers.

Installation

Install the package from CRAN and load it with

install.packages("lumberjack")
library(lumberjack)

The lumberjack workflow

So you want to know who does what to your data as it flows through your process. Here's the workflow that allows you to do it using the lumberjack package. (Note the use of %L>%!)

out <- women %L>%
  start_log() %L>%
  identity() %L>%
  head() %L>%
  dump_log()
## Dumped a log at /tmp/RtmptwQqdO/Rbuild4d9023d66089/lumberjack/vignettes/simple_log.csv
read.csv("simple_log.csv")
##   step                time expression changed
## 1    1 2018-07-20 10:01:00 identity()   FALSE
## 2    2 2018-07-20 10:01:00     head()    TRUE

Let's go through this step by step to see what happened. The start of the script defines an output variable out and passes women to the lumberjack (%L>%). Next, the function start_log makes sure that logging starts from there. We are now ready to start performing logged transformations on our dataset. First, we apply the identity function, which does exactly nothing. Then, the head function selects the first six rows of the dataset, and dump_log() writes the log to a csv file, which we then read in. After the log is dumped, logging stops automatically (by default).

The logging data consists of a step number, a timestamp, the expression evaluated to transform the data, and an indicator of whether the data changed at all. As expected, the identity function hasn't changed anything and the head function cuts off all records below the sixth row.
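To make the meaning of the changed column concrete, here is a minimal base R sketch of the same idea. The helper log_step() is hypothetical and not part of lumberjack, and comparing input and output with identical() is an assumption about how such a check could be done, not a description of the package internals.

```r
# Hypothetical helper (not part of lumberjack): apply f to x and append one
# row to a log that records whether the result differs from the input.
log_step <- function(x, f, log) {
  out <- f(x)
  entry <- data.frame(
    step       = nrow(log) + 1L,
    expression = deparse(substitute(f)),
    changed    = !identical(x, out),
    stringsAsFactors = FALSE
  )
  list(data = out, log = rbind(log, entry))
}

# Start with an empty log that has the right column types.
empty_log <- data.frame(step = integer(0), expression = character(0),
                        changed = logical(0), stringsAsFactors = FALSE)

s1 <- log_step(women, identity, empty_log)  # identity() leaves the data as is
s2 <- log_step(s1$data, head, s1$log)       # head() drops rows
s2$log
```

The resulting log marks the identity step as unchanged and the head step as changed, which mirrors the log shown above.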

By the way, the variable out contains the first six records of the women dataset as expected.

out
##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129

You have now seen the most important functions of the package: the lumberjack operator %L>%, start_log(), and dump_log().

All these functions are data-in, data-out. You are probably used to this from using dplyr or one of its siblings. However, the lumberjack functions are not limited to data.frame-like objects. In principle, changes to any object type can be logged, but it depends on the logger whether that will actually work – most will expect a particular data structure.

Changing the logger

Just tell start_log() which logger to use. In the example below we use the built-in cellwise logger. This logger needs a key column that uniquely identifies the rows, so we add that first. (The transformations use within, which is base R's counterpart to dplyr's mutate.)

logfile <- tempfile(fileext = ".csv") # where the logging info is written
women$a_key <- sprintf("W%02d", seq_len(nrow(women)))   # add a primary key to 'women'

# make the small example a bit smaller
wom <- head(women,5)

out <- wom %L>%
  start_log( log = cellwise$new(key="a_key") ) %L>%
  within(height <- sqrt(height)) %L>%
  within(weight <- weight*2) %L>%
  dump_log(file=logfile, stop=TRUE)
## Dumped a log at /tmp/Rtmp2bTkcR/file4da7161af9bc.csv
read.csv(logfile)
##    step                     time                     expression key
## 1     1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W01
## 2     1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W02
## 3     1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W03
## 4     1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W04
## 5     1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W05
## 6     2 2018-07-20 10:01:00 CEST   within(weight <- weight * 2) W01
## 7     2 2018-07-20 10:01:00 CEST   within(weight <- weight * 2) W02
## 8     2 2018-07-20 10:01:00 CEST   within(weight <- weight * 2) W03
## 9     2 2018-07-20 10:01:00 CEST   within(weight <- weight * 2) W04
## 10    2 2018-07-20 10:01:00 CEST   within(weight <- weight * 2) W05
##    variable old        new
## 1    height  58   7.615773
## 2    height  59   7.681146
## 3    height  60   7.745967
## 4    height  61   7.810250
## 5    height  62   7.874008
## 6    weight 115 230.000000
## 7    weight 117 234.000000
## 8    weight 120 240.000000
## 9    weight 123 246.000000
## 10   weight 126 252.000000
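If you want to see what a cell-by-cell comparison amounts to, the following base R sketch builds a similar long table by hand. The function cell_diff() is hypothetical and not part of lumberjack. It assumes both data frames have the same rows in the same order, contain no missing values, and share a key column.

```r
# Hypothetical helper (not part of lumberjack): list every cell whose value
# differs between two versions of a data frame, one row per changed cell.
cell_diff <- function(old, new, key) {
  vars <- setdiff(names(old), key)
  rows <- lapply(vars, function(v) {
    changed <- which(old[[v]] != new[[v]])
    data.frame(
      key      = old[[key]][changed],
      variable = rep(v, length(changed)),
      old      = old[[v]][changed],
      new      = new[[v]][changed],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

small <- head(women, 3)
small$a_key <- sprintf("W%02d", seq_len(nrow(small)))  # primary key
after <- within(small, height <- sqrt(height))

diffs <- cell_diff(small, after, key = "a_key")
diffs
```

Only the height cells show up in diffs, because weight was left untouched.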

Available loggers

Here's a short overview of known loggers.

In the lumberjack package

In other packages

See the extending lumberjack vignette on how to build your own loggers.

One more example: the expression logger

The expression logger allows you to log the result of one or more expressions that will be evaluated after each data processing step. For example, suppose we want to follow the mean and standard deviation of height in the women dataset as it gets processed.

logger <- expression_logger$new(mnh = mean(height), sdh = sd(height))
# use '=' inside transform(); with '<-' the columns would stay unchanged
out <- women %L>%
  start_log(logger) %L>%
  transform(height = height*2.54) %L>% # height in cm
  transform(weight = weight*0.453592) %L>% # weight in kg
  dump_log()
## Dumped a log at expression_log.csv
read.csv("expression_log.csv",stringsAsFactors = FALSE)
##   step                            expression   mnh      sdh
## 1    1     transform(height = height * 2.54) 165.1 11.35923
## 2    2 transform(weight = weight * 0.453592) 165.1 11.35923
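The idea of evaluating a fixed set of expressions after every step can be sketched in base R as follows. The names tracked and track() are made up for this illustration and this is not how the package implements the expression logger. Note that transform() only modifies a column when the new value is given as a named argument (with =).

```r
# Expressions to follow, stored unevaluated; the names become column names.
tracked <- list(mnh = quote(mean(height)), sdh = quote(sd(height)))

# Evaluate every tracked expression with the columns of `data` in scope.
track <- function(data) {
  vapply(tracked, function(e) eval(e, data), numeric(1))
}

step0 <- women
step1 <- transform(step0, height = height * 2.54)      # height in cm
step2 <- transform(step1, weight = weight * 0.453592)  # weight in kg

tracked_values <- rbind(start = track(step0),
                        step1 = track(step1),
                        step2 = track(step2))
tracked_values
```

The second step leaves both tracked values as they were, since it only touches weight.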

Changing logger behaviour

There are two ways to change how a logger behaves: by setting options at initialization, and by setting options when dumping a log.

Setting options for the logger

The start_log function adds a logging object as an attribute to its input data. By default, this is the simple logger, which only checks whether data has changed at all. The behavior of this logger can be changed by passing options when it is created. To see this, have a look at the complete call, as it is executed by default.

dat <- start_log(women, log = simple$new())

The expression simple$new() creates a new logging object, and start_log makes sure it is attached as an attribute to the copy of the women dataset stored in dat. The simple logger has one option called verbose, which can be set when calling $new. The default is TRUE; here we set it to FALSE.

dat <- start_log(women, log=simple$new(verbose=FALSE))

The effect is that no message is printed when the log is dumped to file.

out <- dat %L>% identity() %L>% dump_log()
read.csv("simple_log.csv")
##   step                time expression changed
## 1    1 2018-07-20 10:01:01 identity()   FALSE

Note that the available options depend on the logger you use. Look at the logger's help file (?simple, ?cellwise) to see all options.

Setting options for the output

For the simple logger, the default output file is simple_log.csv. This can be changed when calling dump_log.

out <- dat %L>% 
  start_log() %L>% 
  identity() %L>% 
  dump_log(file="log_all_day.csv")
## Dumped a log at /tmp/RtmptwQqdO/Rbuild4d9023d66089/lumberjack/vignettes/log_all_day.csv
read.csv("log_all_day.csv")
##   step                time expression changed
## 1    1 2018-07-20 10:01:01 identity()   FALSE

The function dump_log passes most of its arguments to the logger's $dump() method. See the help file of the logger for the options (?simple, ?cellwise).
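Both kinds of options can be combined. The sketch below uses only calls that appear elsewhere in this vignette: a silent simple logger and a log written to a temporary file, so that nothing ends up in your working directory. It assumes the lumberjack package is installed.

```r
library(lumberjack)

# Write the log to a temporary file instead of the working directory.
logfile <- tempfile(fileext = ".csv")

out <- women %L>%
  start_log(simple$new(verbose = FALSE)) %L>%  # option set at initialization
  head() %L>%
  dump_log(file = logfile)                     # option set when dumping

log <- read.csv(logfile)
log
```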

Options for other loggers

Loggers can come in different forms. In principle, authors are free to use R6 classes (as is done here), Reference classes, or anything else that follows the lumberjack API. This means that the way that logging objects are initialized may vary from logger to logger. Check the documentation of a logger to see how to operate it. Maintainers of packages that offer loggers that work with lumberjack are kindly requested to list lumberjack in the Enhances field of the DESCRIPTION file, so they can be found through lumberjack's CRAN page.
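To give an impression of what a logger written as a Reference class could look like, here is a toy example. Treat it as a sketch under assumptions: the class name counter and its field n are made up, and the add(meta, input, output) signature reflects our reading of the extending lumberjack vignette, which remains the authoritative description of the API. The logger is driven by hand here, so the example does not depend on the lumberjack operator.

```r
library(methods)

# Toy logger as a Reference class (hypothetical, for illustration only).
# It counts the processing steps that changed the data.
counter <- setRefClass("counter",
  fields = list(n = "numeric"),
  methods = list(
    initialize = function(...) {
      n <<- 0
      callSuper(...)
    },
    add = function(meta, input, output) {
      # assumed signature; meta is not used in this toy example
      if (!identical(input, output)) n <<- n + 1
    },
    dump = function(file = stdout(), ...) {
      cat("steps that changed the data:", n, "\n", file = file)
    }
  )
)

# Drive the logger by hand, one call to add() per processing step.
logger <- counter$new()
step1 <- identity(women)
logger$add(meta = list(), input = women, output = step1)
step2 <- head(step1)
logger$add(meta = list(), input = step1, output = step2)
logger$dump()
```

Whether such an object can be passed to start_log() as is depends on the API requirements of the lumberjack version you use, so check the extending lumberjack vignette before relying on it.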

Properties of the lumberjack

There are several function composition ('pipe') operators in the R community, provided by packages including magrittr, pipeR and yapo. All have different behavior.

The lumberjack operator behaves as a simplified version of the magrittr pipe operator. Here are some examples.

# pass the first argument to a function
1:3 %L>% mean()

# pass arguments using "."
TRUE %L>% mean(c(1,NA,3), na.rm = .)

# pass arguments to an expression, using "."
1:3 %L>% { 3 * .}

# in a more complicated expression, return "." explicitly
women %L>% { .$height <- 2*.$height; . }
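To see how an operator can pass its left-hand side as the first argument and compare input with output along the way, consider this toy operator. The name %P>% is hypothetical and the code is not the lumberjack implementation. It only supports plain function calls on the right-hand side, so the "." placeholder and braced expressions shown above are not handled.

```r
# Hypothetical operator, for illustration only; not part of lumberjack.
`%P>%` <- function(lhs, rhs) {
  call <- substitute(rhs)  # unevaluated right-hand side, e.g. mean()
  stopifnot(is.call(call))
  # Rebuild the call with a placeholder symbol as its first argument.
  call <- as.call(c(list(call[[1]], quote(.lhs)), as.list(call)[-1]))
  out <- eval(call, list(.lhs = lhs), parent.frame())
  # Report whether this step changed the data.
  message(deparse(substitute(rhs)), ": changed = ", !identical(lhs, out))
  out
}

a <- 1:3 %P>% mean()
b <- c(1, NA, 3) %P>% mean(na.rm = TRUE)
```

A real logging operator would hand the input, the output and the expression to a logger object instead of printing a message.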

The main differences with magrittr are that

Logging changes on non-data.frame objects

This is possible, but the logger has to support it. The simple logger works for any object, but the cellwise logger works on data.frame-like objects only.

out <- 1:3 %L>% 
  start_log() %L>%
  {.*2} %L>%
  dump_log(file="foo.csv")
## Dumped a log at /tmp/RtmptwQqdO/Rbuild4d9023d66089/lumberjack/vignettes/foo.csv
print(out)
## [1] 2 4 6
read.csv("foo.csv")
##   step                time      expression changed
## 1    1 2018-07-20 10:01:01 {\n    . * 2\n}    TRUE