This package allows you to monitor changes in data as they get processed. It implements an easy-to-use and extensible logging framework, and comes with a few data loggers implemented.
This vignette will show you how to get started and what the default loggers do. The extending lumberjack vignette explains how to build your own loggers.
Install the package with
install.packages("lumberjack")
So you want to know who does what to your data as it flows through your
process. Here's the workflow that allows you to do it using the lumberjack
package. (Note the use of %L>%!)
out <- women %L>%
  start_log() %L>%
  identity() %L>%
  head() %L>%
  dump_log()
## Dumped a log at /tmp/RtmptwQqdO/Rbuild4d9023d66089/lumberjack/vignettes/simple_log.csv
read.csv("simple_log.csv")
## step time expression changed
## 1 1 2018-07-20 10:01:00 identity() FALSE
## 2 2 2018-07-20 10:01:00 head() TRUE
Let's go through this step by step to see what happened. The start of the script
defines an output variable out and passes women to the lumberjack (%L>%).
Next, the function start_log makes sure that logging starts from there. We are
now ready to start performing logged transformations on our dataset. First, we
apply the identity function, which does exactly nothing. Then, the head
function selects the first six rows of the dataset, and dump_log()
writes the log to a csv file, which we then read in. After the log is dumped,
logging stops automatically (by default).
The logging data consists of a step number, a timestamp, the expression
evaluated to transform the data, and an indicator of whether the data changed
at all. As expected, the identity function hasn't changed anything and the
head function cuts off all records below the sixth row.
By the way, the variable out contains the first six records of the women
dataset, as expected.
out
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
You have now seen the most important functions of the package. Let's summarize them.
- start_log(data, log): start logging, possibly using a custom logger (see next section).
- %L>%: the lumberjack, a logging-aware function composition operator ('pipe').
- dump_log(data, stop, ...): dump the log for data (if present).
- stop_log(data): stop logging.
All these functions are data-in, data-out. You are probably used to this from
using dplyr or one of its siblings. However, the lumberjack functions are not
limited to data.frame-like objects. In principle, changes to any object type can
be logged, but it depends on the logger whether that will actually work; most
will expect a particular data structure.
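To make the idea of a logging-aware pipe more concrete, here is a minimal base-R sketch. It is not lumberjack's implementation: the operator %P>%, the helper demo_start() and the attribute name demo_log are hypothetical names invented for this illustration. The sketch assumes that the log travels as an attribute on the data, is taken off before each step, and is put back on the result afterwards.

```r
# Hypothetical toy pipe; NOT lumberjack's actual code.
`%P>%` <- function(lhs, rhs) {
  log <- attr(lhs, "demo_log")        # NULL when logging has not started
  attr(lhs, "demo_log") <- NULL       # the step itself sees plain data
  call <- substitute(rhs)             # e.g. head(n = 3)
  args <- as.list(call)[-1]           # arguments supplied by the user
  # Rebuild the call with the left-hand side as first argument
  new_call <- as.call(c(list(call[[1]], quote(.lhs)), args))
  out <- eval(new_call, list(.lhs = lhs), parent.frame())
  if (!is.null(log)) {
    entry <- data.frame(
      expression = paste(deparse(call), collapse = " "),
      changed    = !identical(lhs, out),
      stringsAsFactors = FALSE
    )
    attr(out, "demo_log") <- rbind(log, entry)  # re-attach the updated log
  }
  out
}

# Hypothetical counterpart of start_log(): attach an empty log
demo_start <- function(data) {
  attr(data, "demo_log") <- data.frame(
    expression = character(0), changed = logical(0),
    stringsAsFactors = FALSE
  )
  data
}

res <- women %P>% demo_start() %P>% identity() %P>% head(n = 3)
attr(res, "demo_log")
```

Every step takes data in and returns data, which is why the log can ride along without the transforming functions knowing about it.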
To use a different logger, just tell start_log() which one to use. In the
example below we use the built-in cellwise logger. For this logger it is
necessary to have a key column that identifies the rows uniquely, so we add
that first (we use within here, which is base R's equivalent of dplyr's mutate).
logfile <- tempfile(fileext = ".csv") # where the logging info is written
women$a_key <- sprintf("W%02d", seq_len(nrow(women))) # add a primary key to 'women'
# make the small example a bit smaller
wom <- head(women,5)
out <- wom %L>%
  start_log( log = cellwise$new(key="a_key") ) %L>%
  within(height <- sqrt(height)) %L>%
  within(weight <- weight*2) %L>%
  dump_log(file=logfile, stop=TRUE)
## Dumped a log at /tmp/Rtmp2bTkcR/file4da7161af9bc.csv
read.csv(logfile)
## step time expression key
## 1 1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W01
## 2 1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W02
## 3 1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W03
## 4 1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W04
## 5 1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W05
## 6 2 2018-07-20 10:01:00 CEST within(weight <- weight * 2) W01
## 7 2 2018-07-20 10:01:00 CEST within(weight <- weight * 2) W02
## 8 2 2018-07-20 10:01:00 CEST within(weight <- weight * 2) W03
## 9 2 2018-07-20 10:01:00 CEST within(weight <- weight * 2) W04
## 10 2 2018-07-20 10:01:00 CEST within(weight <- weight * 2) W05
## variable old new
## 1 height 58 7.615773
## 2 height 59 7.681146
## 3 height 60 7.745967
## 4 height 61 7.810250
## 5 height 62 7.874008
## 6 weight 115 230.000000
## 7 weight 117 234.000000
## 8 weight 120 240.000000
## 9 weight 123 246.000000
## 10 weight 126 252.000000
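To see what a per-cell record like the one above amounts to, here is a small base-R sketch that compares two versions of a data frame by key. It is an illustration only, not the cellwise logger's implementation; the helper name cell_changes() is hypothetical, and the sketch assumes that both versions contain the same keys and columns and no missing values.

```r
# Hypothetical helper; NOT part of lumberjack.
cell_changes <- function(old, new, key) {
  vars <- setdiff(names(old), key)
  # Align the rows of 'new' with those of 'old' using the key column
  new <- new[match(old[[key]], new[[key]]), , drop = FALSE]
  rows <- lapply(vars, function(v) {
    changed <- which(old[[v]] != new[[v]])   # cells whose value differs
    data.frame(
      key      = old[[key]][changed],
      variable = rep(v, length(changed)),
      old      = old[[v]][changed],
      new      = new[[v]][changed],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

before <- data.frame(
  a_key  = c("W01", "W02", "W03"),
  height = c(58, 59, 60),
  weight = c(115, 117, 120),
  stringsAsFactors = FALSE
)
after <- within(before, weight <- weight * 2)

cell_changes(before, after, "a_key")
```

Only the weight cells are reported, one row per changed cell. This is also why a unique key is needed: without it there is no reliable way to match a row before the step with the same row after it.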
Here's a short overview of known loggers.
- simple: Just check whether data has changed.
- cellwise: Track changes per cell (including old value and new value).
- filedump: Dump a file after each step (including the zeroth step).
- expression_logger: Track the result of any expression.
- validate::lbj_rules: Track changes in data quality measured by validation rules (validate version >= 0.2.0).
- validate::lbj_cells: Track changes in cell filling and cell counts (validate version >= 0.2.0).
- daff::lbj_daff: Use data-diff to track changes in data frame-like objects (daff version >= 3.3).
See the extending lumberjack vignette on how to build your own loggers.
The expression logger allows you to log the result of one or more expressions that will
be evaluated after each data processing step. For example, suppose we want to follow the
mean and standard deviation of the height variable in the women dataset as it gets processed.
logger <- expression_logger$new(mnh = mean(height), sdh = sd(height))
out <- women %L>%
  start_log(logger) %L>%
  transform(height = height*2.54) %L>%     # height in cm; transform() needs '=', not '<-'
  transform(weight = weight*0.453592) %L>% # weight in kg
  dump_log()
## Dumped a log at expression_log.csv
read.csv("expression_log.csv",stringsAsFactors = FALSE)
## step expression mnh sdh
## 1 1 transform(height = height * 2.54) 165.1 11.35923
## 2 2 transform(weight = weight * 0.453592) 165.1 11.35923
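A note on base R's transform() when it is used in a logged pipeline: columns are only modified when they are passed as named arguments (with =). An assignment with <- inside transform() is evaluated and then discarded, so the data comes back unchanged and a logged statistic such as the mean stays the same. The following base-R sketch, which does not use lumberjack, shows the difference. The object names are for illustration only.

```r
df <- head(women, 3)

# '<-' inside transform(): evaluated, but the result is not stored in a column
a <- transform(df, height <- height * 2.54)

# '=' inside transform(): the column is replaced
b <- transform(df, height = height * 2.54)

# within() evaluates assignments inside the data, so '<-' works here
w <- within(df, height <- height * 2.54)

mean(a$height)  # still in inches
mean(b$height)  # in centimetres
mean(w$height)  # in centimetres
```

If a log reports identical statistics before and after a step that should have changed the data, this is one of the first things to check.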
There are two ways to change how a logger behaves: by setting options at initialization and by setting options when dumping a log.
The start_log function adds a logging object as an attribute to its input
data. By default, this is the simple logger, which only checks whether data
has changed at all. The behavior of this logger can be changed by
passing options when it is created. To see this, have a look at the complete
call, as it is executed by default.
dat <- start_log(women, log = simple$new())
The expression simple$new() creates a new logging object, and start_log
makes sure it is attached as an attribute to the copy of the women dataset
stored in dat. The simple logger has one option, called verbose, which can be
set when calling $new. The default is TRUE; here we set it to FALSE.
dat <- start_log(women, log=simple$new(verbose=FALSE))
The effect is that no message is printed when the log is dumped to file.
out <- dat %L>% identity() %L>% dump_log()
read.csv("simple_log.csv")
## step time expression changed
## 1 1 2018-07-20 10:01:01 identity() FALSE
Note that the available options depend on the logger you use. Look at the logger's
help file (?simple, ?cellwise) to see all options.
For the simple logger, the default output file is simple_log.csv. This can be
changed when calling dump_log.
out <- dat %L>%
  start_log() %L>%
  identity() %L>%
  dump_log(file="log_all_day.csv")
## Dumped a log at /tmp/RtmptwQqdO/Rbuild4d9023d66089/lumberjack/vignettes/log_all_day.csv
read.csv("log_all_day.csv")
## step time expression changed
## 1 1 2018-07-20 10:01:01 identity() FALSE
The function dump_log passes most of its arguments to the logger's $dump()
method. See the help file of the logger for the options (?simple, ?cellwise).
Loggers can come in different forms. In principle, authors are free to use R6
classes (as is done here), Reference classes, or anything else that follows the
lumberjack API. This means that the way that logging objects are initialized may
vary from logger to logger. Check the documentation of a logger to see how to
operate it. Maintainers of packages that offer loggers that work with
lumberjack are kindly requested to list lumberjack in the Enhances field
of the DESCRIPTION file, so they can be found through lumberjack's CRAN page.
There are several function composition ('pipe') operators in the R community, including magrittr, pipeR and yapo. All have different behavior.
The lumberjack operator behaves as a simplified version of the magrittr pipe
operator. Here are some examples.
# pass the first argument to a function
1:3 %L>% mean()
# pass arguments using "."
TRUE %L>% mean(c(1,NA,3), na.rm = .)
# pass arguments to an expression, using "."
1:3 %L>% { 3 * .}
# in a more complicated expression, return "." explicitly
women %L>% { .$height <- 2*.$height; . }
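The main differences with magrittr are that
- there is no assignment pipe like %<>%;
- you cannot define functions in the magrittr style, like a <- . %>% sin(.);
- you cannot write things like pi %>% sin and expect an answer.
Logging changes in objects that are not data frames is possible, but the logger
has to support it. The simple logger works for any object, but the cellwise
logger works on data.frame-like objects only.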
out <- 1:3 %L>%
  start_log() %L>%
  {.*2} %L>%
  dump_log(file="foo.csv")
## Dumped a log at /tmp/RtmptwQqdO/Rbuild4d9023d66089/lumberjack/vignettes/foo.csv
print(out)
## [1] 2 4 6
read.csv("foo.csv")
## step time expression changed
## 1 1 2018-07-20 10:01:01 {\n . * 2\n} TRUE