In any large-scale experiment, many things can and will go wrong. The cluster might have an outage, jobs may run into resource limits or crash, subtle bugs in your code could be triggered, or some other error condition might arise. In these situations it is important to quickly determine what went wrong and to recompute only the minimal number of required jobs.

Therefore, before you submit anything, you should use testJob() to catch errors that are easy to spot because they are raised in many or all jobs. If external is set to TRUE, this function runs the job without side effects in an independent R process on your local machine via Rscript, similar to how it would run on a compute node, redirects the output of the process to your R console, loads the job result, and returns it. If you do not set external, the job is executed in the currently running R session, with the drawback that you might be unable to catch missing variable declarations or missing package dependencies, because objects and packages already present in your session can mask them.

By way of illustration, here is a small example. First, we create a temporary registry.

library(batchtools)
reg = makeRegistry(file.dir = NA, seed = 1)

Ten jobs are created, two of which will throw an exception.

flakeyFunction <- function(value) {
  if (value %in% c(2, 9)) stop("Ooops.")
  value^2
}
batchMap(flakeyFunction, 1:10)
## Adding 10 jobs ...

Now that the jobs are defined, we can test jobs independently:

testJob(id = 1)
## [1] 1

In this case, testing the job with ID = 1 returns the expected result, but testing the job with ID = 2 raises an error:

as.character(try(testJob(id = 2)))
## [1] "Error in (function (value)  : Ooops.\n"
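The two testing modes described earlier can be sketched side by side. This is a minimal illustration, not part of the original example: it builds its own throwaway registry named tmp (a hypothetical name) with make.default = FALSE so that the registry used in the text stays the default, and it assumes Rscript is on the search path and batchtools is installed for that R installation. Instead of converting the error to a string, it keeps the condition object via tryCatch().

```r
library(batchtools)

# Separate throwaway registry; make.default = FALSE leaves the
# default registry of the running session untouched.
tmp = makeRegistry(file.dir = NA, make.default = FALSE, seed = 1)
batchMap(function(value) {
  if (value == 2) stop("Ooops.")
  value^2
}, 1:3, reg = tmp)

# Job 3 runs in a fresh R process started through Rscript.
res = testJob(id = 3, external = TRUE, reg = tmp)

# Job 2 runs in the current session; the error is caught and
# kept as a condition object rather than a character string.
err = tryCatch(testJob(id = 2, reg = tmp), error = function(e) e)
msg = conditionMessage(err)
```

Keeping the condition object makes it possible to inspect the message (and the call) programmatically, which is handy when the test is part of a script rather than an interactive session.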

When you have already submitted the jobs and suspect that something is going wrong, the first thing to do is to run getStatus() to display a summary of the current state of the system.

submitJobs()
## Submitting 10 jobs in 10 chunks using cluster functions 'Interactive' ...
waitForJobs()
## Syncing 10 files ...
## [1] FALSE
getStatus()
## Status for 10 jobs:
##   Submitted : 10 (100.0%)
##   Queued    :  0 (  0.0%)
##   Started   : 10 (100.0%)
##   Running   :  0 (  0.0%)
##   Done      :  8 ( 80.0%)
##   Error     :  2 ( 20.0%)
##   Expired   :  0 (  0.0%)

The status message shows that two of the jobs could not be executed successfully. To get the IDs of all jobs that failed due to an error, we can use findErrors(), and to retrieve the actual error messages, we can use getErrorMessages().

findErrors()
##    job.id
## 1:      2
## 2:      9
getErrorMessages()
##    job.id terminated error                              message
## 1:      2       TRUE  TRUE Error in (function (value)  : Ooops.
## 2:      9       TRUE  TRUE Error in (function (value)  : Ooops.
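Once the failed jobs are known, the finished ones can be collected without touching the rest. The following sketch assumes the same kind of setup as above, rebuilt in a separate throwaway registry (the name tmp is hypothetical) so that it runs on its own. It uses findDone() and reduceResultsList() to gather the results of the successful jobs only. The commented-out last line indicates how the failed jobs alone might be resubmitted once the cause of the error has been removed; in this toy example they would simply fail again.

```r
library(batchtools)

# Throwaway registry that mirrors the example in the text.
tmp = makeRegistry(file.dir = NA, make.default = FALSE, seed = 1)
batchMap(function(value) {
  if (value %in% c(2, 9)) stop("Ooops.")
  value^2
}, 1:10, reg = tmp)
submitJobs(reg = tmp)
waitForJobs(reg = tmp)

# IDs of jobs that raised an error and of jobs that finished.
failed = findErrors(reg = tmp)
done = findDone(reg = tmp)

# Load only the results of the successfully finished jobs.
results = reduceResultsList(ids = done, reg = tmp)

# After fixing the underlying problem, only the failed jobs
# would need to be computed again:
# submitJobs(ids = failed, reg = tmp)
```

Restricting both the result collection and any resubmission to explicit ID sets is what keeps the amount of recomputation minimal.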

If we want to peek into the R log file of a job to see more context for the error, we can use showLog(), which opens a pager, or getLog() to get the log as a character vector:

writeLines(getLog(id = 9))
## ### [bt 2017-04-21 22:49:06]: This is batchtools v0.9.3
## ### [bt 2017-04-21 22:49:06]: Starting calculation of 1 jobs
## ### [bt 2017-04-21 22:49:06]: Setting working directory to '/tmp'
## ### [bt 2017-04-21 22:49:06]: Memory measurement disabled
## ### [bt 2017-04-21 22:49:06]: Starting job [batchtools job.id=9]
## Error in (function (value)  : Ooops.
## 
## ### [bt 2017-04-21 22:49:06]: Job terminated with an exception [batchtools job.id=9]
## ### [bt 2017-04-21 22:49:06]: Calculation finished!
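Because getLog() returns a plain character vector, ordinary string tools can be used to separate the job's own output from the status lines written by batchtools. The sketch below works on a few log lines copied from the output above, so that it runs without a registry; in practice the vector would come from getLog(id = 9). It assumes that status lines start with "### [bt", as they do in the log shown here.

```r
# Log lines copied from the output above; in a real session
# this vector would be the return value of getLog(id = 9).
log = c(
  "### [bt 2017-04-21 22:49:06]: Starting job [batchtools job.id=9]",
  "Error in (function (value)  : Ooops.",
  "",
  "### [bt 2017-04-21 22:49:06]: Job terminated with an exception [batchtools job.id=9]"
)

# Status lines written by batchtools carry the "### [bt" prefix.
is.bt = grepl("^### \\[bt", log)

# Keep everything else, dropping empty lines.
user.lines = log[!is.bt & nzchar(log)]
```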

You can also grep for error or warning messages:

ids = grepLogs(pattern = "ooops", ignore.case = TRUE)
print(ids)
##    job.id                              matches
## 1:      2 Error in (function (value)  : Ooops.
## 2:      9 Error in (function (value)  : Ooops.