An overview of targets

This vignette is a high-level overview of targets and its educational materials. The goal is to summarize the major features of targets and direct users to the appropriate resources. It explains how to get started, and then it briefly describes each chapter of the user manual.

What is targets?

The targets R package is a Make-like pipeline toolkit for statistics and data science in R. targets accelerates analysis with easy-to-configure parallel computing, enhances reproducibility, and reduces the burdens of repeated computation and manual data micromanagement. A fully up-to-date targets pipeline is tangible evidence that the output aligns with the code and data, which substantiates trust in the results.

How to get started

The top of the reference website links to a number of materials to help new users start learning targets. It lists online talks, tutorials, books, and workshops in the order that a new user should consume them. The rest of the main page outlines a more comprehensive list of resources.

The walkthrough

The user manual starts with a walkthrough chapter, a short tutorial to quickly get started with targets using a simple example project. That project also has a repository with the source code and an RStudio Cloud workspace that lets you try out the workflow in a web browser. Sign up for a free RStudio Cloud account, click on the link, and try out the functions tar_make() and tar_read() in the R console.
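For readers who want a feel for tar_make() and tar_read() before opening the walkthrough, here is a minimal sketch. It is not the example project from the manual: the target names (raw_data, column_means) and the helper summarize_data() are hypothetical. It uses tar_dir() and tar_script() from targets so that everything runs in a temporary directory instead of a real project.

```r
library(targets)

# tar_dir() runs the enclosed code from a temporary directory.
tar_dir({
  # tar_script() writes the enclosed code to a _targets.R file.
  tar_script({
    library(targets)
    # Hypothetical user-defined function for the illustration.
    summarize_data <- function(data) {
      colMeans(data)
    }
    list(
      tar_target(raw_data, data.frame(x = c(1, 2, 3), y = c(4, 5, 6))),
      tar_target(column_means, summarize_data(raw_data))
    )
  }, ask = FALSE)
  tar_make() # Build every target that is not already up to date.
  means <- tar_read(column_means) # Retrieve a stored result by target name.
  print(means)
})
```

Calling tar_make() a second time in the same project should skip both targets, because neither the code nor the data has changed.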

Debugging

The debugging chapter describes two alternative built-in systems for troubleshooting errors. The first system uses workspaces, which let you load a target’s dependencies into your R session. This approach is usually preferred, especially with large pipelines on computing clusters, but it may still require some manual work. The second system launches an interactive debugger while the pipeline is actually running, which may not be feasible in some situations, but can often help you reach the problem more quickly.
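The workspace approach can be sketched as follows, assuming a version of targets that supports the workspace_on_error option of tar_option_set(). The targets numbers and total are hypothetical, and total fails on purpose so that a workspace is saved.

```r
library(targets)

tar_dir({
  tar_script({
    library(targets)
    # Save a workspace for any target that throws an error.
    tar_option_set(workspace_on_error = TRUE)
    list(
      tar_target(numbers, c(1, 2, 3)),
      # Deliberate failure for demonstration purposes.
      tar_target(total, stop("hypothetical failure: ", sum(numbers)))
    )
  }, ask = FALSE)
  try(tar_make(), silent = TRUE) # The pipeline stops at the failed target.
  print(tar_workspaces()) # List the targets that have a saved workspace.
  tar_workspace(total) # Load the dependencies of total into this session.
  print(numbers) # The upstream value is now available for inspection.
})
```

With the dependencies loaded, the failing command can be rerun by hand in the console to investigate the error.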

Functions

targets expects users to adopt a function-oriented style of programming. User-defined R functions are essential to express the complexities of data generation, analysis, and reporting. The user manual has a whole chapter dedicated to user-defined functions for data science, and it explains why they are important and how to use them in targets-powered pipelines.

Best practices

The best practices chapter contains general recommendations about targets pipelines in the real world, including how to define good targets, what to do if your workflow is itself an R package, and how to diagnose performance issues.

External files and literate programming

targets has special ways to include data files and literate programming reports in a pipeline. This functionality is optional in the general case, but it is necessary if you want a target to rerun in response to a change in a data file, or if you want an R Markdown report to re-render when an upstream target changes. The files chapter walks through this functionality, from input data to parameterized R Markdown.
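As a rough illustration of file tracking (the R Markdown side is left to the files chapter), the sketch below declares an input file with format = "file", so the pipeline watches the contents of the file instead of only the path string. The file name data.csv and the target names data_file and measurements are assumptions made for this example.

```r
library(targets)

tar_dir({
  # Hypothetical input file with two rows.
  write.csv(data.frame(x = 1:2), "data.csv", row.names = FALSE)
  tar_script({
    library(targets)
    list(
      # The command returns the path, and format = "file" tracks the file.
      tar_target(data_file, "data.csv", format = "file"),
      tar_target(measurements, read.csv(data_file))
    )
  }, ask = FALSE)
  tar_make()
  rows_before <- nrow(tar_read(measurements))
  # Change the file: it now has three rows.
  write.csv(data.frame(x = 1:3), "data.csv", row.names = FALSE)
  tar_make() # data_file and measurements rerun because the file changed.
  rows_after <- nrow(tar_read(measurements))
})
```

Without format = "file", the first target would only store the string "data.csv", and edits to the file would go unnoticed.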

Dynamic branching

Sometimes, a pipeline contains more targets than a user can comfortably type by hand. For projects with hundreds of targets, branching can make the _targets.R file more concise and easier to read and maintain. Dynamic branching is a way to create new targets while the pipeline is running, and it is best suited to iterating over a large number of very similar tasks. The dynamic branching chapter outlines this functionality, including how to create branching patterns, different ways to iterate over data, and recommendations for batching large numbers of small tasks into a comfortably small number of dynamic branches.
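A small sketch of a branching pattern, with hypothetical target names values and squares: the pattern argument of tar_target() tells targets to create one branch per element of an upstream target while the pipeline runs.

```r
library(targets)

tar_dir({
  tar_script({
    library(targets)
    list(
      tar_target(values, c(1, 2, 3)),
      # One dynamic branch per element of values.
      tar_target(squares, values^2, pattern = map(values))
    )
  }, ask = FALSE)
  tar_make()
  # tar_read() combines the branches into a single object.
  squares <- tar_read(squares)
  print(squares)
})
```

Three branches are tiny enough to type by hand. The same pattern applies unchanged when the upstream target has hundreds of elements, which is where batching becomes relevant.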

Static branching

Static branching is the act of defining a group of targets in bulk before the pipeline starts. Whereas dynamic branching uses last-minute dependency data to define the branches, static branching uses metaprogramming to modify the code of the pipeline up front. Dynamic branching excels at creating a large number of very similar targets, while static branching is most useful for a smaller number of heterogeneous targets. Some users find it more convenient because they can use tar_manifest() and tar_visnetwork() to check the correctness of static branching before launching the pipeline. Read more about it in the static branching chapter.
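One way to sketch static branching is with tar_map() from the companion package tarchetypes, which is assumed to be installed here. The values list and the target name rows are hypothetical. Because the targets are defined up front, tar_manifest() can list them without running anything.

```r
library(targets)

tar_dir({
  tar_script({
    library(targets)
    library(tarchetypes)
    list(
      # tar_map() creates one copy of the target per element of values.
      tar_map(
        values = list(dataset = c("mtcars", "iris")),
        tar_target(rows, nrow(get(dataset)))
      )
    )
  }, ask = FALSE)
  # Inspect target names and commands before launching the pipeline.
  manifest <- tar_manifest()
  print(manifest)
})
```

The manifest should show one target per dataset, each with the dataset name substituted into its command, which makes mistakes in the branching specification visible before any computation starts.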

High-performance computing

targets is capable of distributing the computation in a pipeline across multiple cores of a laptop or multiple nodes of a computing cluster. It interfaces with these technologies through the packages clustermq and future, and it automatically deploys ready targets to parallel workers while making sure the other targets wait for their upstream dependencies to finish. Read more about high-performance computing in the HPC chapter.
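As a hedged sketch of the future backend, the example below assumes that the future package is installed and that the installed version of targets still provides tar_make_future() (newer releases may steer users toward other backends). The targets slow_a, slow_b, and combined are hypothetical. The two slow targets do not depend on each other, so they are eligible to run on separate workers, while combined waits for both.

```r
library(targets)

tar_dir({
  tar_script({
    library(targets)
    # Local parallel backend: two background R sessions.
    future::plan(future::multisession, workers = 2)
    list(
      tar_target(slow_a, {Sys.sleep(1); 1}),
      tar_target(slow_b, {Sys.sleep(1); 2}),
      # Depends on both upstream targets, so it runs last.
      tar_target(combined, slow_a + slow_b)
    )
  }, ask = FALSE)
  tar_make_future(workers = 2)
  combined <- tar_read(combined)
  print(combined)
})
```

On a cluster, the future plan would be swapped for one that submits jobs to the scheduler; the target definitions stay the same.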

Cloud computing

Users with Amazon Web Services accounts can store their targets in one or more S3 buckets, and retrieval with tar_read() and tar_load() is seamless. The cloud computing chapter is a step-by-step guide that walks through how to get started with Amazon Web Services and connect a targets pipeline to S3.

What about drake?

The drake package is an older and more established R-focused pipeline toolkit, and it is the predecessor of targets. The drake chapter of the targets manual helps drake users understand the role of targets, the future direction of drake, how to transition to targets, and the advantages of targets over drake.