Introduction

The CaRpools (CRISPR AnalyzeR for Pooled Screens) package allows users to analyze raw NGS read count data from pooled CRISPR screens in an end-to-end fashion and serves as a basis for creating customized reports for more advanced users.
These pooled screens must use lentivirus-based libraries, such as those that can be obtained via Addgene. The package provides functions to create different quality control plots, normalize the data, compare the data and perform hit identification via three different methods.
Furthermore, it can be used to analyze the data completely in a streamlined workflow with the provided analysis template.

With CaRpools, the user can analyze pooled CRISPR/Cas9 screens end-to-end in a standardized fashion allowing for reproducible data analysis.
This includes:

Quality Control

Available Quality Control plots include (for sgRNAs or summarized for genes):

Annotation

Since pooled CRISPR knockout libraries are built with a variety of gene identifiers (e.g. Ensembl IDs),
loaded data sets can be annotated automatically with further gene information, such as the official gene symbol or a description,
using the biomaRt interface.
More details can be found below in the section on get.gene.info or via ?get.gene.info / ?biomaRt.

Gene Read Count Data or Removing Gene Data

Furthermore, sgRNA read count data can be aggregated (summed up) per gene with aggregatetogenes, or gene data can be excluded from the analysis using gene.remove.
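
To illustrate what summing sgRNA read counts per gene means, here is a minimal Python sketch. It is not the implementation of aggregatetogenes: the function name aggregate_to_genes, the choice of splitting the identifier at the first separator and the read counts are all assumptions made for this example.

```python
from collections import defaultdict


def aggregate_to_genes(sgrna_counts, separator="_"):
    """Sum sgRNA read counts per gene.

    Assumption of this sketch: the gene identifier is everything before
    the first separator, e.g. "Gene1_3423" -> "Gene1".
    """
    gene_counts = defaultdict(int)
    for sgrna_id, count in sgrna_counts.items():
        gene, sep, _ = sgrna_id.partition(separator)
        if not sep:
            raise ValueError(f"no separator {separator!r} in {sgrna_id!r}")
        gene_counts[gene] += count
    return dict(gene_counts)


# Invented read counts, for illustration only.
counts = {"Gene1_3423": 120, "Gene1_3424": 80, "Gene2_0001": 15}
gene_totals = aggregate_to_genes(counts)
print(gene_totals)
```

Excluding a gene from the analysis would then amount to dropping its key from the resulting dictionary before any downstream step.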

Hit Analysis

Moreover, this package can be used to identify screening hits with three different methods.

Hit calling and data analysis can be compared between all the methods with compare.analysis or carpools.hit.overview. Finally, the evaluated data can be visualized in different ways with carpools.hitident for each of these methods.

sgRNA Information

Moreover, you will also be provided with in-depth information about the sgRNAs of your genes.
These can be derived from carpools.raw.genes for any gene or carpools.hit.sgrna for your hit candidates in an automated way.

Report

All this data can be used to automatically generate a standardized report using the provided R Markdown template file, including all plots and tables for your data analysis.
Two different templates are provided:

The R Markdown templates allow the user to generate HTML output as well as PDF output (which requires a LaTeX installation), including high-quality plots.

Provided PDF templates:

Provided HTML templates:

Requirements and Installation

Virtual Box Image

We also provide a Virtual Box image that already includes all necessary software and package files.
You only need to install Virtual Box 5 from the Virtual Box website.

You can download the caRpools virtual box image from our website crispr-analyzer.de or Github (https://github.com/boutroslab/carpools).

How to use the Virtual Box caRpools

Please see the VirtualBox tutorial for instructions.

Download caRpools

CaRpools is available as an R package caRpools without the scripts and template files.
The complete package with the PERL scripts and all template files can be obtained from Github (https://github.com/boutroslab/carpools) and our website crispr-analyzer.de.

We recommend downloading the template files and scripts from Github and installing caRpools in R using the package installer `install.packages("caRpools")`.

Hardware Requirements

For CRISPR libraries of 12 K size (12 K sgRNAs), caRpools will work on any laptop/PC with at least 4 GB of RAM and a modern dual-core CPU.
CRISPR libraries with more than 100 K sgRNAs run best with at least 8 GB of RAM.

Software Requirements

CaRpools was tested on MacOSX Yosemite and Ubuntu 14.04 LTS.
However, it should work on any operating system that fulfills the software requirements.

The following software needs to be installed:

The following R packages need to be installed (this can be done via load.packages()):

BiomaRt and Annotation Requirements

Please note that for any annotation, biomaRt needs full access to the internet. In case of incorrect proxy settings, the report generation will fail with a biomaRt error.
This means that if any proxy server is used, this has to be configured before using caRpools as described in the following articles:

Installation Procedure

Install all software listed above according to the installation information stated on the software website.
All necessary R packages can be installed automatically by load.packages().

Dataset / Screening Requirements

Since CaRpools fosters reproducibility of CRISPR/Cas9 screens, the following requirements for pooled screening data must be fulfilled to analyze data:

The usage of more than two replicates at once is not yet implemented, but will be in the near future.

Files that can be loaded

The following files are required for data analysis:

Either
* NGS FASTQ file for each sample
* Library reference in FASTA format

or
* Final read count file for each sample
* Library reference in FASTA format

CaRpools accepts either FASTQ files or read count files for each sample. If FASTQ files are provided, the target sequences are first extracted and then mapped against the library reference using Bowtie2. Finally, a read count file is generated for each sample.

Alternatively, final read count files can be provided directly, so that no extraction or mapping is necessary.
In both cases, a library reference file in FASTA format is required. Usually this is the file that was used for ordering the custom oligo library.
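
The path from extracted sequences to a read count file can be sketched in Python as follows. Note that this sketch replaces the Bowtie2 alignment with exact string matching against the library reference, which is a deliberate simplification and not what CaRpools does internally. The function name count_reads, the sgRNA identifiers and all sequences are invented for illustration.

```python
def count_reads(extracted_sequences, library):
    """Count extracted target sequences per sgRNA.

    library maps sgRNA identifier -> target sequence. Exact matching
    stands in for the alignment step and tolerates no mismatches.
    """
    sequence_to_id = {seq.upper(): sgrna_id for sgrna_id, seq in library.items()}
    counts = {sgrna_id: 0 for sgrna_id in library}
    unmapped = 0
    for seq in extracted_sequences:
        sgrna_id = sequence_to_id.get(seq.upper())
        if sgrna_id is None:
            unmapped += 1  # sequence not present in the library reference
        else:
            counts[sgrna_id] += 1
    return counts, unmapped


# Hypothetical library reference and extracted reads.
library = {
    "Gene1_3423": "ACGTACGTACGTACGTACGT",
    "Gene2_0001": "TTTTCCCCGGGGAAAATTTT",
}
reads = [
    "ACGTACGTACGTACGTACGT",
    "acgtacgtacgtacgtacgt",
    "TTTTCCCCGGGGAAAATTTT",
    "GGGGGGGGGGGGGGGGGGGG",
]
read_counts, unmapped_reads = count_reads(reads, library)
print(read_counts, unmapped_reads)
```

Every sgRNA of the library appears in the result, including those with zero reads, which keeps the read count files of different samples comparable.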

File structures are shown below.

Structure of NGS Readcount Data

General information about the FASTQ file format can be obtained in an easy-to-understand article from Wikipedia.

maschine.pattern
The machine pattern used for extracting the sequences is a regular expression that identifies the read ID, which includes the ID of your sequencing machine.
In the case of the above sample, the PERL regular expression used must be M01100.
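
As a rough illustration of how such a machine pattern picks out read IDs, the following Python sketch applies the expression M01100 to the header line of each FASTQ record. The two records, including the second machine ID M09999, are invented for this example, and the helper matching_read_ids is hypothetical and not part of CaRpools.

```python
import re

# Machine pattern taken from the text; Python's re syntax is close to
# PERL's for a simple literal pattern like this one.
machine_pattern = re.compile(r"M01100")

# Two invented FASTQ records (4 lines each: ID, sequence, "+", quality).
fastq_text = """@M01100:47:000000000-A5K8F:1:1101:15492:1336 1:N:0:1
ACGTACGTAC
+
IIIIIIIIII
@M09999:12:000000000-B7XYZ:1:1101:10233:2001 1:N:0:1
TTGCATTGCA
+
IIIIIIIIII
"""


def matching_read_ids(text, pattern):
    """Return the read ID lines that contain the machine pattern."""
    lines = text.strip().splitlines()
    headers = lines[0::4]  # every fourth line starts a new record
    return [h for h in headers if h.startswith("@") and pattern.search(h)]


ids = matching_read_ids(fastq_text, machine_pattern)
print(ids)
```

Only the header line of each record is tested, because quality lines may also begin with the @ character.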

Extraction Pattern

CaRpools extracts the integrated DNA sequence of your target sequence as a DNA barcode.
In order to extract this sequence, a PERL regular expression pattern is used to identify the desired nucleotide sequence, which is called seq.pattern.

As an example, part of a U6 promoter-driven sgRNA cassette is given as follows:

Since we want to extract the target sequence, the regular expression will use a part of the U6 promoter and a part of the sgRNA backbone to identify the target sequence.

CACC(.{20})GTTTTAGAGC

The parentheses are necessary to capture the target sequence. For more information, please see RegExR.
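
The effect of the capture group can be tried out with a short Python sketch. The pattern is written without whitespace, because a space inside a regular expression is matched literally. The read and the 20 nt target sequence are invented, and the helper extract_target is hypothetical; the extraction in CaRpools itself is done by a PERL script.

```python
import re

# Part of the U6 promoter, 20 captured nucleotides, part of the sgRNA backbone.
seq_pattern = re.compile(r"CACC(.{20})GTTTTAGAGC")


def extract_target(read):
    """Return the captured 20 nt target sequence, or None if the flanks are absent."""
    match = seq_pattern.search(read)
    return match.group(1) if match else None


# Invented read: upstream bases + CACC + target + backbone + downstream bases.
target = "GATTACAGATTACAGATTAC"
read = "TTGTGGAAAGGACGAAA" + "CACC" + target + "GTTTTAGAGC" + "TAGAAATAGC"

extracted = extract_target(read)
not_found = extract_target("ACGT" * 12)
print(extracted, not_found)
```

Without the parentheses the expression would still match the read, but there would be no group from which to read the target sequence.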

Structure of FASTA Library Reference File

The library reference file must be in FASTA format and include ALL sgRNAs present in the data-set with exactly the same naming.
e.g.

Read-Count Data Files

CaRpools also takes read count files. If FASTQ files are provided and extracted/mapped with CaRpools, read count files will be created for each sample.
Data for each sample must be formatted in a tab-separated way as follows:

As shown in the above file, Gene1 is the gene identifier, _3423 is the part that identifies this sgRNA for the given gene, and Gene1_3423 is the complete identifier, which uniquely identifies this sgRNA within the data set.

The complete name of each sgRNA must contain the gene it refers to, either as a gene symbol or as any other gene identifier, followed by a separator and a part that is unique for each sgRNA of that gene.
The name can therefore be anything, as long as the gene identifier is the same for all sgRNAs of a gene and the separator is the same for all data.

In principle, an sgRNA identifier must consist of a gene identifier, followed by a separator (e.g. _ or -) and a secondary sgRNA identifier.

Please note:
* Read counts MUST be numeric only.
* Within the sgRNA/Gene identifier, no special characters except for _ and - are allowed!
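
These rules can be checked before loading a file. The Python sketch below is one possible reading of them, under two assumptions: that "no special characters" means letters and digits plus _ and -, and that the identifier is split at the first separator. The function validate_row is hypothetical and not part of CaRpools.

```python
import re

# Assumption: only letters, digits, "_" and "-" are allowed in identifiers.
ALLOWED_ID = re.compile(r"[A-Za-z0-9_-]+")
NUMERIC = re.compile(r"[0-9]+")


def validate_row(sgrna_id, count, separator="_"):
    """Return a list of problems found in one read count row (empty if none)."""
    problems = []
    if not ALLOWED_ID.fullmatch(sgrna_id):
        problems.append("identifier contains characters other than letters, digits, _ and -")
    gene, sep, sgrna_part = sgrna_id.partition(separator)
    if not (gene and sep and sgrna_part):
        problems.append("identifier is not of the form <gene><separator><sgRNA part>")
    if not NUMERIC.fullmatch(count):
        problems.append("read count is not numeric")
    return problems


# Invented rows, for illustration only.
ok = validate_row("Gene1_3423", "120")
bad_count = validate_row("Gene1_3423", "12.5")
bad_chars = validate_row("Gene 1_3423", "10")
no_separator = validate_row("Gene1", "10")
print(ok, bad_count, bad_chars, no_separator)
```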

File Loading

Files can be loaded via load.file(filename, header, sep) with the following arguments

  • filename: the filename as character
  • header: TRUE / FALSE, whether the file has a header row (as it does in the above sample)
  • sep: the separator, default is \t for tab-separated files.

Please see ?read.table for a detailed description of the arguments.
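
For readers who want to inspect a read count file outside of R, the three arguments have direct counterparts in a small Python reader. This is only a sketch of the same idea and not the implementation of load.file: the function load_read_counts and the column names in the sample are assumptions.

```python
import csv
import io


def load_read_counts(handle, header=True, sep="\t"):
    """Read (sgRNA identifier, read count) rows from a delimited text stream."""
    reader = csv.reader(handle, delimiter=sep)
    if header:
        next(reader, None)  # skip the header row
    return [(row[0], int(row[1])) for row in reader if row]


# Invented tab-separated sample; io.StringIO stands in for an opened file.
sample = io.StringIO("sgRNA\tCount\nGene1_3423\t120\nGene2_0001\t15\n")
rows = load_read_counts(sample, header=True, sep="\t")
print(rows)
```

With a real file, the handle would come from open(filename, newline="") instead of io.StringIO.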

Files and Folder Structure to use CaRpools

Please note: the MAIN FOLDER must be the R working directory!
Data and Script paths can be adjusted in the MIACCS file.

The following files are necessary to use CaRpools for report generation:

MIACCS.xls
Minimum Information About CRISPR/Cas Screens. This file needs to be filled out to provide all necessary information about the screen.

R Markdown Template files
One of CaRpools-extended-PDF.rmd, CaRpools-PDF.rmd, CaRpools-extended-HTML.rmd or CaRpools-HTML.rmd. This is the template for report generation.

Data Files
Two replicates each for the control and the treated condition. These can be FASTQ files OR already mapped, not normalized read count files.

CRISPR-mapping.pl
PERL script to map your extracted FASTQ files, if desired (as indicated in the MIACCS.xls)

CRISPR-extract.pl
PERL script to extract the 20 nt target sequence from FASTQ files, if desired (as indicated in the MIACCS.xls)

CaRpools.png
The logo file
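
Before generating a report, it can help to check that these files are in place. The following Python sketch does so under the assumption that all of them sit directly in the main folder (script and data paths can be changed in the MIACCS file, in which case the check would need adjusting). The function missing_files is hypothetical, and the two PERL scripts are only checked when FASTQ input is used.

```python
import tempfile
from pathlib import Path

REQUIRED = ["MIACCS.xls", "CaRpools.png"]
FASTQ_SCRIPTS = ["CRISPR-extract.pl", "CRISPR-mapping.pl"]
TEMPLATES = [
    "CaRpools-extended-PDF.rmd",
    "CaRpools-PDF.rmd",
    "CaRpools-extended-HTML.rmd",
    "CaRpools-HTML.rmd",
]


def missing_files(main_folder, fastq_input=False):
    """List the files that are missing from the main folder."""
    folder = Path(main_folder)
    wanted = REQUIRED + (FASTQ_SCRIPTS if fastq_input else [])
    missing = [name for name in wanted if not (folder / name).is_file()]
    # At least one of the four report templates has to be present.
    if not any((folder / name).is_file() for name in TEMPLATES):
        missing.append("one of: " + ", ".join(TEMPLATES))
    return missing


# Demonstration in a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("MIACCS.xls", "CaRpools.png", "CaRpools-PDF.rmd"):
        (Path(tmp) / name).touch()
    missing_for_counts = missing_files(tmp)
    missing_for_fastq = missing_files(tmp, fastq_input=True)

print(missing_for_counts, missing_for_fastq)
```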

The following files are necessary to use single CaRpools functions:

Data Files
Either raw read count files or FASTQ files (that need to be extracted and mapped using CaRpools)

Please note that CaRpools always starts with loading data files. For raw-read-count files, use load.file. For FASTQ files, please see the sections below.

CaRpools folder structure for Report Generation using raw Read Count files:

CaRpools folder structure for Report Generation using raw Read Count files AFTER REPORT GENERATION:

CaRpools folder structure for Report Generation using FASTQ files:

CaRpools folder structure for Report Generation using FASTQ files AFTER REPORT GENERATION: