VCF data

Brian J. Knaus

2016-07-25

Most variant calling pipelines result in files containing variant information. The variant call format (vcf) is an increasingly popular format for this data. The genotypes in these files are typically intended to be filtered as an attempt to remove false positives or other problematic variants. A first step in working with this data is to understand their contents.

Three sections

A VCF file can be thought of as having three sections: a meta region, a fix region and a gt region. The meta region is at the top of the file. The information in the meta region defines the abbreviations used elsewhere in the file. It may also document software used to create the file as well as parameters used by this software. Below the meta region, the data are tabular. The first eight columns of this table contain information about each variant. This data may be common over all variants, such as its chromosomal position, or a summary over all samples, such as quality metrics. These data are fixed, or the same, over all samples. Beginning at column ten is a column for every sample. The values in these columns are information for each sample and each variant. The organization of each cell containing a genotype and associated information is specified in column nine. The location of these three regions within a file can be represented by the cartoon below.

Cartoon representation of VCF file organization

Cartoon representation of VCF file organization

The VCF file definition is flexible. This means that there are slots for certain types of data, but any particular software which creates a VCF file does not necessarily use them all. Similarly, authors have the opportunity to include new forms of data, forms which may not have been foreseen by the authors of the VCF definition. The result is that all VCF files do not contain the same information.

For this vignette, we’ll use the example data provided with vcfR.

library(vcfR)
data(vcfR_example)

The meta region

The meta region contains information about the file and its creation, as well as information to interpret abbreviations used elsewhere in the file. Each line of the meta region begins with a double pound sign (‘##’). The example which comes with vcfR is shown below. (Only the first 10 lines are shown for brevity.)

##  [1] "##fileformat=VCFv4.1"                                              
##  [2] "##source=\"GATK haplotype Caller, phased with beagle4\""           
##  [3] "##FILTER=<ID=LowQual,Description=\"Low quality\">"                 
##  [4] "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths"
##  [5] "for the ref and alt alleles in the order listed\">"                
##  [6] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate"   
##  [7] "read depth (reads with MQ=255 or with bad mates are filtered)\">"  
##  [8] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype"      
##  [9] "Quality\">"                                                        
## [10] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"

The first line contains the version of the VCF format used in the file. This line is required. The second line specifies the software which created the VCF file. This is not required, so not all VCF files include it. When they do, the file becomes self documenting. Note that the alignment software is not included here because it was used upstream of the VCF file’s creation (aligners typically create *.SAM or *.BAM format files). Because the file can only include information about the software that created it, the entire pipeline does not get documented. Some VCF files may contain a line for every chromosome (or supercontig or contig depending on your genome), so they may become rather long. Here, the remaining lines contain INFO and FORMAT specifications which define abbreviations used in the fix and gt portions of the file.

The fix region

The fix region contains information for each variant which is sometimes summarized over all samples. The first eight columns of the fixed region and are titled CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. This is per variant information which is ‘fixed’, or the same, over all samples. The first two columns indicate the location of the variant by chromosome and position within that chromosome. Here, the ID field has not been used, so it consists of missing data (NA). The REF and ALT columns indicate the reference and alternate allelic states. When multiple alternate allelic states are present they are delimited with commas. The QUAL column attempts to summarize the quality of each variant over all samples. The FILTER field is not used here but could contain information on whether a variant has passed some form of quality assessment.

##      CHROM              POS   ID REF ALT QUAL     FILTER
## [1,] "Supercontig_1.50" "2"   NA "T" "A" "44.44"  NA    
## [2,] "Supercontig_1.50" "246" NA "C" "G" "144.21" NA    
## [3,] "Supercontig_1.50" "549" NA "A" "C" "68.49"  NA    
## [4,] "Supercontig_1.50" "668" NA "G" "C" "108.07" NA    
## [5,] "Supercontig_1.50" "765" NA "A" "C" "92.78"  NA    
## [6,] "Supercontig_1.50" "780" NA "G" "T" "58.38"  NA

The eigth column, titled INFO, is a semicolon delimited list of information. It can be rather long and cumbersome, which is why its not presented in its entirety here. Each abbreviation in the INFO column should be defined in the meta section. For example, line three from the meta region tells us that ‘DP’ is the raw read depth. We can validate this by querying the meta portion.

strwrap(grep('DP', vcf@meta, value = TRUE))
## [1] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate"   
## [2] "read depth (reads with MQ=255 or with bad mates are filtered)\">"  
## [3] "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read"
## [4] "depth; some reads may have been filtered\">"

We see that ‘DP’ is defined both for the INFO column and for the gt portion of the file (FORMAT). Returning to the INFO column, we see that the value of DP equals 197 for this variant (note the caveats in the meta line). This tells us that this variant was sequenced 197 times over all our samples. A number of other parameters are also included in this column which be explored by querying their definition in the meta region. Here we just look at what is present for the first variant.

unlist(strsplit(as.character(vcf@fix[1, 8]), split=";"))
##  [1] "AC=1"                    "AF=0.042"               
##  [3] "AN=24"                   "BaseQRankSum=1.332"     
##  [5] "ClippingRankSum=-0.097"  "DP=197"                 
##  [7] "FS=2.536"                "InbreedingCoeff=-0.0879"
##  [9] "MLEAC=1"                 "MLEAF=0.042"            
## [11] "MQ=21.70"                "MQ0=0"                  
## [13] "MQRankSum=0.148"         "QD=2.34"                
## [15] "ReadPosRankSum=0.832"    "SOR=1.414"

The gt region

The gt (genotype) region contains information for each variant for each sample. The values for each variant and each sample are colon delimited. Multiple types of data for each genotype may be stored in this manner. The format of the data is specified by column nine, the FORMAT column. Here we see that we have information for GT, AD, DP, GQ and PL. The definition of these acronyms can be referenced by querying the the meta region, as demonstrated previously. Every variant does not necessarily have the same information (e.g., SNPs and indels may be handled differently), so the rows are best treated separately.

##      FORMAT           BL2009P4_us23              
## [1,] "GT:AD:DP:GQ:PL" "0|0:62,0:62:99:0,190,2835"
## [2,] "GT:AD:DP:GQ:PL" "1|0:5,5:10:99:111,0,114"  
## [3,] "GT:AD:DP:GQ:PL" NA                         
## [4,] "GT:AD:DP:GQ:PL" "0|0:1,0:1:3:0,3,44"       
## [5,] "GT:AD:DP:GQ:PL" "0|0:2,0:2:6:0,6,49"       
## [6,] "GT:AD:DP:GQ:PL" "0|0:2,0:2:6:0,6,49"       
##      DDR7602                   IN2009T1_us22              
## [1,] "0|0:12,0:12:39:0,39,585" "0|0:37,0:37:99:0,114,1709"
## [2,] NA                        "0|1:2,1:3:16:16,0,48"     
## [3,] NA                        "0|0:2,0:2:6:0,6,51"       
## [4,] NA                        "1|1:0,1:1:3:25,3,0"       
## [5,] "0|0:1,0:1:3:0,3,34"      "0|0:1,0:1:3:0,3,31"       
## [6,] "0|0:1,0:1:3:0,3,34"      "0|0:3,0:3:9:0,9,85"       
##      LBUS5                     NL07434             
## [1,] "0|0:12,0:12:39:0,39,585" NA                  
## [2,] NA                        NA                  
## [3,] NA                        NA                  
## [4,] NA                        "0|0:1,0:1:3:0,3,28"
## [5,] "0|0:1,0:1:3:0,3,34"      "0|0:1,0:1:3:0,3,26"
## [6,] "0|0:1,0:1:3:0,3,34"      NA                  
##      P10127                    P10650                
## [1,] "0|0:8,0:8:24:0,24,360"   "0|0:1,0:1:3:0,3,45"  
## [2,] "0|0:1,0:1:3:0,3,25"      "0|1:3,1:4:14:14,0,86"
## [3,] "1|1:0,4:4:12:99,12,0"    NA                    
## [4,] "0|0:2,0:2:6:0,6,54"      "0|0:1,0:1:3:0,3,35"  
## [5,] "0|0:12,0:12:36:0,36,297" "0|0:2,0:2:6:0,6,49"  
## [6,] "0|0:10,0:10:30:0,30,274" "0|0:2,0:2:6:0,6,49"

vcfR

Using the R package vcfR, we can read VCF format files into memory using the function read.vcfR(). Once in memory we can use the head() method to summarize the information in the three VCF regions.

head(vcf)
## [1] "***** Object of class 'vcfR' *****"
## [1] "***** Meta section *****"
## [1] "##fileformat=VCFv4.1"
## [1] "##source=\"GATK haplotype Caller, phased with beagle4\""
## [1] "##FILTER=<ID=LowQual,Description=\"Low quality\">"
## [1] "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths fo [Truncated]"
## [1] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read  [Truncated]"
## [1] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">"
## [1] "First 6 rows."
## [1] 
## [1] "***** Fixed section *****"
##      CHROM              POS   ID REF ALT QUAL     FILTER
## [1,] "Supercontig_1.50" "2"   NA "T" "A" "44.44"  NA    
## [2,] "Supercontig_1.50" "246" NA "C" "G" "144.21" NA    
## [3,] "Supercontig_1.50" "549" NA "A" "C" "68.49"  NA    
## [4,] "Supercontig_1.50" "668" NA "G" "C" "108.07" NA    
## [5,] "Supercontig_1.50" "765" NA "A" "C" "92.78"  NA    
## [6,] "Supercontig_1.50" "780" NA "G" "T" "58.38"  NA    
## [1] 
## [1] "***** Genotype section *****"
##      FORMAT           BL2009P4_us23              
## [1,] "GT:AD:DP:GQ:PL" "0|0:62,0:62:99:0,190,2835"
## [2,] "GT:AD:DP:GQ:PL" "1|0:5,5:10:99:111,0,114"  
## [3,] "GT:AD:DP:GQ:PL" NA                         
## [4,] "GT:AD:DP:GQ:PL" "0|0:1,0:1:3:0,3,44"       
## [5,] "GT:AD:DP:GQ:PL" "0|0:2,0:2:6:0,6,49"       
## [6,] "GT:AD:DP:GQ:PL" "0|0:2,0:2:6:0,6,49"       
##      DDR7602                   IN2009T1_us22              
## [1,] "0|0:12,0:12:39:0,39,585" "0|0:37,0:37:99:0,114,1709"
## [2,] NA                        "0|1:2,1:3:16:16,0,48"     
## [3,] NA                        "0|0:2,0:2:6:0,6,51"       
## [4,] NA                        "1|1:0,1:1:3:25,3,0"       
## [5,] "0|0:1,0:1:3:0,3,34"      "0|0:1,0:1:3:0,3,31"       
## [6,] "0|0:1,0:1:3:0,3,34"      "0|0:3,0:3:9:0,9,85"       
##      LBUS5                     NL07434             
## [1,] "0|0:12,0:12:39:0,39,585" NA                  
## [2,] NA                        NA                  
## [3,] NA                        NA                  
## [4,] NA                        "0|0:1,0:1:3:0,3,28"
## [5,] "0|0:1,0:1:3:0,3,34"      "0|0:1,0:1:3:0,3,26"
## [6,] "0|0:1,0:1:3:0,3,34"      NA                  
## [1] "First 6 columns only."
## [1] 
## [1] "Unique GT formats:"
## [1] "GT:AD:DP:GQ:PL"
## [1]

We now have a summary of our VCF file which we can use to help understand what forms of information are contained within it. This information can be further explored with plotting functions and used to filter the VCF file for high quality variants. These topics are discussed in other vignettes.