Reading and preparing the data

Hakon K. Gjessing and Julia Romanowska

2019-04-17

Reading the data into Haplin

Haplin reads data in two formats:

  1. Haplin’s own text file format;

  2. PED format.

Both types of data are read in through the use of genDataRead function:

dir.exmpl <- system.file( "extdata", package = "Haplin" )
exemplary.file1 <- paste0( dir.exmpl, "/HAPLIN.trialdata.txt" )

my.gen.data.haplin <- genDataRead( file.in = exemplary.file1, file.out = "trial_data1",
    dir.out = ".", format = "haplin", n.vars = 0 )
## Reading the data in chunks...
##  -- chunk 1 --
##  -- chunk 2 -- 
## ... done reading.
## Preparing data...
## ... done preparing
## Saving data...
## ... saved to files: ./trial_data1_gen.ffData, ./trial_data1_gen.RData
exemplary.file3 <- paste0( dir.exmpl, "/exmpl_data.ped" )
my.gen.data <- genDataRead( exemplary.file3, file.out = "ped_data", dir.out = ".",
    format = "ped" )
## 'n.vars' was not given explicitly and will be set to 6 based on the format given.
## Reading the data in chunks...
##  -- chunk 1 --
##  -- chunk 2 -- 
## ... done reading.
## Preparing data...
## ... done preparing
## Saving data...
## ... saved to files: ./ped_data_gen.ffData, ./ped_data_gen.RData

The function reads in all the data in the file, creates ff objects to store the genetic information and data.frame to store covariate data (if any). These objects are saved in .RData and .ffData files, which can be later on easily uploaded to R (with genDataLoad) and re-used.

CAUTION: This can take a long time for large datasets (such as from GWAS analysis, e.g., reading in a 7 GB file will take ca.15 minutes), however, this needs to be run only once and then, the next time you need to use the data, use the genDataLoad function (see below). Be careful NOT TO DELETE the output files!

The genDataRead function returns a list object with two elements:

The above function reads in also any additional covariate data the user has through the parameter cov.file.in:

my.gen.data.haplin2 <- genDataRead( file.in = "my_gen_data_hap.dat",
  file.out = "my_saved_gen_data_hap2", cov.file.in = "add_cov.dat",
  dir.out = ".", format = "haplin" )

The file with the additional information should have a header with names of the data columns!

To see all the available arguments and usage examples, type:

?genDataRead

(this works also with any other function)

Reading covariate data

Together with the genotype data, the user can read covariates that are in a separate file. This is done at the same time as reading genotype data.

## Reading the data in chunks...
##  -- chunk 1 --
##  -- chunk 2 -- 
## ... done reading.
## Preparing data...
## ... done preparing
## Reading covariate file... 
##     'cov.header' not given - assuming the first line is the header...
## ...done
## Saving data...
## ... saved to files: ./trial_data3_gen.ffData, ./trial_data3_gen.RData
## This is raw genetic data read in through genDataRead.
## 
## It contains the following parts:
##    cov.data, gen.data, aux 
## 
## with following dimensions:
##   - covariate variables =  smoke, vitamin, pheno1, pheno2 
##       (total  4  covariate variables),
##   - number of markers =  2 ,  - number of data lines =  249
## 'n.vars' was not given explicitly and will be set to 6 based on the format given.
## Reading the data in chunks...
##  -- chunk 1 --
##  -- chunk 2 -- 
## ... done reading.
## Preparing data...
## ... done preparing
## Reading covariate file... 
##     'cov.header' not given - assuming the first line is the header...
## ...done
## Saving data...
## ... saved to files: ./ped_data2_gen.ffData, ./ped_data2_gen.RData
## This is raw genetic data read in through genDataRead.
## 
## It contains the following parts:
##    cov.data, gen.data, aux 
## 
## with following dimensions:
##   - covariate variables =  id.fam, id.c, id.f, id.m, sex, cc, smoke, vitamin, pheno1, pheno2 
##       (total  10  covariate variables),
##   - number of markers =  429 ,  - number of data lines =  1659

Accessing the information in the loaded data

The object created by genDataRead includes a lot of information. We have created functions that will help the user to navigate it.

Displaying and extracting phenotype information

First of all, when you type in the name of the object, a short summary will be displayed:

## This is raw genetic data read in through genDataRead.
## 
## It contains the following parts:
##    cov.data, gen.data, aux 
## 
## with following dimensions:
##   - covariate variables =  id.fam, id.c, id.f, id.m, sex, cc 
##       (total  6  covariate variables),
##   - number of markers =  429 ,  - number of data lines =  1659

If you want to show and/or extract part of phenotype information, you can use the showPheno function:

##      id.fam id.c id.f id.m sex cc 
## [1,] "1"    "1"  "3"  "2"  "2" "1"
## [2,] "1"    "2"  "0"  "0"  "2" "0"
## [3,] "1"    "3"  "0"  "0"  "1" "1"
## [4,] "2"    "1"  "3"  "2"  "1" "0"
## [5,] "2"    "2"  "0"  "0"  "2" "0"
##       id.fam id.c id.f id.m sex cc 
##  [1,] "1"    "1"  "3"  "2"  "2" "1"
##  [2,] "1"    "2"  "0"  "0"  "2" "0"
##  [3,] "1"    "3"  "0"  "0"  "1" "1"
##  [4,] "2"    "1"  "3"  "2"  "1" "0"
##  [5,] "2"    "2"  "0"  "0"  "2" "0"
##  [6,] "2"    "3"  "0"  "0"  "1" "1"
##  [7,] "3"    "1"  "3"  "2"  "1" "0"
##  [8,] "3"    "2"  "0"  "0"  "2" "1"
##  [9,] "3"    "3"  "0"  "0"  "1" "0"
## [10,] "4"    "1"  "3"  "2"  "1" "1"
## [11,] "4"    "2"  "0"  "0"  "2" "0"
## [12,] "4"    "3"  "0"  "0"  "1" "0"
## [13,] "5"    "1"  "3"  "2"  "2" "1"
## [14,] "5"    "2"  "0"  "0"  "2" "0"
## [15,] "5"    "3"  "0"  "0"  "1" "1"
## [16,] "6"    "1"  "3"  "2"  "1" "1"
## [17,] "6"    "2"  "0"  "0"  "2" "1"
## [18,] "6"    "3"  "0"  "0"  "1" "0"
## [19,] "7"    "1"  "3"  "2"  "2" "1"
## [20,] "7"    "2"  "0"  "0"  "2" "1"
##       id.fam id.c id.f id.m sex cc 
##  [1,] "2"    "1"  "3"  "2"  "1" "0"
##  [2,] "2"    "2"  "0"  "0"  "2" "0"
##  [3,] "2"    "3"  "0"  "0"  "1" "1"
##  [4,] "3"    "1"  "3"  "2"  "1" "0"
##  [5,] "3"    "2"  "0"  "0"  "2" "1"
##  [6,] "3"    "3"  "0"  "0"  "1" "0"
##  [7,] "4"    "1"  "3"  "2"  "1" "1"
##  [8,] "4"    "2"  "0"  "0"  "2" "0"
##  [9,] "4"    "3"  "0"  "0"  "1" "0"
## [10,] "5"    "1"  "3"  "2"  "2" "1"
## [11,] "5"    "2"  "0"  "0"  "2" "0"
## [12,] "5"    "3"  "0"  "0"  "1" "1"
##       id.fam id.c id.f id.m sex cc 
##  [1,] "1"    "1"  "3"  "2"  "2" "1"
##  [2,] "1"    "2"  "0"  "0"  "2" "0"
##  [3,] "2"    "2"  "0"  "0"  "2" "0"
##  [4,] "3"    "2"  "0"  "0"  "2" "1"
##  [5,] "4"    "2"  "0"  "0"  "2" "0"
##  [6,] "5"    "1"  "3"  "2"  "2" "1"
##  [7,] "5"    "2"  "0"  "0"  "2" "0"
##  [8,] "6"    "2"  "0"  "0"  "2" "1"
##  [9,] "7"    "1"  "3"  "2"  "2" "1"
## [10,] "7"    "2"  "0"  "0"  "2" "1"
## [11,] "8"    "1"  "3"  "2"  "2" "0"
## [12,] "8"    "2"  "0"  "0"  "2" "1"
## [13,] "9"    "1"  "3"  "2"  "2" "1"
## [14,] "9"    "2"  "0"  "0"  "2" "0"
## [15,] "10"   "2"  "0"  "0"  "2" "1"
## [16,] "11"   "1"  "3"  "2"  "2" "1"
## [17,] "11"   "2"  "0"  "0"  "2" "1"
## [18,] "12"   "2"  "0"  "0"  "2" "1"
## [19,] "13"   "2"  "0"  "0"  "2" "0"
## [20,] "14"   "1"  "3"  "2"  "2" "1"

The output can be saved to an object:

##      id.fam id.c id.f id.m sex cc 
## [1,] "1"    "1"  "3"  "2"  "2" "1"
## [2,] "1"    "2"  "0"  "0"  "2" "0"
## [3,] "2"    "2"  "0"  "0"  "2" "0"
## [4,] "3"    "2"  "0"  "0"  "2" "1"
## [5,] "4"    "2"  "0"  "0"  "2" "0"
## [6,] "5"    "1"  "3"  "2"  "2" "1"

With the functions nindiv and nfam, you can get the number of individuals or number of families in your data:

## [1] 1659
## [1] 550

Note that the nfam function assumes that your dataset is from a triad study, i.e., includes at least child and one parent information.

Displaying and extracting genotype information

The function nsnps will tell us how many markers/SNPs there is in the data. Be careful, as a default it assumes that the data is triad data (i.e., mother, father and child were genotyped), so if your data is from a case-control study, be sure to specify that with argument design = "cc".

## [1] 429

To extract and/or show genotypes for specific individuals or markers, use the showGen function:

## opening ff C:/Users/hkgjess/AppData/Local/Temp/RtmpecqmZK/ff3c905b8c7c23.ff
## ff (open) byte length=30 (30) dim=c(5,6) dimorder=c(1,2) levels: G <NA> C T A
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    T    T    A    T    T    T
## [2,]    T    T    T    T    T    T
## [3,]    T    T    A    T    T    T
## [4,]    A    T    A    T    T    T
## [5,]    A    T    T    T    T    T
## ff (open) byte length=2010 (2010) dim=c(201,10) dimorder=c(1,2) levels: G <NA> C T A
##        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]      G    G    T    A    T    T    G    G    G     G
## [2,]      G    G    A    A    T    T    G    G    G     G
## [3,]      G    G    T    A    T    T    G    G    G     G
## [4,]      G    G    A    A    A    T    G    G    C     G
## [5,]      G    G    T    A    A    T    G    G    C     G
## [6,]      G    G    A    A    A    T    G    G    C     G
## [7,]      G    G    A    A    A    T    G    G    G     G
## [8,]      G    G    A    A    A    T    G    G    G     G
## :         :    :    :    :    :    :    :    :    :     :
## [194,]    G    G    A    A    A    A    G    G    C     G
## [195,]    G    G    A    A    A    A    G    G    G     G
## [196,]    G    G    T    A    A    T    G    G    C     C
## [197,]    G    G    T    A    A    T    G    G    G     G
## [198,]    G    G    T    A    A    T    G    G    G     G
## [199,]    G    G    A    A    T    T    G    G    G     G
## [200,]    C    G    A    A    T    T    G    G    G     G
## [201,]    C    G    A    A    A    T    G    G    C     G

As above, this output can be saved to an object:

## ff (open) byte length=1206 (1206) dim=c(201,6) dimorder=c(1,2) levels: G <NA> C T A
##        [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]     A    T    T    T    A    T 
## [2,]     A    T    T    T    T    T 
## [3,]     A    A    T    T    A    T 
## [4,]     T    T    T    T    A    T 
## [5,]     A    T    T    T    A    T 
## [6,]     T    T    T    T    A    T 
## [7,]     A    T    T    T    T    T 
## [8,]     A    T    T    T    T    T 
## :         :    :    :    :    :    :
## [194,]   A    T    T    T    T    T 
## [195,]   A    A    T    T    T    T 
## [196,]   T    T    T    T    T    T 
## [197,]   A    T    T    T    NA   NA
## [198,]   A    T    T    T    T    T 
## [199,]   A    A    T    T    A    T 
## [200,]   A    A    T    T    T    T 
## [201,]   A    A    T    T    T    T

Note that these functions work only with objects resulting from genDataRead, and not genDataPreprocess, as the preprocessing disturbs the coding of the data, thus making the output not easy to understand.

Preparing your data

After loading the data, it is necessary to pre-process it to the internal format used by haplin. This is done by evoking the command:

my.prepared.gen.data <- genDataPreprocess( data.in = my.gen.data, map.file =
  "my_gen_data.map", design = "triad", file.out = "my_prepared_gen_data",
  dir.out = "." )

CAUTION: This action can be very time-consuming for large datasets (e.g., estimated time for ca.45,000 SNPs and 1,600 individuals, a PED file of ca.700MB, is ca.6 minutes on a 7-core CPU). However, this needs to be done only once and the output, stored in small files, can be used for the subsequent analysis repeatedly. (See also section Choosing a subset of data

This will also create .RData and .ffData files, which take much less space than the input PED files. Be careful not to delete these files, as they can be re-used by simply loading into R (the `genDataLoad function) right before Haplin analysis.

NOTE: The information on the my.prepared.gen.data object can be displayed by simply writing the name of the object.

Choosing a subset of data

If you know that you want to focus your analysis on a certain region of the entire SNP set, or perhaps you’re impatient and want to check out Haplin without waiting a long time for the preprocessing to finish, you can easily choose a subset of the data to pre-process and analyze. This can be done with the command:

gen.data.subset <- genDataGetPart( data.in = my.gen.data, markers = c( 3:15,22 ),
  design = "triad", file.out = "my_gen_data_subset", dir.out = "." )

This function allows you to specify the subset in various ways:

If you give a combination of these parameters, the result will be the intersection of the subsets defined by each of the parameters alone. The subset is then available in the gen.data.subset object and written to file.out file(s). These can be loaded and re-used multiple times.

Re-using the data

IMPORTANT: Remember that each time you start a new R session, you need to load the data into the memory with the command:

my.prepared.gen.data <- genDataLoad( filename = "my_prepared_gen_data",
  dir.in = "." )

This takes much less time than re-reading and running the data preparations! The output of the genDataPreprocess function (or genDataLoad) can then be used to run the analysis.