UNGeneAnno: Complete example

Richard Thompson

2016-09-21

UNGeneAnno is written to enable the rapid collation of gene details from publicly available databases, initially those being the NCBI gene and Uniprot databases.

Further to the original aim, the package also includes the getPublicationList function which will returns a vector of objects detailing the results of a journal search of the NCBI PubMed database.

Nota Bene: The package was originally written to collate gene information from Uniprot and NIH/NCBI databases, thus alleviating repetitive searches. The aim was both to speed up “accessing” this data and to minimise the number of database calls, network load and server traffic. This vignette outlines how this is acheived.

Initial Input from File

A typical workflow begins with a matrix, wherein the first column represents a group identifier, numeric or character, and the second, gene names. The gene names, may include descriptive elements

  f <- matrix(c("1","BRAF.exp","1","BRCA2.mut","2","BRAF.cnv","2","AURKB.mut","2","PTEN.exp")
              ,ncol=2,byrow=TRUE)
  f
##      [,1] [,2]       
## [1,] "1"  "BRAF.exp" 
## [2,] "1"  "BRCA2.mut"
## [3,] "2"  "BRAF.cnv" 
## [4,] "2"  "AURKB.mut"
## [5,] "2"  "PTEN.exp"

The getuniquegenelist function parses second column of the input matrix into a vector of character strings containing only the initial alphanumeric characters, so the above is treat as:

1   BRAF
1   BRCA2
2   BRAF
2   AURKB
2   PTEN

once the matrix is passed to getuniquegenelist, A geneanno object is populated with unique lists of both the group identifiers and gene names:

   geneanno <- getUniqueGeneList(geneanno(),f)
   geneanno@genelist
## [1] "BRAF"  "BRCA2" "AURKB" "PTEN"
   geneanno@groupnos
## [1] "1" "2"

Nota Bene The getuniquegenelist function assumes that all supplied gene names are alphanumeric and anything following is descriptive; therefore the function truncates contents of the second column to the initial alphanumeric characters only.

Getting Gene Objects

Once these lists have been populated, the summary information can be sourced from the databases; returning a vector of gene objects containing the downloaded details for each gene.

genesummaries <- getGeneSummary(geneanno)

Once the details have been downloaded, the gene object is saved to a subdirectory which defaults to “genes” and is created in the working directory; However the main directory and subdirectory can be amended using geneanno@fileroot <- "/path/to/directory" and geneanno@genefilestem <- "directory_name". Prior to downloading, the method will check the subdirectory for a saved gene object younger than seven days and preferentially use any saved objects it finds.

Producing Output Files

The final task is to produce output files for each group identifier listing details of the genes related to it in the original input matrix.

groupgenelist <- getGroupGeneList(geneanno,matrix)
produceOutputFiles(geneanno, groupgenelist, genesummaries)

Optionally, the original matrix can be used to search the NCBI PubMed database and the returned article information incorporated into the output files. Theauthor would like to advise against using the searchPublications function unless the group identifiers are meaningful, for example if they are drug names. Further details can be found in the “UNGeneAnno: PubMed Journal Query example” vignette.

publicationmatrix <- searchPublications(matrix)
produceOutputFiles(geneanno, groupgenelist, genesummaries, publicationmatrix)

Output files are saved in a subdirectory, which defaults to “gene_annotations” of the geneanno@fileroot unless an alternative directory has been provided by geneanno@outputstem <- "directory_name". Currently, the files are created as plain text files, named as the group identifiers, containing the appropriate details stored in the gene objects.