1 Enrichment tests using OBOCollection objects

1.1 Import an ontology as an OBOCollection object

The OBO format is a widely used format to store ontologies. Files in this format can be read into R using the getOBOCollection() function from the GSEABase package, which returns an object of the class OBOCollection, defined within GSEABase.

Part of the updates to the Category and GOstats packages, described here, have the purpose of enabling functional enrichment analyses performed with these packages using ontologies stored as OBOCollection objects.

This facilitates, for instance, enrichment analyses with the widely adopted Human Phenotype Ontology (HPO). Importing this ontology into R can be done as follows.

library(GSEABase)

if (!file.exists("oboHPO.rds")) {
  download.file("https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/hp.obo", "hp.obo", method="curl")
  oboHPO <- getOBOCollection("hp.obo")
  saveRDS(oboHPO, file="oboHPO.rds")
} else
  oboHPO <- readRDS("oboHPO.rds")
oboHPO
collectionType: OBO 
  ids: HP:0000001, HP:0000002, ..., HP:3000079 (12786 total)
  evidenceCode: EXP IDA IPI IMP IGI IEP ISS ISO ISA ISM IGC IBA IBD IKR IRD RCA TAS NAS IC ND IEA 
  ontology: NA 
  subsets: hposlim_core (Core clinical terminology),
    secondary_consequence (Consequence of a disorder in another
    organ system.) (2 total)

1.2 Import ontology annotations on genes in a GeneSetCollection object

We download HPO annotations on genes and store them into a GeneSetCollection object. This class is also defined in the GSEABase package. HPO annotations can be downloaded from the file ALL_SOURCES_ALL_FREQUENCIES_phenotype_to_genes.txt located in this URL.

if (!file.exists("gscHPO.rds")) {
  download.file("http://compbio.charite.de/jenkins/job/hpo.annotations.monthly/lastStableBuild/artifact/annotation/ALL_SOURCES_ALL_FREQUENCIES_phenotype_to_genes.txt", "ALL_SOURCES_ALL_FREQUENCIES_phenotype_to_genes.txt", method="curl")
  hpo_p2g <- read.table("ALL_SOURCES_ALL_FREQUENCIES_phenotype_to_genes.txt",
                        sep="\t", quote="", colClasses="character")
  hpo_terms <- unique(as.matrix(hpo_p2g[, 1:2]))
  hpo_terms <- do.call("names<-", list(hpo_terms[, 2], hpo_terms[, 1]))
  genesByHPO <- split(hpo_p2g[, 3], hpo_p2g[, 1])
  gscHPO <- do.call("GeneSetCollection",
                    mapply(function(geneSetID, geneIDs, hpoterms) {
                             GeneSet(EntrezIdentifier("org.Hs.eg.db"),
                                     geneIds=geneIDs,
                                     setName=geneSetID,
                                     shortDescription=hpoterms[geneSetID])
                           }, names(genesByHPO),
                           genesByHPO,
                           MoreArgs=list(hpoterms=hpo_terms),
                           USE.NAMES=FALSE))
  saveRDS(gscHPO, file="gscHPO.rds")
} else
  gscHPO <- readRDS("gscHPO.rds")
gscHPO
GeneSetCollection
  names: HP:0000002, HP:0000003, ..., HP:3000077 (7606 total)
  unique identifiers: 8192, 8195, ..., 735 (3498 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: NullCollection (1 total)

1.3 Enrichment tests using OBOCollection and GeneSetCollection objects

The proposed update to the Category and GOstats packages can be used to perform a HPO enrichment analysis using OBOCollection and GeneSetCollection objects, in the same way the current version of the packages allow us to do it with the GO ontology.

We illustrate this functionality with a toy example using the entire human gene set, defined in the organism-level package org.Hs.eg.db, as the gene universe, and genes annotated to the HPO term Meningitis.

HPOmeningitis <- gscHPO[[grep("Meningitis", sapply(gscHPO, description))]]
HPOmeningitis
setName: HP:0001287 
geneIds: 1536, 3586, ..., 1535 (total: 61)
geneIdType: EntrezId (org.Hs.eg.db)
collectionType: Null 
details: use 'details(object)'
description(HPOmeningitis)
  HP:0001287 
"Meningitis" 
head(geneIds(HPOmeningitis))
[1] "1536" "3586" "3718" "3592" "5896" "5897"
length(geneIds(HPOmeningitis))
[1] 61

Simulate a set of 100 differentially expressed genes by selecting 5 genes uniformly at random among the previous ones annotated to the Meningitis HPO term and 95 more also uniformly at random among the rest of human genes. Consider an initial gene universe of 10,000 genes among which 30 are annotated to the Meningitis HPO term. This should make the Meningitis HPO term significantly enriched with differentially expressed genes, as we shall se below.

library(org.Hs.eg.db)

N <- 10000 ## total number of genes
m <- 30 ## total number of genes in enriched gene set
k <- 5 ## number of DE genes enriching the gene set
n <- 100 ## total number of DE genes
set.seed(123)
k.genes <- sample(geneIds(HPOmeningitis), size=k, replace=FALSE)
m.notk.genes <- setdiff(geneIds(HPOmeningitis), k.genes)[1:(m-k)]
N.genes <- c(sample(setdiff(keys(org.Hs.eg.db), geneIds(HPOmeningitis)),
                    size=N-m, replace=FALSE),
             m.notk.genes, k.genes)
n.notk.genes <- sample(setdiff(N.genes, m.notk.genes), size=n-k)
DEgenes <- c(n.notk.genes, k.genes)
length(DEgenes)
[1] 100

Now perform an HPO enrichment analysis, using the updated version of the Category and GOstats packages.

library(Category)
library(GOstats)

OBOparams <- new("OBOHyperGParams", geneIds=DEgenes,
                 universeGeneIds=N.genes, pvalueCutoff=0.05,
                 conditional=FALSE, testDirection="over",
                 datPkg=OBOCollectionDatPkg(oboHPO, gscHPO))
OBOparams
A OBOHyperGParams instance
  category: OBO 
annotation: OBO 
hgOverHPO <- hyperGTest(OBOparams)
hgOverHPO
Gene to OBO  test for over-representation 
4630 OBO ids tested (147 have p < 0.05)
Selected gene set size: 8 
    Gene universe size: 580 
    Annotation package: OBO 
head(summary(hgOverHPO))
       OBOID       Pvalue OddsRatio  ExpCount Count Size           Term
1 HP:0001287 1.328824e-05  36.46667 0.4137931     5   30     Meningitis
2 HP:0011450 1.577236e-05  35.00000 0.4275862     5   31  CNS infection
3 HP:0002383 2.710561e-04  42.30000 0.1517241     3   11   Encephalitis
4 HP:0002024 1.352031e-03  14.05263 0.5793103     4   42  Malabsorption
5 HP:0000509 2.045368e-03  18.46667 0.2896552     3   21 Conjunctivitis
6 HP:0025337 2.045368e-03  18.46667 0.2896552     3   21        Red eye

2 Enrichment tests conditioning on odds ratio and gene set size

Enrichment tests may lead to a large number of significantly enriched gene sets, often due to redundancy derived from the parent-child relationships in the ontology. Different strategies are applied to focus on the more interesting gene sets, such as discarding gene sets below a certain odds ratio cutoff or those that are considered to be too small or too large.

Another effective way to reduce redundancy is by using condtional hypergeometric tests, setting conditional=TRUE in the parameters object. This removes the enriching genes from a gene set that meets the p-value cutoff, from the parent gene sets before testing them. However, when using this strategy a parent gene set may not be significant anymore after the enriching genes from the child gene set are remove, but the child gene set itself may also be discarded afterwards when it does not meet a cutoff on the odds ratio or the gene set size.

For this reason it makes sense to incorporate the consideration of odds ratio and gene set size cutoffs during the conditoning tests, such that only when the gene set meets all cutoffs on p-value, odds ratio and size, is consdered to be enriched and its enriching genes removed from the parent gene sets.

To illustrate this feature consider the following simulation in which the HPO term “Meningitis” is somewhat less enriched as in the previous simulation and perform conditional hypergeometric tests.

N <- 10000
m <- 60
k <- 15
n <- 1470
set.seed(123)
k.genes <- sample(geneIds(HPOmeningitis), size=k, replace=FALSE)
m.notk.genes <- setdiff(geneIds(HPOmeningitis), k.genes)[1:(m-k)]
N.genes <- c(sample(setdiff(keys(org.Hs.eg.db), geneIds(HPOmeningitis)),
                    size=N-m, replace=FALSE),
             m.notk.genes, k.genes)
n.notk.genes <- sample(setdiff(N.genes, m.notk.genes), size=n-k)
DEgenes <- c(n.notk.genes, k.genes)
length(DEgenes)
[1] 1470
OBOparams <- new("OBOHyperGParams", geneIds=DEgenes,
                 universeGeneIds=N.genes, pvalueCutoff=0.05,
                 conditional=TRUE, testDirection="over",
                 datPkg=OBOCollectionDatPkg(oboHPO, gscHPO))
Warning in makeValidParams(.Object): removing duplicate IDs in geneIds
OBOparams
A OBOHyperGParams instance
  category: OBO 
annotation: OBO 
hgOverHPO <- hyperGTest(OBOparams)
hgOverHPO
Gene to OBO Conditional test for over-representation 
4680 OBO ids tested (77 have p < 0.05)
Selected gene set size: 99 
    Gene universe size: 603 
    Annotation package: OBO 
summary(hgOverHPO)[summary(hgOverHPO)$OBOID %in% "HP:0001287", ]
        OBOID     Pvalue OddsRatio ExpCount Count Size       Term
76 HP:0001287 0.04867252  1.821429 9.850746    15   60 Meningitis

Find out the HPO terms that are connected with Meningitis.

library(RBGL)

g <- as(oboHPO, "graphNEL")
paths <- dijkstra.sp(g=g, start="HP:0001287")
vtc <- names(paths$distances[is.finite(paths$distances)])
vtc <- intersect(vtc, names(gscHPO))
vtc
[1] "HP:0000707" "HP:0001287" "HP:0002011" "HP:0002715" "HP:0010978"
[6] "HP:0011450" "HP:0012639" "HP:0012647" "HP:0012649"
sapply(gscHPO, description)[vtc]
                                               HP:0000707 
                      "Abnormality of the nervous system" 
                                               HP:0001287 
                                             "Meningitis" 
                                               HP:0002011 
"Morphological abnormality of the central nervous system" 
                                               HP:0002715 
                       "Abnormality of the immune system" 
                                               HP:0010978 
                "Abnormality of immune system physiology" 
                                               HP:0011450 
                                          "CNS infection" 
                                               HP:0012639 
               "Abnormality of nervous system morphology" 
                                               HP:0012647 
                         "Abnormal inflammatory response" 
                                               HP:0012649 
                        "Increased inflammatory response" 

Plot the subgraph that involves these HPO terms.

The subgraph involving the HPO term Meningitis

Figure 1: The subgraph involving the HPO term Meningitis

Examine the p-values and odds ratios of these HPO terms.

pvalues(hgOverHPO)[vtc]
HP:0000707 HP:0001287 HP:0002011 HP:0002715 HP:0010978 HP:0011450 HP:0012639 
0.40847744 0.04867252 0.23204461 0.18760545 0.14362320 1.00000000 0.28614132 
HP:0012647 HP:0012649 
0.11815850 0.48107980 
oddsRatios(hgOverHPO)[vtc]
HP:0000707 HP:0001287 HP:0002011 HP:0002715 HP:0010978 HP:0011450 HP:0012639 
  1.101449   1.821429   1.206600   1.250663   1.315456   0.000000   1.165465 
HP:0012647 HP:0012649 
  1.391355   1.067204 

The Meningitis HPOterm (HP:0001287) has a p-value < 0.05 and odds ratio of about 1.8. Now, run again the conditional hypergeometric test, this time using also cutoffs on the odds ratio, minimum and maximum gene set size.

OBOparams <- new("OBOHyperGParams", geneIds=DEgenes,
                 universeGeneIds=N.genes, pvalueCutoff=0.05,
                 orCutoff=2, minSizeCutoff=3, maxSizeCutoff=300,
                 conditional=TRUE, testDirection="over",
                 datPkg=OBOCollectionDatPkg(oboHPO, gscHPO))
Warning in makeValidParams(.Object): removing duplicate IDs in geneIds
OBOparams
A OBOHyperGParams instance
  category: OBO 
annotation: OBO 
hgOverHPO2 <- hyperGTest(OBOparams)
hgOverHPO2
Gene to OBO Conditional test for over-representation 
4680 OBO ids tested (84 have p < 0.05)
Selected gene set size: 99 
    Gene universe size: 603 
    Annotation package: OBO 

Now, examine again p-values and odds ratios and notice how did they improved for the parents of the Meninigitis HPO term, (HP:0012649 and HP:0011450) because we set a cutoff of odds ratio > 2, which is not met by Meningitis, and therefore, despite its p-value < 0.05, its genes are not removed from the parent HPO terms because it does not meet the odds ratio cutoff. In an ideal example, at least one of these two parent HPO terms would become significant, but we can already see the influence of this strategy.

pvalues(hgOverHPO2)[vtc]
HP:0000707 HP:0001287 HP:0002011 HP:0002715 HP:0010978 HP:0011450 HP:0012639 
0.40847744 0.04867252 0.23204461 0.18760545 0.14362320 0.06337105 0.28614132 
HP:0012647 HP:0012649 
0.11815850 0.11815850 
oddsRatios(hgOverHPO2)[vtc]
HP:0000707 HP:0001287 HP:0002011 HP:0002715 HP:0010978 HP:0011450 HP:0012639 
  1.101449   1.821429   1.206600   1.250663   1.315456   1.736322   1.165465 
HP:0012647 HP:0012649 
  1.391355   1.391355 

3 List of changes made to the Cateogry and GOstats packages

Here I list the changes made to the Category and GOstats packages:

  • modified Category/NAMESPACE
    import class OBOCollection from GSEABase.

  • modified Category/R/AllClasses.R
    add class OBOCollectionDatPkg, inherited from DatPkg, differently to AffyDatPkg or YeastDatPkg it needs a slot for a graph object (oboGraph). This may be related to a comment written at the end of Category/R/AllClasses by ML (?) where it is raised whether we need a class for every ontology. In principle this OBO thing would cover all that.

  • modified Category/R/DatPkg-accessors.R
    added constructor function ‘OBOCollectionDatPkg’ to build ‘OBOCollectionDatPkg’ objects. probably this could be merged somewhow with ‘GeneSetCollectionDatPkg’. in fact ‘GeneSetCollectionDatPkg’ is not even expoted.

  • modified Category/man/DatPkg-class.Rd
    added documentation on the ‘OBOCollectionDatPkg’ class

  • modified Category/NAMESPACE
    add export of the ‘OBOCollectionDatPkg’ class.
    add export of ‘OBOCollectionDatPkg’ constructor.

  • modified Category/R/hyperGParams-accessors.R
    add ‘conditional’ getter and setter for ‘OBOHyperGParams’.

  • modified Category/R/universeBuilder-methods.R
    add ‘universeBuilder’ method.

  • modified Category/R/categoryToEntrezBuilder-methods.R
    add ‘categoryToEntrezBuilder’ method.

  • modified GOstats/R/AllClasses.R
    add ‘OBOHyperGResult’ class.

  • added GOstats/R/OBOHyperGResult-accessors.R
    add accessors for OBOHyperGResult: pvalues(), oddsRatios(), expectedCounts(), geneIdUniverse().

  • modified GOstats/R/hyperGtable.R
    add summary() method for ‘OBOHyperGResult’.

  • modified GOstats/R/hyperGTest-methods.R
    added function getGoToChildOBOGraph().

  • modified Category/man/categoryToEntrezBuilder.Rd
    added documentation.

  • modified Category/man/universeBuilder.Rd
    added documentation.

  • added Category/man/OBOHyperGParams-class.Rd
    added documentation.

  • modified GOstats/R/shortestPath.R
    removed importing functionality from the package ‘SparseM’ via ‘require()’.

  • modified GOstats/NAMEPSPACE
    imported ‘as.matrix.csr()’ from the package ‘SparseM’.

  • modified GOstats/DESCRIPTION
    moved ‘sparseM’ from ‘Suggests’ to ‘Imports’ in DESCRIPTION.

  • modified GOHyperGResult-accessors.R
    removed importing functionality from the package ‘Rgraphviz’ via ‘require()’.

  • modified GOstats/NAMESPACE
    imported ‘makeNodeAttrs()’ and ‘plot’ from the package ‘Rgraphviz’.

  • modified GOstats/DESCRIPTION
    moved ‘Rgraphviz’ from ‘Suggests’ to ‘Imports’ in DESCRIPTION.

  • modified GOstats/R/hyperGTest-methods.R
    updated ‘removeSigKidGenes()’ function to take parameters on minimum odds ratio, minimum and maximum gene set size, and use them when to consider “significant” terms and remove genes in in those terms when they are children from a parent term.
    added a method ‘hyperGTest’ for signature ‘OBOHyperGParams’ and updated the private function ‘.hyperGTestInternal()’ to perform the calculations with ‘OBOHyperGParams’ and taking into account the odds ratio and minimum and maximum size of gene sets.

4 Session information

sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin16.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /opt/R/R-3.4.0_BioCdevel/lib/R/lib/libRblas.dylib
LAPACK: /opt/R/R-3.4.0_BioCdevel/lib/R/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      stats4    parallel  methods   stats     graphics  grDevices
 [8] utils     datasets  base     

other attached packages:
 [1] Rgraphviz_2.21.0     RBGL_1.53.0          GOstats_2.43.1      
 [4] Category_2.43.2      Matrix_1.2-10        org.Hs.eg.db_3.4.1  
 [7] GSEABase_1.39.0      graph_1.55.0         annotate_1.55.0     
[10] XML_3.98-1.9         AnnotationDbi_1.39.2 IRanges_2.11.12     
[13] S4Vectors_0.15.5     Biobase_2.37.2       BiocGenerics_0.23.0 
[16] knitr_1.16           BiocStyle_2.5.8     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12           highr_0.6              compiler_3.4.0        
 [4] bitops_1.0-6           tools_3.4.0            digest_0.6.12         
 [7] bit_1.1-12             lattice_0.20-35        RSQLite_2.0           
[10] evaluate_0.10.1        memoise_1.1.0          tibble_1.3.3          
[13] pkgconfig_2.0.1        rlang_0.1.1            DBI_0.7               
[16] yaml_2.1.14            SparseM_1.77           genefilter_1.59.0     
[19] stringr_1.2.0          rprojroot_1.2          bit64_0.9-7           
[22] survival_2.41-3        rmarkdown_1.6          bookdown_0.4          
[25] GO.db_3.4.1            blob_1.1.0             magrittr_1.5          
[28] splines_3.4.0          backports_1.1.0        codetools_0.2-15      
[31] htmltools_0.3.6        AnnotationForge_1.19.4 xtable_1.8-2          
[34] stringi_1.1.5          RCurl_1.95-4.8