Gene set enrichment analysis
Gene set enrichment analysis (GSEA) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with different phenotypes. The method uses statistical approaches to identify significantly enriched or depleted groups of genes. Transcriptomics and proteomics experiments often identify thousands of genes, which serve as input for the analysis.
Researchers performing high-throughput experiments that yield sets of genes often want to retrieve a functional profile of that gene set, in order to better understand the underlying biological processes. This can be done by comparing the input gene set to each of the bins in the gene ontology – a statistical test can be performed for each bin to see if it is enriched for the input genes.
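The per-bin test described above can be sketched as a one-sided hypergeometric (over-representation) test. This is a minimal illustration under stated assumptions, not the procedure of any particular tool: the function names, the toy gene identifiers, and the choice of a Bonferroni correction are all assumptions made for the example.

```python
from math import comb


def hypergeom_upper_tail(n_universe, n_in_term, n_input, n_overlap):
    """P(X >= n_overlap) when n_input genes are drawn without replacement
    from a universe of n_universe genes, n_in_term of which carry the term."""
    max_overlap = min(n_in_term, n_input)
    total = comb(n_universe, n_input)
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_input - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def enrich(input_genes, term_to_genes, universe):
    """Test every term (bin) for over-representation among the input genes.
    Returns (term, overlap size, p-value, Bonferroni-adjusted p-value),
    sorted by p-value."""
    universe = set(universe)
    input_genes = set(input_genes) & universe
    results = []
    for term, members in term_to_genes.items():
        members = set(members) & universe
        overlap = input_genes & members
        p = hypergeom_upper_tail(
            len(universe), len(members), len(input_genes), len(overlap)
        )
        # Bonferroni: multiply by the number of terms tested, cap at 1.
        p_adj = min(1.0, p * len(term_to_genes))
        results.append((term, len(overlap), p, p_adj))
    return sorted(results, key=lambda row: row[2])


# Toy data (hypothetical identifiers, not real annotations).
universe = [f"g{i}" for i in range(20)]
term_to_genes = {
    "term_A": ["g0", "g1", "g2", "g3", "g4"],
    "term_B": ["g10", "g11", "g12", "g13", "g14"],
}
input_genes = ["g0", "g1", "g2", "g3", "g15"]

results = enrich(input_genes, term_to_genes, universe)
for term, k, p, p_adj in results:
    print(f"{term}: overlap={k} p={p:.4g} bonferroni={p_adj:.4g}")
```

The choice of background universe strongly affects the result; here it is simply the set of all genes in the toy example.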
Background
After the completion of the Human Genome Project, the problem of how to interpret and analyze it remained. To seek out genes associated with diseases, DNA microarrays were used to measure the amount of gene expression in different cells. Microarray experiments covering thousands of different genes were carried out, and the results from two different cell categories, e.g. normal cells versus cancerous cells, were compared. However, this method of comparison is not sensitive enough to detect the subtle differences between the expression of individual genes, because diseases typically involve entire groups of genes. Multiple genes are linked to a single biological pathway, and so it is the additive change in expression within gene sets that leads to the difference in phenotypic expression. Gene Set Enrichment Analysis was developed to focus on the changes of expression in groups of a priori defined gene sets. By doing so, this method resolves the problem of the undetectable, small changes in the expression of single genes.
Methods
Gene set enrichment analysis uses a priori gene sets that have been grouped together by their involvement in the same biological pathway, or by proximal location on a chromosome. A database of these predefined sets can be found at the Molecular Signatures Database (MSigDB). In GSEA, DNA microarrays, or now RNA-Seq, are still performed and compared between two cell categories, but instead of focusing on individual genes in a long list, the focus is put on a gene set. Researchers analyze whether the majority of genes in the set fall in the extremes of this list: the top and bottom of the list correspond to the largest differences in expression between the two cell types. If the gene set falls at either the top or bottom, it is thought to be related to the phenotypic differences.
In the method that is typically referred to as standard GSEA, there are three steps involved in the analytical process. The general steps are summarized below:
- Calculate the enrichment score (ES), which represents the degree to which the genes in the set are over-represented at either the top or bottom of the list. This score is a Kolmogorov–Smirnov-like statistic.
- Estimate the statistical significance of the ES. This calculation is done by a phenotype-based permutation test in order to produce a null distribution for the ES. The P value is determined by comparison to the null distribution.
  - Calculating significance this way tests for the dependence of the gene set on the diagnostic/phenotypic labels.
- Adjust for multiple hypothesis testing when a large number of gene sets are being analyzed at one time. The enrichment scores for each set are normalized and a false discovery rate is calculated.
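The enrichment score of a gene set $S$ is the maximum deviation from zero of the running sum $P_{\text{hit}}(S,i) - P_{\text{miss}}(S,i)$ taken over positions $i$ in the ranked list, with

$$P_{\text{hit}}(S,i) = \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^p}{N_R}, \qquad N_R = \sum_{g_j \in S} |r_j|^p,$$

$$P_{\text{miss}}(S,i) = \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H},$$

where $r_j$ is the ranking score of gene $g_j$ (the value by which the genes were ordered), $p$ is the power, usually set to 1, $N$ is the total number of genes in the list, and $N_H$ is the number of genes in $S$.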
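The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not the reference GSEA implementation. The ranking metric (a plain difference of class means), the toy expression data, all function names, and the add-one convention in the P value are assumptions made for the example. The null distribution is built by shuffling phenotype labels, and the P value compares the observed ES only with null scores of the same sign.

```python
import random
from statistics import mean


def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Maximum deviation from zero of the running sum P_hit - P_miss.
    ranked_genes: genes ordered by decreasing score; scores: gene -> score."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, genes")
    norm = sum(abs(scores[g]) ** p for g, h in zip(ranked_genes, hits) if h)
    running, best = 0.0, 0.0
    for g, h in zip(ranked_genes, hits):
        if h:
            # Fall back to equal steps if every hit has a score of zero.
            running += abs(scores[g]) ** p / norm if norm > 0 else 1.0 / n_hits
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def rank_genes(expr, labels):
    """Assumed ranking metric: mean(class 1) - mean(class 0) for each gene."""
    scores = {}
    for gene, values in expr.items():
        class1 = [v for v, lab in zip(values, labels) if lab == 1]
        class0 = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = mean(class1) - mean(class0)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores


def permutation_p_value(expr, labels, gene_set, n_perm=1000, seed=0):
    """Phenotype-label permutation test for the ES of one gene set."""
    rng = random.Random(seed)
    ranked, scores = rank_genes(expr, labels)
    observed = enrichment_score(ranked, scores, gene_set)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between samples and phenotype
        perm_ranked, perm_scores = rank_genes(expr, shuffled)
        null.append(enrichment_score(perm_ranked, perm_scores, gene_set))
    same_sign = [x for x in null if (x >= 0) == (observed >= 0)]
    extreme = sum(abs(x) >= abs(observed) for x in same_sign)
    # Add-one convention so the estimate is never exactly zero.
    return observed, (extreme + 1) / (len(same_sign) + 1)


# Toy data: 10 genes x 6 samples; genes g0..g2 are raised in class 1.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    f"g{i}": [((i * 7 + j * 3) % 5) * 0.1 + (2.0 if i < 3 and j < 3 else 0.0)
              for j in range(6)]
    for i in range(10)
}
gene_set = {"g0", "g1", "g2"}

es, p_value = permutation_p_value(expr, labels, gene_set, n_perm=200, seed=1)
print(f"ES = {es:.3f}, permutation P = {p_value:.3f}")
```

With only six samples there are few distinct label permutations, so the P value in this toy case is coarse; real analyses use many more samples. The third step (normalizing the ES across sets and estimating a false discovery rate) is not shown.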
Limitations and proposed alternatives
SEA
When GSEA was first proposed in 2003, some immediate concerns were raised regarding its methodology. These criticisms led to the use of the correlation-weighted Kolmogorov–Smirnov test, the normalized ES, and the false discovery rate calculation, all of which now define standard GSEA. However, GSEA has since also been criticized on the grounds that its null distribution is superfluous and too difficult to be worth calculating, and that its Kolmogorov–Smirnov-like statistic is not as sensitive as the original. As an alternative, a method known as Simpler Enrichment Analysis (SEA) was proposed. This method assumes gene independence and uses a simpler, t-test-based calculation. However, it is thought that these assumptions are in fact too simplifying, and that gene correlation cannot be disregarded.
SGSE
One other limitation of Gene Set Enrichment Analysis is that the results are very dependent on the algorithm that clusters the genes, and the number of clusters being tested. Spectral Gene Set Enrichment (SGSE) is a proposed, unsupervised test. The method's founders claim that it is a better way to find associations between MSigDB gene sets and microarray data. The general steps include:
1. Calculating the association between principal components and gene sets.
2. Using the weighted Z-method to calculate the association between the gene sets and the spectral structure of the data.
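The second step relies on the weighted Z-method, a general technique for combining P values. The sketch below shows only that combination rule and is not the SGSE implementation. The per-component P values and the weights (taken here as the share of variance explained by each principal component) are hypothetical inputs chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z(p_values, weights):
    """Combine one-sided P values with the weighted Z-method.
    Each P value must lie strictly between 0 and 1.
    Returns (combined Z, combined one-sided P value)."""
    if not p_values or len(p_values) != len(weights):
        raise ValueError("need one weight per P value")
    std_normal = NormalDist()
    # Convert each P value to a Z score (small P -> large positive Z).
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    z = sum(w * zi for w, zi in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    return z, 1.0 - std_normal.cdf(z)


# Hypothetical P values for one gene set against three principal components,
# weighted by an assumed share of variance explained by each component.
pc_p_values = [0.01, 0.20, 0.60]
pc_weights = [0.5, 0.3, 0.2]

z_combined, p_combined = weighted_z(pc_p_values, pc_weights)
print(f"combined Z = {z_combined:.3f}, combined P = {p_combined:.4f}")
```

Because the weights enter both the numerator and the normalizing term, only their relative sizes matter; components with larger weights dominate the combined statistic.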
Tools
GSEA uses complicated statistics, so it requires a computer program to run the calculations. GSEA has become standard practice, and there are many websites and downloadable programs that will provide the data sets and run the analysis.
MOET
The Multi-Ontology Enrichment Tool (MOET) is a web-based ontology analysis tool that provides functionality for multiple ontologies, including Disease, GO, Pathway, Phenotype, and Chemical entities for multiple species, including rat, mouse, human, bonobo, squirrel, dog, pig, chinchilla, naked mole-rat and vervet. It outputs a downloadable graph and a list of statistically overrepresented terms in the user's list of genes using the hypergeometric distribution. MOET also displays the corresponding Bonferroni correction and odds ratio on the results page. It is simple to use, and results are provided with a few clicks in seconds; no software installations or programming skills are required. In addition, MOET is updated weekly, providing the user with the most recent data for analyses.
NASQAR
NASQAR is an open-source, web-based platform for high-throughput sequencing data analysis and visualization. GSEA can be run using the R-based clusterProfiler package. NASQAR currently supports GO term and KEGG pathway enrichment for all organisms supported by an OrgDb annotation database.
PlantRegMap
PlantRegMap provides Gene Ontology annotation for 165 plant species, together with GO enrichment analysis.
MSigDB
The Molecular Signatures Database hosts an extensive collection of annotated gene sets that can be used with most GSEA software.
Broad Institute
The Broad Institute website, in cooperation with MSigDB, offers downloadable GSEA software as well as a general tutorial.
WebGestalt
WebGestalt is a web-based gene set analysis toolkit. It supports three well-established and complementary methods for enrichment analysis: Over-Representation Analysis, Gene Set Enrichment Analysis, and Network Topology-based Analysis. Analysis can be performed against 12 organisms and 321,251 functional categories using 354 gene identifiers from various databases and technology platforms.
Enrichr
Enrichr is a gene set enrichment analysis tool for mammalian gene sets. It contains background libraries for transcription regulation, pathways and protein interactions, ontologies including GO and the human and mouse phenotype ontologies, signatures from cells treated with drugs, gene sets associated with human diseases, and expression of genes in different cells and tissues. The background libraries are from over 200 resources and contain over 450,000 annotated gene sets. The tool can be accessed through an API and provides different ways to visualize the results.
GeneSCF
GeneSCF is a real-time functional enrichment tool with support for multiple organisms, designed to overcome the problems associated with using outdated resources and databases. Its advantages include real-time analysis, so that users do not have to depend on enrichment tools being updated; straightforward integration into NGS pipelines for computational biologists; support for multiple organisms; enrichment analysis for multiple gene lists using multiple source databases in a single run; and retrieval or download of complete GO terms, pathways, or functions with their associated genes as a simple table in a plain text file.
DAVID
DAVID is the Database for Annotation, Visualization and Integrated Discovery, a bioinformatics tool that pools together information from most major bioinformatics sources, with the aim of analyzing large gene lists in a high-throughput manner. DAVID goes beyond standard GSEA with additional functions such as switching between gene and protein identifiers on a genome-wide scale. However, the annotations used by DAVID were not updated between October 2016 and December 2021, which can have a considerable impact on the practical interpretation of results. A more recent update was performed in 2021.
Metascape
Metascape is a biologist-oriented gene-list analysis portal. Metascape integrates pathway enrichment analysis, protein complex analysis, and multi-list meta-analysis into one seamless workflow accessible through a significantly simplified user interface. Metascape maintains analysis accuracy by updating its 40 underlying knowledgebases monthly. Metascape presents results using easy-to-interpret graphics, spreadsheets, and publication-quality presentations, and is freely available.
AmiGO 2
The Gene Ontology Consortium has also developed its own online GO term enrichment tool, allowing species-specific enrichment analysis versus the complete database, coarser-grained GO slims, or custom references.