GSEA GenePattern Module Documentation

A GenePattern module for running the GSEA method

GSEA (v20.4.x)

Gene Set Enrichment Analysis

Author: Aravind Subramanian, Pablo Tamayo, David Eby; Broad Institute

Contact:

See the GSEA forum for GSEA questions.

Contact the GenePattern team for GenePattern issues.

GSEA Version: 4.3.x

Description

Evaluates a genomewide expression profile and determines whether a priori defined sets of genes show statistically significant, cumulative changes in gene expression that are correlated with a phenotype. The phenotype may be categorical (e.g., tumor vs. normal) or continuous (e.g., a numerical profile across all samples in the expression dataset).

Summary

Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. It evaluates cumulative changes in the expression of groups of multiple genes defined based on prior biological knowledge. It first ranks all genes in a data set, then calculates an enrichment score for each gene set, which reflects how often members of that gene set occur at the top or bottom of the ranked data set (for example, in expression data, in either the most highly expressed genes or the most underexpressed genes).

Introduction

Microarray experiments profile the expression of tens of thousands of genes over a number of samples that can vary from as few as two to several hundreds. One common approach to analyzing these data is to identify a limited number of the most interesting genes for closer analysis. This usually means identifying genes with the largest changes in their expression values based on a t-test or similar statistic, and then picking a significance cutoff that will trim the list of interesting genes down to a handful of genes for further research.

Gene Set Enrichment Analysis (GSEA) takes an alternative approach to analyzing genomic data: it focuses on cumulative changes in the expression of multiple genes as a group, which shifts the focus from individual genes to groups of genes. By looking at several genes at once, GSEA can identify pathways whose several genes each change a small amount, but in a coordinated way. This approach helps reflect many of the complexities of co-regulation and modular expression.

GSEA therefore takes as input two distinct types of data for its analysis:

The GSEA GenePattern module uses either categorical or continuous phenotype data for its analysis. In the case of a categorical phenotype, a dataset would contain two different classes of samples, such as “tumor” and “normal.” In the case of a continuous phenotype, a dataset would contain a numerical value for each sample. Examples of numerical profiles include the expression level of a specific gene or a measure of cell viability over the course of a time series experiment. The GSEA desktop application, available on the GSEA website, has additional functionalities. For instance, the GSEA desktop application can conduct an enrichment analysis against a ranked list of genes, or analyze the leading-edge subsets within each gene set. Many of these capabilities are also available in separate GP modules (see GSEAPreranked and GSEALeadingEdgeViewer).

If you are using GSEA on RNA-seq data, please read these guidelines.

Algorithm

GSEA first ranks the genes based on a measure of each gene’s differential expression with respect to the two phenotypes (for example, tumor versus normal) or correlation with a continuous phenotype. Then the entire ranked list is used to assess how the genes of each gene set are distributed across the ranked list. To do this, GSEA walks down the ranked list of genes, increasing a running-sum statistic when a gene belongs to the set and decreasing it when the gene does not. A simplified example is shown in the following figure.

The enrichment score (ES) is the maximum deviation from zero encountered during that walk. The ES reflects the degree to which the genes in a gene set are overrepresented at the top or bottom of the entire ranked list of genes. A set that is not enriched will have its genes spread more or less uniformly through the ranked list. An enriched set, on the other hand, will have a larger portion of its genes at one or the other end of the ranked list. The extent of enrichment is captured mathematically as the ES statistic.

Next, GSEA estimates the statistical significance of the ES by a permutation test. To do this, GSEA creates a version of the data set with phenotype labels randomly scrambled, produces the corresponding ranked list, and recomputes the ES of the gene set for this permuted data set. GSEA repeats this many times (1000 is the default) and produces an empirical null distribution of ES scores. Alternatively, permutations may be generated by creating “random” gene sets (genes randomly selected from those in the expression dataset) of equal size to the gene set under analysis.

The nominal p-value estimates the statistical significance of a single gene set’s enrichment score, based on the permutation-generated null distribution. The nominal p-value is the probability under the null distribution of obtaining an ES value that is as strong or stronger than that observed for your experiment under the permutation-generated null distribution.

Typically, GSEA is run with a large number of gene sets. For example, the MSigDB collection and subcollections each contain hundreds to thousands of gene sets. This has implications when comparing enrichment results for the many sets:

The ES must be adjusted to account for differences in the gene set sizes and in correlations between gene sets and the expression data set. The resulting normalized enrichment scores (NES) allow you to compare the analysis results across gene sets.

The nominal p-values need to be corrected to adjust for multiple hypothesis testing. For a large number of sets (rule of thumb: more than 30), we recommend paying attention to the False Discovery Rate (FDR) q-values: consider a set significantly enriched if its NES has an FDR q-value below 0.25.

For more information, see http://www.gsea-msigdb.org/gsea.

Known Issues

File names

Input expression datasets with the character ‘-‘ or spaces in their file names causes GSEA to error.

CLS Files

The GSEA GenePattern module interprets the sample labels in categorical CLS files by their order of appearance, rather than via their numerical value, unlike some other GenePattern modules. For example, in the CLS file below:

13 2 1
# resistant sensitive
1 1 1 1 1 1 1 1 0 0 0 0 0

Most other GenePattern modules would interpret the first 8 samples to be sensitive and the remaining 5 to be resistant. However, GSEA assigns resistant to the first 8 samples and sensitive to the rest. This is because GSEA assigns the first name in the second line to the first symbol found on the third line.

If the sample labels are in numerical order, as below, no difference in behavior will be noted.

13 2 1
# resistant sensitive
0 0 0 0 0 1 1 1 1 1 1 1 1 

References

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 2005;102(43);15545-15550. (Link)

Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesivor JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC.  PGC-1-α responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34:267-273. (Link)

GSEA User Guide: http://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html

GSEA website: http://www.gsea-msigdb.org/

This version of the module is based on the GSEA v4.1.x code base. See the Release Notes for new features and other notable changes.

Parameters

* = required

Input Files

  1. expression dataset: GCT or RES file

This file contains the expression dataset.

  1. gene sets database: GMT, GMX, or GRP file.

Gene set files, either your own or from the listed MSigDB files.

  1. phenotype labels: CLS file

The GSEA module supports two kinds of class (CLS) files: categorical phenotype and continuous phenotype.

A categorical phenotype CLS file must define a single phenotype having two categorical labels, such as tumor and normal.

A continuous phenotype CLS may define multiple phenotypes. Each phenotype definition assigns a numerical value for each sample. This series of values defines the phenotype profile. For example,

4. chip platform: an optional CHIP file may be provided if you do not select a chip platform from the drop-down

Output Files

  1. Enrichment Report archive: ZIP

ZIP file containing the result files. For more information on interpreting these results, see Interpreting GSEA Results in the GSEA User Guide. Note that in prior versions the ZIP bundle was created as the only output file. This behavior has been changed to give direct access to the results without the need for a download.

  1. Enrichment Report: HTML and PNG images

The GSEA Enrichment Report. As above, see the GSEA User Guide for more info.

  1. Optional SVG images (compressed)

Identical to the PNGs in the Enrichment Report, but in SVG format for higher resolution. These are GZ compressed to reduce space usage; they can be decompressed using ‘gunzip’ on Mac or Linux and 7-Zip on Windows

  1. Optional GCTs

The datasets backing all the heatmap images from the Enrichment Report for use in external visualizers or analysis tools. These will have the same name as the corresponding image but instead with a GCT extension. When Collapse or Remap_Only is set, the collapsed dataset is also saved as a GCT. These files will be created if the Create GCTs option is true.

Platform Dependencies

Task Type:
Gene List Selection

CPU Type:
any

Operating System:
any

Language:
Java

Version Comments

Copyright © 2003-2022 Broad Institute, Inc., Massachusetts Institute of Technology, and Regents of the University of California. All rights reserved.