GSEAPreranked (v6.0.12)

Runs the gene set enrichment analysis against a user-supplied ranked list of genes.

Author: Chet Birger, David Eby; Broad Institute

Contact:

Contact the GenePattern team for GenePattern issues.

See the GSEA forum for GSEA questions.

GSEA Version: 3.0

Introduction

GSEAPreranked runs Gene Set Enrichment Analysis (GSEA) against a user-supplied, ranked list of genes.  It determines whether a priori defined sets of genes show statistically significant enrichment at either end of the ranking.  A statistically significant enrichment indicates that the biological activity (e.g., biomolecular pathway) characterized by the gene set is correlated with the user-supplied ranking.

Details

Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data.  It evaluates cumulative changes in the expression of groups of multiple genes defined based on prior biological knowledge. 

The GSEAPreranked module can be used to conduct gene set enrichment analysis on data that do not conform to the typical GSEA scenario. For example, it can be used when the ranking metric choices provided by the GSEA module are not appropriate for the data, or when a ranked list of genomic features deviates from traditional microarray expression data (e.g., GWAS results, ChIP-Seq, RNA-Seq, etc.).

The user provides GSEAPreranked with a pre-ranked gene list.  Each gene in the list is paired with a numeric ranking statistic, which GSEAPreranked uses to sort the genes in descending order. GSEAPreranked calculates an enrichment score for each gene set.  A gene set’s enrichment score reflects how often members of that gene set occur at the top or bottom of the ranked data set (for example, in expression data, in either the most highly expressed genes or the most underexpressed genes).

The ranked list must not contain duplicate ranking values.

Duplicate ranking values may lead to arbitrary ordering of genes and to erroneous results, so verify that every gene in the ranked list has a unique ranking value before running the module.
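A check for duplicate ranking values can be scripted before submitting the file. The following is a minimal sketch, assuming the RNK file is tab-delimited with one `gene<TAB>score` pair per row and that lines starting with `#` are headers or comments. The function name `find_duplicate_scores` and the example genes and scores are illustrative only and are not part of GSEAPreranked.

```python
import csv
import io
from collections import defaultdict


def find_duplicate_scores(rnk_lines):
    """Return {score: [genes]} for every ranking value shared by two or more genes.

    Assumes tab-delimited rows of gene<TAB>score; rows whose first field
    starts with '#' are skipped (an assumption of this sketch).
    """
    genes_by_score = defaultdict(list)
    for row in csv.reader(rnk_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        gene, score = row[0], float(row[1])
        genes_by_score[score].append(gene)
    return {s: g for s, g in genes_by_score.items() if len(g) > 1}


# Illustrative in-memory RNK content; a real run would pass an open file.
example = io.StringIO("#gene\tscore\nTP53\t2.5\nBRCA1\t1.2\nMYC\t1.2\nEGFR\t-0.7\n")
duplicates = find_duplicate_scores(example)
print(duplicates)  # {1.2: ['BRCA1', 'MYC']}
```

An empty result means no two genes share a ranking value. How to break any ties that are found (for example, with a secondary statistic) depends on the data and is left to the user.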

Permutation test

In GSEAPreranked, permutations are always done by gene set. In standard GSEA, you can choose to set the parameter Permutation type to phenotype (the default) or gene set, but GSEAPreranked does not provide this option.
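The idea of a gene set permutation can be illustrated as follows: random gene sets of the same size as the real one are drawn from the ranked list, each is scored, and the observed score is compared with that null distribution. This is a simplified sketch and not the module's implementation. It uses an unweighted running sum for brevity, compares against same-sign null scores only (following our reading of the 2005 PNAS paper), and omits normalization and multiple-testing correction. All function names and the example data are hypothetical.

```python
import random


def classic_es(ranked_genes, gene_set):
    """Unweighted running-sum score: the maximum deviation from zero."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def gene_set_permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Estimate a nominal p-value from random gene sets of the same size."""
    members = set(gene_set) & set(ranked_genes)
    if not members or len(members) == len(ranked_genes):
        raise ValueError("gene set must contain some, but not all, ranked genes")
    rng = random.Random(seed)  # fixed seed gives repeatable results
    observed = classic_es(ranked_genes, members)
    null = [
        classic_es(ranked_genes, set(rng.sample(ranked_genes, len(members))))
        for _ in range(n_perm)
    ]
    # Compare only against null scores with the same sign as the observed score.
    same_sign = [s for s in null if (s >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan")
    extreme = sum(abs(s) >= abs(observed) for s in same_sign)
    return observed, extreme / len(same_sign)


ranked_genes = [f"G{i}" for i in range(100)]  # hypothetical, already ranked
top_heavy_set = set(ranked_genes[:10])        # all members at the top of the list
observed, p_value = gene_set_permutation_pvalue(
    ranked_genes, top_heavy_set, n_perm=200, seed=42
)
print(observed, p_value)
```

Because only gene set membership is shuffled, the ranking itself stays fixed, which is why this approach works without the per-sample phenotype labels that phenotype permutation would need.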

Understand and keep in mind how GSEAPreranked computes enrichment scores.

The GSEA PNAS 2005 paper introduced a method where a running sum statistic is incremented by the absolute value of the ranking metric when a gene belongs to the set. This method has proven to be efficient and facilitates intuitive interpretation of ranking metrics that reflect correlation of gene expression with phenotype. In the case of GSEAPreranked, you should make sure that this weighted scoring scheme applies to your choice of ranking statistic. If in doubt, we recommend the more conservative approach of setting the scoring scheme parameter to classic. Note that the parameter’s default value is weighted, the same default used by the GSEA module.  Please refer to the GSEA PNAS 2005 paper for further details.
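To see how the choice of scoring scheme changes the result, the running sum can be sketched in a few lines. This is an illustration based on the description in the 2005 PNAS paper, not the module's code: hits add |score|^p divided by the total hit weight, misses subtract 1 divided by the number of misses, and the enrichment score is the largest deviation from zero. The function name `enrichment_score` and the six-gene example are made up for this sketch.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Running-sum enrichment score over a list of (gene, score) pairs.

    `ranked` must already be sorted by score in descending order.
    p=0 corresponds to the classic scheme, p=1 to the weighted scheme.
    """
    hit_weights = [
        abs(score) ** p if gene in gene_set else 0.0 for gene, score in ranked
    ]
    total_hit_weight = sum(hit_weights)
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    running, best = 0.0, 0.0
    for (gene, _), weight in zip(ranked, hit_weights):
        if gene in gene_set:
            running += weight / total_hit_weight  # step up on a hit
        else:
            running -= 1.0 / n_miss               # step down on a miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list, sorted in descending order of score.
ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
gene_set = {"A", "D"}

weighted = enrichment_score(ranked, gene_set, p=1)
classic = enrichment_score(ranked, gene_set, p=0)
print(weighted)  # 0.75
print(classic)   # 0.5
```

In this example gene A carries a large score, so the weighted scheme gives it three quarters of the total hit weight, while the classic scheme treats both hits equally. If the magnitudes of a ranking statistic are not meaningful in this way, the classic scheme avoids letting them drive the score.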

References

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 2005;102(43):15545-15550.

Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34:267-273.

GSEA User Guide: http://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html

GSEA website: http://www.gsea-msigdb.org/

This version of the module is based on the new open-source GSEA v3.0 code base. See the Release Notes for new features and other notable changes, and this link for some notes on new feature usage. The algorithm is identical to the older v2.x code.

Parameters

NOTE: Certain parameters are considered to be "advanced"; that is, they control details of the GSEAPreranked algorithm that are typically not changed. You should not override the default values unless you are conversant with the algorithm.  These parameters are marked "Advanced" in the parameter descriptions.

Name Description
ranked list * This is a file in RNK format that contains the rank ordered gene (or feature) list.
gene sets database *

This parameter's drop-down allows you to select gene sets from the Molecular Signatures Database (MSigDB) on the GSEA website.  This drop-down provides access to only the most current version of MSigDB.  You can also upload your own gene set file(s) in GMT, GMX, or GRP format. 

If you want to use files from an earlier version of MSigDB you will need to download them from the archived releases on the website.

number of permutations * Specifies the number of permutations to perform in assessing the statistical significance of the enrichment score. It is best to start with a small number, such as 10, in order to check that your analysis will complete successfully (e.g., ensuring you have gene sets that satisfy the minimum and maximum size requirements). After the analysis completes successfully, run it again with a full set of permutations. The recommended number of permutations is 1000. Default: 1000
scoring scheme *

The enrichment statistic.  This parameter affects the running-sum statistic used for the enrichment analysis, controlling the value of p used in the enrichment score calculation.  Options are:

  • classic Kolmogorov-Smirnov: p=0
  • weighted (default): p=1; a running sum statistic that is incremented by the absolute value of the ranking metric when a gene belongs to the set (see the 2005 PNAS paper for details)
  • weighted_p2: p=2
  • weighted_p1.5: p=1.5
max gene set size * After filtering from the gene sets any gene not in the ranked list, gene sets larger than this are excluded from the analysis. Default: 500
min gene set size * After filtering from the gene sets any gene not in the ranked list, gene sets smaller than this are excluded from the analysis. Default: 15
normalization mode *

Method used to normalize the enrichment scores across analyzed gene sets. Options are:

  • meandiv (default): GSEA normalizes the enrichment scores as described in Normalized Enrichment Score (NES) in the GSEA User Guide.
  • None: GSEA does not normalize the enrichment scores.
make detailed gene set report * Create detailed gene set report (heat map, mountain plot, etc.) for each enriched gene set. Default: true
num top sets * GSEAPreranked generates summary plots and detailed analysis results for the top x gene sets in each phenotype, where x is 20 by default. The top gene sets are those with the largest normalized enrichment scores. Default: 20
random seed * Seed used to generate the random numbers for the gene set permutations. Timestamp is the default. Using a specific integer-valued seed generates consistent results, which is useful when testing software.
output file name * Name of the output file. The name cannot include spaces. Default: <expression.dataset_basename>.zip
create svgs * Whether to create SVG images (compressed) along with PNGs. Saving SVGs requires a lot of storage; therefore, this parameter is set to false by default.
selected gene sets Comma-separated list of gene sets from the provided gene sets database files (GMT/GMX/GRP). If you are using multiple files then you *must* prefix each selected gene set with its file name followed by '#' (like "my_file1.gmt#selected_gene_set1,my_file2.gmt#selected_gene_set2"). With a single file only the names are necessary. Leave this blank to select all gene sets.
alt delim Optional alternate delimiter character for gene set names instead of comma for use with selected.gene.sets. If used, a semicolon is recommended. 
create zip * Create a ZIP bundle of the output files. This is true by default, matching the former behavior where a ZIP bundle was always created.

* - required
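The format of the selected gene sets value described above can be illustrated with a short parser. This is only a sketch of how such a string breaks down into file names and gene set names; it is not the module's own parsing code, and the function name `parse_selected_gene_sets` and the gene set name used with the semicolon delimiter are hypothetical.

```python
def parse_selected_gene_sets(value, delimiter=","):
    """Split a selection string into (file_name, gene_set_name) pairs.

    file_name is None when an entry has no '#' prefix, which is the
    single-file case described in the parameter table.
    """
    selections = []
    for entry in value.split(delimiter):
        entry = entry.strip()
        if not entry:
            continue
        file_name, sep, set_name = entry.partition("#")
        if sep:
            selections.append((file_name, set_name))
        else:
            selections.append((None, entry))
    return selections


# The example string from the parameter description (default comma delimiter).
multi = parse_selected_gene_sets(
    "my_file1.gmt#selected_gene_set1,my_file2.gmt#selected_gene_set2"
)
print(multi)

# With an alternate delimiter, names that themselves contain commas stay intact.
alt = parse_selected_gene_sets("SET_WITH,COMMA;OTHER_SET", delimiter=";")
print(alt)
```

The second call shows why an alternate delimiter is useful: a gene set name containing a comma would otherwise be split into two entries.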

Input Files

1. ranked list:  RNK file

This file contains the rank ordered gene (or feature) list.

2. gene sets database file: GMT, GMX, or GRP file

Gene set files, either your own or from the listed MSigDB files.
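Before running the module, it can be useful to see which of your own gene sets would pass the min gene set size and max gene set size limits once genes missing from the ranked list are removed. The sketch below assumes the GMT layout of one gene set per tab-delimited row: name, description, then the member genes. The function names, gene set names, and gene identifiers are illustrative, and the small limits in the example are chosen only to keep it short.

```python
def load_gmt(lines):
    """Parse GMT rows (name, description, gene1, gene2, ...) into {name: set_of_genes}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def filter_by_size(gene_sets, ranked_genes, min_size=15, max_size=500):
    """Keep gene sets whose overlap with the ranked list is within the size limits."""
    ranked = set(ranked_genes)
    kept = {}
    for name, genes in gene_sets.items():
        present = genes & ranked  # drop genes that are not in the ranked list
        if min_size <= len(present) <= max_size:
            kept[name] = present
    return kept


# Hypothetical GMT rows and ranked genes.
gmt_lines = ["SET_A\tdescription\tG1\tG2\tG3\n", "SET_B\tdescription\tG1\tG9\n"]
ranked_genes = ["G1", "G2", "G3", "G4"]

kept = filter_by_size(load_gmt(gmt_lines), ranked_genes, min_size=2, max_size=3)
print(sorted(kept))  # ['SET_A']
```

SET_B is dropped because only one of its genes appears in the ranked list. If every gene set is excluded this way, a mismatch between the gene identifiers in the two files is a common cause worth checking.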

Output Files

1. Optional Enrichment Report archive: ZIP

ZIP file containing the result files.  For more information on interpreting these results, see Interpreting GSEA Results in the GSEA User Guide. Note that in prior versions the ZIP bundle was created as the only output file. This behavior has been changed to give direct access to the results without the need for a download. The default is to create the ZIP bundle, matching the former behavior, but the report files will always be created directly.

2. Enrichment Report: HTML and PNG images

The GSEA Enrichment Report.  As above, see the GSEA User Guide for more info.

3. Optional SVG images (compressed)

Identical to the PNGs in the Enrichment Report, but in SVG format for higher resolution. These are GZ compressed to reduce space usage; they can be decompressed using 'gunzip' on Mac or Linux, or 7-Zip on Windows.
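As an alternative to the tools named above, the compressed SVGs can be unpacked with Python's standard library on any platform. This is a minimal sketch. The helper name `decompress_gz` and the file name `enplot_example.svg.gz` are made up for the demonstration, which creates its own throwaway file rather than reading real module output.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_gz(src, dest=None):
    """Decompress a gzip file and return the path of the decompressed copy.

    By default the output sits next to the input with the final suffix
    removed (for example, plot.svg.gz becomes plot.svg).
    """
    src = Path(src)
    dest = Path(dest) if dest else src.with_suffix("")
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


# Demo with a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "enplot_example.svg.gz"
    with gzip.open(archive, "wb") as f:
        f.write(b"<svg></svg>")
    svg_path = decompress_gz(archive)
    restored_name = svg_path.name
    restored_bytes = svg_path.read_bytes()

print(restored_name)  # enplot_example.svg
```

To unpack a whole results folder, the same helper can be applied to each path returned by `Path(folder).glob("*.gz")`.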

Platform Dependencies

Task Type:
Gene List Selection

CPU Type:
any

Operating System:
any

Language:
Java

Version Comments

Version Release Date Description
6.0.12 2019-10-10 Updated to use the GSEA v3.0 open-source code base. Updated to give access to MSigDB v6.2. Unified the Gene Set DB selector parameters and better downloading of MSigDB files. Added selected.gene.sets, alt.delim and create.svgs parameters. Better temp file clean-up and other internal code improvements.
5 2017-05-18 Updated to give access to MSigDB v6.0
4 2016-02-04 Updated to give access to MSigDB v5.1
3 2015-12-04 Updating the GSEA jar to deal with an issue with FTP access. Fixes an issue for GP@IU.
2 2015-06-16 Updated for MSigDB v5.0 and hallmark gene sets support.
1 2013-06-17 Initial Release