ssGSEA GenePattern Module Documentation

A GenePattern module for running the ssGSEA method

ssGSEA (v10.1.x)

Performs single sample GSEA

Author: GenePattern

Contact:

Contact the GenePattern team for GenePattern issues.

See the GSEA forum for GSEA questions.

Algorithm Version:

Description

Project each sample within a data set onto a space of gene set enrichment scores using the ssGSEA projection methodology described in Barbie et al., 2009.

Summary

Single-sample GSEA (ssGSEA), an extension of Gene Set Enrichment Analysis (GSEA), calculates separate enrichment scores for each pairing of a sample and gene set. Each ssGSEA enrichment score represents the degree to which the genes in a particular gene set are coordinately up- or down-regulated within a sample.

NOTE: with the release of v10.0.1, this module was renamed from “ssGSEAProjection” to just “ssGSEA”.

Discussion

When analyzing genome-wide transcription profiles from microarray data, a typical goal is to find genes significantly differentially correlated with distinct sample classes defined by a particular phenotype (e.g., tumor vs. normal). These findings can be used to provide insights into the underlying biological mechanisms or to classify (predict the phenotype of) a new sample. Gene Set Enrichment Analysis (GSEA) addressed this problem by evaluating whether a priori defined sets of genes, associated with particular biological processes (such as pathways), chromosomal locations, or experimental results are enriched at either the top or bottom of a list of differentially expressed genes ranked by some measure of differences in a gene’s expression across sample classes. Examples of ranking metrics are fold change for categorical phenotypes (e.g., tumor vs. normal) and Pearson correlation for continuous phenotypes (e.g., age). Enrichment provides evidence for the coordinate up- or down-regulation of a gene set’s members and the activation or repression of some corresponding biological process.

Where GSEA generates a gene set’s enrichment score with respect to phenotypic differences across a collection of samples within a dataset, ssGSEA calculates a separate enrichment score for each pairing of sample and gene set, independent of phenotype labeling. In this manner, ssGSEA transforms a single sample’s gene expression profile to a gene set enrichment profile. A gene set’s enrichment score represents the activity level of the biological process in which the gene set’s members are coordinately up- or down-regulated. This transformation allows researchers to characterize cell state in terms of the activity levels of biological processes and pathways rather than through the expression levels of individual genes.
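The per-sample scoring described above can be illustrated with a minimal Python sketch. It assumes one common formulation of the statistic: a rank-weighted running sum over genes in the set, minus a uniform running sum over genes outside it, summed across the ranked list. The function name `ssgsea_score`, the `alpha` exponent, and its default of 0.75 are assumptions made for illustration; the module's own R implementation may differ in weighting and normalization.

```python
def ssgsea_score(expression, gene_set, alpha=0.75):
    """Illustrative enrichment score for one sample and one gene set.

    expression: dict mapping gene symbol -> expression value in this sample.
    gene_set:   iterable of gene symbols.
    alpha:      assumed exponent applied to the rank-based weights.
    """
    # Rank genes from highest to lowest expression within the sample.
    ranked = sorted(expression, key=lambda g: expression[g], reverse=True)
    n = len(ranked)
    members = set(gene_set) & set(ranked)
    n_hits = len(members)
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must overlap the sample's genes "
                         "without covering all of them")

    # The top-ranked gene gets rank n, the bottom-ranked gene gets rank 1.
    total_weight = sum((n - i) ** alpha
                       for i, g in enumerate(ranked) if g in members)
    miss_step = 1.0 / (n - n_hits)

    p_hit = p_miss = score = 0.0
    for i, gene in enumerate(ranked):
        if gene in members:
            p_hit += (n - i) ** alpha / total_weight
        else:
            p_miss += miss_step
        # Accumulate the gap between the two running sums.
        score += p_hit - p_miss
    return score


# Hypothetical six-gene sample, used only to show the sign of the score.
sample = {"A": 6.0, "B": 5.0, "C": 4.0, "D": 3.0, "E": 2.0, "F": 1.0}
top_score = ssgsea_score(sample, {"A", "B"})      # set at the top: positive
bottom_score = ssgsea_score(sample, {"E", "F"})   # set at the bottom: negative
```

Because the score depends only on the ranking within a single sample, no phenotype labels are needed to compute it.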

In working with the transformed data, the goal is to find biological processes that are differentially active across the phenotype of interest and to use these measures of process activity to characterize the phenotype. The benefit is that the ssGSEA projection transforms the data to a higher-level space (pathways instead of genes), yielding a more biologically interpretable set of features to which analytic methods can be applied.

As a practical matter, ssGSEA reduces the dimensionality of the data set: each sample is described by one score per gene set rather than one value per gene. You can look for correlations between the gene set enrichment scores and the phenotype of interest (e.g., tumor vs. normal) by using the output GCT file with a module such as ComparativeMarkerSelection. You could also try clustering the data set. Gene sets that stand out as strong predictors of the phenotype, or that characterize specific clusters, can then be mapped to biochemical pathways, giving you insight into what is driving the phenotype of interest.
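As a rough illustration of relating enrichment scores to a two-class phenotype, the Python sketch below ranks gene sets by a signal-to-noise statistic (difference of class means divided by the sum of class standard deviations). This is a stand-in chosen for simplicity, not the ComparativeMarkerSelection implementation; the function name `rank_gene_sets` and the example data are hypothetical.

```python
from statistics import mean, stdev


def rank_gene_sets(scores, labels, class_a, class_b):
    """Rank gene sets by how strongly their scores separate two classes.

    scores: dict mapping gene set name -> list of scores, one per sample.
    labels: list of class labels, one per sample, in the same order.
    Each class needs at least two samples so a standard deviation exists.
    """
    ranked = []
    for name, values in scores.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        spread = stdev(a) + stdev(b)
        s2n = (mean(a) - mean(b)) / spread if spread > 0 else 0.0
        ranked.append((name, s2n))
    # Strongest separation first, regardless of direction.
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked


# Hypothetical enrichment scores for four samples.
example_scores = {
    "SET_X": [2.0, 2.2, 0.1, 0.3],
    "SET_Y": [1.0, 0.9, 1.1, 1.0],
}
example_labels = ["tumor", "tumor", "normal", "normal"]
ranking = rank_gene_sets(example_scores, example_labels, "tumor", "normal")
```

A positive statistic here means the gene set scores higher in the first class; a real analysis would also assess significance, for example by permutation.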

While the output GCT file can be passed along to any module accepting that format, it does not make sense to run it through GSEA, because its rows are gene sets rather than genes.

This module implements the single-sample GSEA projection methodology described in Barbie et al., 2009.

NOTE: with the release of v10.0.1, this module was renamed from “ssGSEAProjection” to just “ssGSEA” for clarity and brevity, as it is commonly referred to by this name. The underlying algorithm and code remain the same.

References

  1. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 2005;102(43):15545-15550. http://www.pnas.org/content/102/43/15545.abstract
  2. Barbie DA, Tamayo P, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462:108-112.

Parameters

* = required

Input Files

  1. Input expression dataset

The GCT file containing the input dataset’s gene expression data (see the GCT file format information). Gene symbols are typically listed in the column with header Name; however, GCT files containing RNAi data may list the gene symbol name in alternative columns. The “gene symbol column name” parameter specifies which of the input GCT file’s columns contains the gene symbols.

The input GCT file’s row identifiers must draw from the same family of gene identifiers (the same ontology or name space, such as HUGO Gene Nomenclature) as those used to identify genes in the gene sets database file (see next item below). Typically these are human gene symbols.

If a GCT file’s row identifiers are probe IDs, and gene sets are defined through a list of human gene symbols, it will be necessary to collapse all probe set expression values for a given gene into a single expression value and use a human gene symbol to represent that gene. The CollapseDataset GenePattern module can make this transformation.
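The idea of collapsing probe-level rows to one row per gene can be sketched in Python as follows. Taking the per-sample maximum across a gene's probes is an assumption made here for illustration; the collapsing rule that CollapseDataset applies depends on how that module is configured. The function name `collapse_to_genes` and the probe identifiers are hypothetical.

```python
def collapse_to_genes(probe_values, probe_to_gene):
    """Collapse probe-level rows to gene-level rows.

    probe_values:  dict mapping probe ID -> list of values, one per sample.
    probe_to_gene: dict mapping probe ID -> gene symbol.
    Probes without a gene symbol are dropped. For each gene and sample,
    the maximum value across that gene's probes is kept (assumed rule).
    """
    collapsed = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        if gene in collapsed:
            collapsed[gene] = [max(a, b)
                               for a, b in zip(collapsed[gene], values)]
        else:
            collapsed[gene] = list(values)
    return collapsed


# Hypothetical probe IDs and values for two samples.
probes = {
    "probe_1": [1.0, 5.0],
    "probe_2": [3.0, 2.0],
    "probe_3": [0.5, 0.5],
    "probe_4": [9.0, 9.0],   # no gene annotation, so it is dropped
}
annotation = {"probe_1": "TP53", "probe_2": "TP53", "probe_3": "KRAS"}
gene_level = collapse_to_genes(probes, annotation)
```

After collapsing, the row identifiers are gene symbols and can be matched against gene sets defined by gene symbol.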

  2. Gene sets database files

One or more optional GMT or GMX files, each containing a collection of gene set definitions (see the GMT file format and the GMX file format in the GenePattern file formats documentation).
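As an illustration of the GMT layout, the Python sketch below reads gene sets from a tab-delimited file in which each line holds a gene set name, a description, and then the member genes. The function name `read_gmt` and the example gene sets are hypothetical; GMX files, which store one gene set per column, are not handled here.

```python
import io


def read_gmt(handle):
    """Read a GMT file into {gene_set_name: [gene symbols]}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue                      # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]   # drop empty trailing cells
    return gene_sets


# Tiny in-memory example with made-up gene sets.
gmt_text = (
    "SET_ONE\tfirst example set\tTP53\tKRAS\tEGFR\n"
    "SET_TWO\tsecond example set\tMYC\tTP53\n"
)
gene_sets = read_gmt(io.StringIO(gmt_text))
```

When several files are supplied, the dictionaries returned for each file could be merged with `dict.update`, keeping in mind that identically named gene sets would overwrite one another.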

Output Files

  1. Output enrichment score dataset

A GCT file containing the input dataset’s projection onto a space of specified gene sets. This GCT file may serve as input into GenePattern’s many clustering and classification algorithms.

Experimentally derived gene sets often come in pairs whose names are identical except for appended _UP and _DN suffixes. For such pairs, the combine.add and combine.replace combine modes produce a combined gene set named with the suffix removed: combine.add adds the combined gene set alongside the original pair, while combine.replace replaces the original pair with it. Either way, the output contains new gene set names that may affect downstream applications that use these files in combination with the original gene set collection file. For compatibility with combine.add mode output, check that downstream applications utilize subsets of gene sets within a collection.
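The naming behavior of the two combine modes can be sketched in Python as follows. How the module derives the combined score is not stated here, so the sketch assumes it is the _UP score minus the _DN score; that rule, the function name `combine_up_dn`, and the example gene set names are all assumptions for illustration.

```python
def combine_up_dn(scores, mode="combine.add"):
    """Apply a combine mode to a {gene_set_name: [scores]} mapping.

    combine.add:     keep the _UP/_DN pair and add the combined set.
    combine.replace: drop the _UP/_DN pair and keep only the combined set.
    """
    if mode not in ("combine.add", "combine.replace"):
        raise ValueError("unsupported combine mode: %r" % mode)

    result = dict(scores)
    for name in scores:
        if not name.endswith("_UP"):
            continue
        base = name[:-len("_UP")]
        dn_name = base + "_DN"
        if dn_name not in scores:
            continue                      # unpaired sets are left untouched
        # Assumed combination rule: UP score minus DN score, per sample.
        combined = [u - d for u, d in zip(scores[name], scores[dn_name])]
        if mode == "combine.replace":
            del result[name]
            del result[dn_name]
        result[base] = combined           # new name with the suffix removed
    return result


# Hypothetical scores for two samples.
paired_scores = {
    "EXAMPLE_SIG_UP": [0.5, 0.2],
    "EXAMPLE_SIG_DN": [0.1, 0.4],
    "OTHER_SET": [1.0, 1.0],
}
added = combine_up_dn(paired_scores, "combine.add")
replaced = combine_up_dn(paired_scores, "combine.replace")
```

In both results the name `EXAMPLE_SIG` is new relative to the input collection, which is why downstream tools that look names up in the original gene set file may not find it.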

Platform Dependencies

Task Type:
Projection

CPU Type:
any

Operating System:
any

Language:
R-3.2

Version Comments

Copyright © 2003-2022 Broad Institute, Inc., Massachusetts Institute of Technology, and Regents of the University of California. All rights reserved.