CollapseDataset

A GenePattern module for running GSEA's CollapseDataset utility

CollapseDataset (v2.2.x)

Author: Aravind Subramanian, Pablo Tamayo, David Eby; Broad Institute

Contact:

Description

Collapses expression values from multiple input IDs that map to a single target gene into a single value on a per-sample basis.

Summary

CollapseDataset uses the Probe Set ID to Gene Symbol mappings in a CHIP file to create a new dataset in which all probe set expression values for a given gene are collapsed into a single expression value for each sample. It does this for every gene represented in the original dataset. You can choose how the representative value is determined, for example the maximum, median, mean, or sum of the expression values of the probe sets that map to the gene; the collapse mode parameter lists all available options. The new dataset uses gene symbols as the gene identifier format. Collapsing eliminates multiple probes per gene, which can inflate gene set enrichment scores, and facilitates the biological interpretation of the gene set enrichment analysis results.

References

This version of the module is based on the GSEA v4.3.x code base. See the GSEA Website for more details.

Parameters

  1. dataset file*: This is a file in either GCT or RES format that contains the expression dataset. GSEA-specific TXT format files are also accepted.
  2. chip platform *: This drop-down allows you to specify the chip annotation file, which lists each probe on a chip and its matching HUGO gene symbol, used for the expression array. The chip files listed here are from the GSEA website. If you used a file not listed here, you will need to provide it (in CHIP format) using ‘Upload your own file’. Please see the MSigDB 7.0 Release Notes for information about symbol remapping.
  3. collapse mode *: Method used to collapse multiple probes that map to a single gene. It determines the expression value that represents the gene in each sample. Options are:
    • Max_probe (default): For each sample, use the maximum expression value for the probe set. That is, if there are three probes that map to a single gene, the expression value that will represent the collapsed probe set will be the maximum expression value from those three probes.
    • Median_of_probes: For each sample, use the median expression value for the probe set.
    • Mean_of_probes: For each sample, use the mean expression value for the probe set.
    • Sum_of_probes: For each sample, sum all the expression values of the probe set.
    • Abs_max_of_probes: For each sample, use the expression value for the probe set with the maximum absolute value. Each value retains its original sign but is chosen based on absolute value; in other words, the largest-magnitude value is used. While this method is useful with computational-based input datasets, it is generally not recommended for use with quantification-based expression measures such as counts or microarray fluorescence.
    • Remap_only: Remap symbols from one namespace to another without collapsing (an error will occur if multiple source genes map to a single destination gene).
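The collapse modes above can be illustrated with a minimal Python sketch. This is not the module's Java implementation: the function name `collapse`, the `REDUCERS` table, the dictionary-based inputs, and the sample data are all hypothetical, and the tie-breaking behavior of the absolute-maximum reducer (first value wins) is an assumption.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical reducers keyed by the mode names listed above.
REDUCERS = {
    "Max_probe": max,
    "Median_of_probes": median,
    "Mean_of_probes": mean,
    "Sum_of_probes": sum,
    # Picks the largest-magnitude value and keeps its sign.
    # Assumption: on ties, the first value encountered wins.
    "Abs_max_of_probes": lambda values: max(values, key=abs),
}


def collapse(expression, probe_to_symbol, mode="Max_probe"):
    """Collapse probe-level rows into one row per gene symbol.

    expression: dict of probe id -> list of per-sample values
    probe_to_symbol: dict of probe id -> gene symbol (stand-in for a CHIP file)
    """
    grouped = defaultdict(list)
    for probe, values in expression.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is not None:  # this sketch skips probes without a symbol
            grouped[symbol].append(values)

    if mode == "Remap_only":
        # Remapping without collapsing fails when ids collide on one symbol.
        clashes = sorted(s for s, rows in grouped.items() if len(rows) > 1)
        if clashes:
            raise ValueError(f"multiple source ids map to: {clashes}")
        return {symbol: rows[0] for symbol, rows in grouped.items()}

    reduce_values = REDUCERS[mode]
    # zip(*rows) walks the samples column by column.
    return {
        symbol: [reduce_values(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }


# Made-up data: three probes for TP53, one for MYC, two samples.
expression = {
    "p1": [1.0, -5.0],
    "p2": [3.0, 2.0],
    "p3": [2.0, 4.0],
    "p4": [7.0, 8.0],
}
probe_to_symbol = {"p1": "TP53", "p2": "TP53", "p3": "TP53", "p4": "MYC"}

by_max = collapse(expression, probe_to_symbol, "Max_probe")
by_abs_max = collapse(expression, probe_to_symbol, "Abs_max_of_probes")
```

With this data, `by_max["TP53"]` is `[3.0, 4.0]`, while `by_abs_max["TP53"]` is `[3.0, -5.0]` because -5.0 has the largest magnitude in the second sample. Calling `collapse` with `"Remap_only"` on the same data raises an error, since three probes map to TP53.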

* = required

Advanced Parameters

  1. output.file.name: Optionally, rename the result file to a user-supplied name. By default, this will be <dataset.file_basename>_collapsed (for any of the collapsing modes) or <dataset.file_basename>_remapped (for Remap_only).

  2. omit features with no symbol match: By default (true), the new dataset excludes probes/genes that have no matching gene symbol. Set to false to have the new dataset contain all probes/genes that were in the original dataset.
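As a rough illustration of the two advanced parameters, the following Python sketch derives the default output name and shows the effect of omitting unmatched features. The helper names `default_output_name` and `target_ids` are hypothetical, and it is an assumption of this sketch that an unmatched row keeps its original identifier when it is retained.

```python
from pathlib import Path


def default_output_name(dataset_file, collapse_mode):
    """Default result-file base name as described above (no extension added)."""
    suffix = "_remapped" if collapse_mode == "Remap_only" else "_collapsed"
    return Path(dataset_file).stem + suffix


def target_ids(probe_ids, probe_to_symbol, omit_unmatched=True):
    """Return the identifier each input row would carry before collapsing."""
    ids = []
    for probe in probe_ids:
        symbol = probe_to_symbol.get(probe)
        if symbol:
            ids.append(symbol)
        elif not omit_unmatched:
            # Assumption: a retained unmatched row keeps its original id.
            ids.append(probe)
    return ids


name = default_output_name("my_data.gct", "Max_probe")
kept = target_ids(["p1", "p9"], {"p1": "TP53"}, omit_unmatched=False)
```

Here `name` is `"my_data_collapsed"`, and `kept` is `["TP53", "p9"]`; with the default `omit_unmatched=True`, the unmatched `p9` row would be dropped.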

Input Files

  1. dataset file: This file contains the expression dataset in GCT or RES format. GSEA-specific TXT format files are also accepted.
  2. chip platform: This file defines symbol-to-gene mappings for a platform, possibly along with annotations, in CHIP format. The drop-down provides files from the MSigDB project for common platforms, but custom files may also be provided.

Output Files

  1. The collapsed data set (GCT): The collapsed dataset is always written in GCT format, even if the input dataset was in a different format.
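For readers unfamiliar with the layout of a GCT file, here is a minimal Python sketch that writes a collapsed dataset in GCT version 1.2 form: a version line, a line with the row and column counts, a header line, then one tab-separated row per gene. The function name `write_gct` and the sample data are hypothetical, and the `na` placeholder in the Description column is an assumption of this sketch rather than what the module necessarily emits.

```python
import io


def write_gct(rows, sample_names, out):
    """Write rows (dict of gene symbol -> per-sample values) as GCT v1.2."""
    out.write("#1.2\n")
    out.write(f"{len(rows)}\t{len(sample_names)}\n")
    out.write("\t".join(["NAME", "Description", *sample_names]) + "\n")
    for symbol, values in rows.items():
        # Assumption: "na" stands in for a missing description.
        fields = [symbol, "na", *(str(v) for v in values)]
        out.write("\t".join(fields) + "\n")


buffer = io.StringIO()
write_gct({"TP53": [3.0, 4.0], "MYC": [7.0, 8.0]}, ["s1", "s2"], buffer)
gct_lines = buffer.getvalue().splitlines()
```

The same function can write to a real file by passing an object opened with `open(path, "w")` in place of the in-memory buffer.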

Known Issues

Input files with spaces or special characters in their file names may cause errors.

Platform Dependencies

Task Type: Gene List Selection

CPU Type: any

Operating System: any

Language: Java

Version Comments

Copyright © 2003-2022 Broad Institute, Inc., Massachusetts Institute of Technology, and Regents of the University of California. All rights reserved.