CollapseDataset (v2.2.x)

Author: Aravind Subramanian, Pablo Tamayo, David Eby; Broad Institute

Contact:

gp-help@broadinstitute.org
http://software.broadinstitute.org/cancer/software/genepattern/contact

Description

Collapses expression values from multiple input ids that map to a single target gene to a single value on a per-sample basis.

Summary

CollapseDataset uses the probe set ID to gene symbol mappings in a CHIP file to create a new dataset in which all probe set expression values for a given gene are collapsed into a single expression value for each sample. It does this for all probe sets in the original dataset. You can choose how the value that represents all probe sets for a given gene is determined, for example the maximum expression value, the median value, or the sum of all values; the collapse mode parameter lists every option. The new dataset uses gene symbols as the gene identifier format. Collapsing the probe sets eliminates multiple probes per gene, which can inflate gene set enrichment scores, and facilitates the biological interpretation of the gene set enrichment analysis results.

References

This version of the module is based on the GSEA v4.3.x code base. See the GSEA Website for more details.

Parameters

dataset file *: This is a file in either GCT or RES format that contains the expression dataset. GSEA-specific TXT format files are also accepted.
chip platform *: This drop-down allows you to specify the chip annotation file, which lists each probe on a chip and its matching HUGO gene symbol, used for the expression array. The chip files listed here are from the GSEA website. If you used a file not listed here, you will need to provide it (in CHIP format) using ‘Upload your own file’. Please see the MSigDB 7.0 Release Notes for information about symbol remapping.
collapse mode *: Collapsing mode for sets of multiple probes that map to a single gene. Selects how the single expression value that represents all of the gene's probes is determined. Options are:
- Max_probe (default): For each sample, use the maximum expression value for the probe set. That is, if there are three probes that map to a single gene, the expression value that will represent the collapsed probe set will be the maximum expression value from those three probes.
- Median_of_probes: For each sample, use the median expression value for the probe set.
- Mean_of_probes: For each sample, use the mean expression value for the probe set.
- Sum_of_probes: For each sample, sum all the expression values of the probe set.
- Abs_max_of_probes: For each sample, use the expression value for the probe set with the maximum absolute value. Each value retains its original sign but is chosen based on absolute value; in other words, the value with the largest magnitude is used. While this method is useful with computationally derived input datasets, it is generally not recommended for quantification-based expression measures such as counts or microarray fluorescence.
- Remap_only: Remap symbols from one namespace to another without collapsing (an error will occur if multiple source genes map to a single destination gene).

* = required
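
The collapse modes listed above can be illustrated with a minimal Python sketch. It is not the module's Java implementation: the function name, the dictionary-based data layout, and the handling of unmatched probes are assumptions made only for illustration.

```python
from statistics import mean, median

# Hypothetical lookup table; the keys mirror the collapse mode labels above.
COLLAPSE_FUNCS = {
    "Max_probe": max,
    "Median_of_probes": median,
    "Mean_of_probes": mean,
    "Sum_of_probes": sum,
    # Pick the value with the largest magnitude, keeping its original sign.
    "Abs_max_of_probes": lambda values: max(values, key=abs),
}


def collapse(rows, probe_to_symbol, mode="Max_probe"):
    """Collapse {probe_id: [per-sample values]} into {symbol: [per-sample values]}."""
    grouped = {}
    for probe_id, values in rows.items():
        symbol = probe_to_symbol.get(probe_id)
        if symbol is None:
            continue  # assumption: probes without a symbol are skipped here
        grouped.setdefault(symbol, []).append(values)

    if mode == "Remap_only":
        # Remapping must be one-to-one; several sources per target is an error.
        clashes = sorted(s for s, group in grouped.items() if len(group) > 1)
        if clashes:
            raise ValueError(f"multiple source ids map to: {clashes}")
        return {s: list(group[0]) for s, group in grouped.items()}

    func = COLLAPSE_FUNCS[mode]
    # zip(*group) walks the samples column by column across a gene's probes.
    return {s: [func(col) for col in zip(*group)] for s, group in grouped.items()}
```

For example, with three probes p1 = [1.0, -5.0], p2 = [3.0, 2.0], and p3 = [2.0, 4.0] that all map to one gene, Max_probe gives [3.0, 4.0] for that gene, while Abs_max_of_probes gives [3.0, -5.0], because -5.0 has the largest magnitude in the second sample.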

Advanced Parameters

output.file.name: Optionally, rename the result file to a user-supplied name. By default, this will be <dataset.file_basename>_collapsed (for any of the collapsing modes) or <dataset.file_basename>_remapped (for Remap_only).
omit features with no symbol match: By default (true), the new dataset excludes probes/genes that have no gene symbols. Set to false to have the new dataset contain all probes/genes that were in the original dataset.
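
As a rough illustration of the two advanced parameters, the following Python sketch shows one way the default output name and the unmatched-feature filter could behave. The function names are hypothetical, and two details are assumptions rather than documented behavior: that an unmatched feature is kept under its original identifier when the option is false, and that the default name carries no file extension.

```python
from pathlib import Path


def label_features(feature_ids, id_to_symbol, omit_unmatched=True):
    """Return (feature_id, output_label) pairs for the features that are kept."""
    kept = []
    for feature_id in feature_ids:
        symbol = id_to_symbol.get(feature_id)
        if symbol:
            kept.append((feature_id, symbol))
        elif not omit_unmatched:
            # Assumption: an unmatched feature keeps its original identifier.
            kept.append((feature_id, feature_id))
    return kept


def default_output_name(dataset_file, collapse_mode):
    """Build the default result name described for output.file.name."""
    suffix = "_remapped" if collapse_mode == "Remap_only" else "_collapsed"
    # Assumption: the basename is the file name without its extension.
    return Path(dataset_file).stem + suffix
```

With these assumptions, default_output_name("my_data.gct", "Max_probe") returns "my_data_collapsed", and default_output_name("my_data.gct", "Remap_only") returns "my_data_remapped".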

Input Files

dataset file: This file contains the expression dataset in GCT or RES format. GSEA-specific TXT format files are also accepted.
chip platform: This file defines symbol-to-gene mappings for a platform, possibly along with annotations, in CHIP format. The drop-down provides files from the MSigDB project for common platforms, but custom files may also be provided.
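
To make the two input formats more concrete, here is a minimal Python sketch of readers for a GCT dataset and a CHIP file. It assumes a GCT v1.2 layout (a "#1.2" line, a line with row and column counts, then a header starting with NAME and Description) and a tab-delimited CHIP file with "Probe Set ID" and "Gene Symbol" columns. The function names are hypothetical, and the sketch does not handle RES or TXT inputs or missing values.

```python
import csv


def read_gct(handle):
    """Parse GCT v1.2 text into (sample_names, {row_id: [float values]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected GCT version line: {version!r}")
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]  # skip the NAME and Description columns
    rows = {}
    for fields in reader:
        if not fields:
            continue  # tolerate blank trailing lines
        rows[fields[0]] = [float(v) for v in fields[2:]]
    if len(rows) != n_rows or len(samples) != n_cols:
        raise ValueError("row/column counts do not match the GCT header")
    return samples, rows


def read_chip(handle):
    """Parse a CHIP file into {probe_set_id: gene_symbol}, skipping empty symbols."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        record["Probe Set ID"]: record["Gene Symbol"]
        for record in reader
        if record.get("Gene Symbol")
    }
```

Both functions take an open text handle, for example the result of open(path, newline=""), so they can also be exercised with io.StringIO in tests.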

Output Files

The collapsed data set (GCT): The result file is always written in GCT format, regardless of the format of the input dataset.
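
For readers unfamiliar with the output layout, the following Python sketch writes a collapsed dataset as GCT v1.2 text. It is an illustration only: the function name is hypothetical, and using "na" as the placeholder in the Description column is an assumption, not necessarily what the module writes.

```python
def write_gct(handle, samples, rows, descriptions=None):
    """Write {gene_symbol: [per-sample values]} to a text handle as GCT v1.2."""
    descriptions = descriptions or {}
    handle.write("#1.2\n")
    # Second line: number of data rows, then number of samples.
    handle.write(f"{len(rows)}\t{len(samples)}\n")
    handle.write("\t".join(["NAME", "Description", *samples]) + "\n")
    for name, values in rows.items():
        # Assumption: "na" stands in when no description is available.
        description = descriptions.get(name, "na")
        fields = [name, description, *(str(v) for v in values)]
        handle.write("\t".join(fields) + "\n")
```

Writing samples ["S1", "S2"] with rows {"TP53": [3.0, 4.0]} produces four lines: the "#1.2" version line, the "1 2" dimensions line, the column header, and one tab-separated data row for TP53.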

Known Issues

Input files with spaces or special characters in their file names may cause errors.
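
One possible workaround is to rename input files before uploading them. The Python sketch below is a hypothetical helper, not part of the module; which characters count as "special" is not specified here, so the sketch conservatively assumes that only letters, digits, dots, underscores, and hyphens are safe.

```python
import re
from pathlib import Path


def safe_name(filename):
    """Return a file name whose stem uses only letters, digits, '.', '_' and '-'."""
    path = Path(filename)
    # Replace each run of other characters (including spaces) with one underscore.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", path.stem).strip("_")
    # Fall back to a generic stem if nothing usable remains; keep the extension.
    return (stem or "dataset") + path.suffix
```

For example, safe_name("my data (v2).gct") returns "my_data_v2.gct".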

Platform Dependencies

Task Type: Gene List Selection

CPU Type: any

Operating System: any

Language: Java

Version Comments

2.2.0 (2022-10-02): Updated to Human MSigDB v2022.1.Hs and Mouse MSigDB 2022.1.Mm.
2.1.5 (2022-09-15): Updated to Human MSigDB v2022.1.Hs. Direct support for Mouse MSigDB 2022.1.Mm is not yet available.
2.1.4 (2022-03-22): Removed Log4J entirely from the code base. Fixed weighted_p1.5 computation. Added min dataset size warnings.
2.1.3 (2022-01-20): Updated to Log4J 2.17.1.
2.1.2 (2022-01-12): Fixed a typo in the command line.
2.1.1 (2021-12-23): Updated with the GSEA Desktop 4.2.1 code base. Updated to Log4J 2.17.0. TXT file parser bug fix.
2.1.0 (2021-12-17): Updated with the GSEA Desktop 4.2.0 code base with numerous bug fixes. Adds the Abs_max_of_probes collapse mode. Fixes some issues handling datasets with missing values. Improved warnings and logging. Adds an output file name parameter. Fixed bugs in weighted_p1.5 scoring.
2.0.2 (2021-03-22): Fixed minor typo.
2.0.1 (2021-03-22): Minor doc updates.
2.0.0 (2021-01-14): Switched to the GSEA code base. Added new collapse.mode options and omit.features.with.no.symbol.match parameter.