CollapseDataset (v2.2.x)
Author: Aravind Subramanian, Pablo Tamayo, David Eby; Broad Institute
Contact:
Description
Collapses expression values from multiple input ids that map to a single target gene to a single value on a per-sample basis.
Summary
CollapseDataset uses the probe set ID to gene symbol mappings in a CHIP file to create
a new dataset in which, for each sample, the expression values of all probe sets that map
to the same gene are collapsed into a single expression value. It does this for every gene
represented in the original dataset. You choose how the representative value is computed:
for example, the maximum, median, mean, or sum of the values of the gene's probe sets (see
the collapse mode parameter for all options). The new dataset uses gene symbols as its gene
identifiers. Collapsing removes multiple probes per gene, which can otherwise inflate
gene set enrichment scores, and makes the gene set enrichment analysis results easier to
interpret biologically.
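As a rough illustration of the collapsing step described above, the following Python sketch groups probe rows by gene symbol and keeps the per-sample maximum. It is not the module's Java implementation; the function name `collapse_max`, the dictionary layout, and the probe ids are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def collapse_max(expression, probe_to_gene):
    """Collapse probe-level rows to gene-level rows using the per-sample maximum.

    expression: dict of probe id -> list of values, one per sample (hypothetical layout)
    probe_to_gene: dict of probe id -> gene symbol
    """
    # Group the probe rows by the gene symbol each probe maps to.
    rows_by_gene = defaultdict(list)
    for probe_id, values in expression.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue  # assumption: probes without a symbol are dropped in this sketch
        rows_by_gene[gene].append(values)

    # zip(*rows) walks the samples column by column, so max() is taken per sample.
    return {gene: [max(column) for column in zip(*rows)]
            for gene, rows in rows_by_gene.items()}


# Made-up data: two probes map to TP53, one maps to BRCA1; three samples each.
expression = {
    "1001_at": [5.0, 2.0, 7.0],
    "1002_at": [3.0, 9.0, 1.0],
    "1003_at": [4.0, 4.0, 4.0],
}
probe_to_gene = {"1001_at": "TP53", "1002_at": "TP53", "1003_at": "BRCA1"}

collapsed = collapse_max(expression, probe_to_gene)
print(collapsed)  # {'TP53': [5.0, 9.0, 7.0], 'BRCA1': [4.0, 4.0, 4.0]}
```

Note that the collapsed TP53 row takes its values from different probes in different samples, because the maximum is chosen independently for each sample.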
References
This version of the module is based on the GSEA v4.3.x code base. See the
GSEA Website for more details.
Parameters
- dataset file*: This is a file in either GCT or RES format that contains the expression dataset. GSEA-specific TXT format files are also accepted.
- chip platform*: This drop-down specifies the chip annotation file for the expression array; the file lists each probe on the chip and its matching HUGO gene symbol. The chip files listed here are from the GSEA website. If you used a file that is not listed here, you will need to provide it (in CHIP format) using ‘Upload your own file’. Please see the MSigDB 7.0 Release Notes for information about symbol remapping.
- collapse mode*: How to collapse multiple probes that map to a single gene. Selects how the single expression value that represents all of the gene's probes is computed. Options are:
- Max_probe (default): For each sample, use the maximum expression value for the probe set. That is, if there are three probes that map to a single gene, the expression value that will represent the collapsed probe set will be the maximum expression value from those three probes.
- Median_of_probes: For each sample, use the median expression value for the probe set.
- Mean_of_probes: For each sample, use the mean expression value for the probe set.
- Sum_of_probes: For each sample, sum all the expression values of the probe set.
- Abs_max_of_probes: For each sample, use the expression value for the probe set with the maximum absolute value. Note that each value retains its original sign but is chosen based on absolute value. In other words, the largest magnitude value is used. While this method is useful with computational-based input datasets, it is generally not recommended for use with quantification-based expression measures such as counts or microarray fluorescence.
- Remap_only: Remap symbols from one namespace to another without collapsing (an error will occur if multiple source genes map to a single destination gene).
* = required
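The collapse modes listed above can be illustrated on the values of one gene's probes in a single sample. This is a minimal sketch, not the module's code: the `REDUCERS` table, `collapse_values`, and `check_remap_only` are hypothetical names, and the tie-breaking behaviour of the real Abs_max_of_probes mode is not described in this document.

```python
import statistics

# Hypothetical reducers keyed by the collapse mode names used in this document.
REDUCERS = {
    "Max_probe": max,
    "Median_of_probes": statistics.median,
    "Mean_of_probes": statistics.mean,
    "Sum_of_probes": sum,
    # Picks the value with the largest magnitude and keeps its original sign.
    "Abs_max_of_probes": lambda values: max(values, key=abs),
}


def collapse_values(values, mode):
    """Reduce the probe values of one gene in one sample to a single value."""
    return REDUCERS[mode](values)


def check_remap_only(source_to_dest):
    """Sketch of the Remap_only constraint: fail if two source ids share a destination."""
    seen = {}
    for source, dest in source_to_dest.items():
        if dest in seen:
            raise ValueError(f"{seen[dest]} and {source} both map to {dest}")
        seen[dest] = source


# Made-up values for three probes of one gene in one sample.
values = [-6.0, 2.0, 4.0]
results = {mode: collapse_values(values, mode) for mode in REDUCERS}
for mode, value in results.items():
    print(mode, value)
```

With these values, Max_probe gives 4.0 while Abs_max_of_probes gives -6.0, which shows why the latter suits signed, computed scores better than non-negative measurements such as counts.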
Advanced Parameters
- output.file.name: Optionally, rename the result file to a user-supplied name. By default, this will be <dataset.file_basename>_collapsed (for any of the collapsing modes) or <dataset.file_basename>_remapped (for Remap_only).
- omit features with no symbol match: By default (true), the new dataset excludes probes/genes that have no gene symbols. Set to false to have the new dataset contain all probes/genes that were in the original dataset.
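The two advanced parameters can be sketched as follows. This is an illustration under stated assumptions, not the module's behaviour: it assumes that the basename is the input file name without its extension, and that unmatched features keep their original identifier when they are not omitted. The function names are hypothetical.

```python
from pathlib import Path


def default_output_name(dataset_file, collapse_mode):
    """Build the default result name from the dataset file name and the collapse mode."""
    # Assumption: "basename" means the file name with its extension removed.
    stem = Path(dataset_file).stem
    suffix = "_remapped" if collapse_mode == "Remap_only" else "_collapsed"
    return stem + suffix


def select_features(feature_ids, id_to_symbol, omit_unmatched=True):
    """Return (feature id, output label) pairs for the features that are kept."""
    selected = []
    for feature_id in feature_ids:
        symbol = id_to_symbol.get(feature_id)
        if symbol:
            selected.append((feature_id, symbol))
        elif not omit_unmatched:
            # Assumption: an unmatched feature is carried over under its original id.
            selected.append((feature_id, feature_id))
    return selected


print(default_output_name("my_data.gct", "Max_probe"))   # my_data_collapsed
print(default_output_name("my_data.gct", "Remap_only"))  # my_data_remapped

ids = ["1001_at", "1003_at"]
mapping = {"1001_at": "TP53"}
print(select_features(ids, mapping))                        # [('1001_at', 'TP53')]
print(select_features(ids, mapping, omit_unmatched=False))  # keeps 1003_at as well
```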
Input Files
- dataset file: This file contains the expression dataset in GCT or RES format. GSEA-specific TXT format files are also accepted.
- chip platform: This file defines symbol-to-gene mappings for a platform, possibly along with annotations, in CHIP format. The drop-down provides files from the MSigDB project for common platforms, but custom files may also be provided.
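For readers preparing a custom CHIP file, the sketch below reads the mapping into a dictionary. It assumes the CHIP file is tab-delimited with `Probe Set ID`, `Gene Symbol`, and `Gene Title` columns; check the CHIP format documentation for the authoritative layout. The function name `read_chip` and the sample rows are hypothetical.

```python
import csv
import io


def read_chip(handle):
    """Read a CHIP-style table into a dict of probe id -> gene symbol."""
    reader = csv.DictReader(handle, delimiter="\t")
    mapping = {}
    for row in reader:
        probe = (row.get("Probe Set ID") or "").strip()
        symbol = (row.get("Gene Symbol") or "").strip()
        # Rows without a probe id or without a symbol carry no usable mapping.
        if probe and symbol:
            mapping[probe] = symbol
    return mapping


# In-memory stand-in for a CHIP file; a real file would be opened with open(path, newline="").
sample_chip = io.StringIO(
    "Probe Set ID\tGene Symbol\tGene Title\n"
    "1001_at\tTP53\ttumor protein p53\n"
    "1002_at\tTP53\ttumor protein p53\n"
    "1003_at\t\tunannotated probe\n"
)
chip = read_chip(sample_chip)
print(chip)  # {'1001_at': 'TP53', '1002_at': 'TP53'}
```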
Output Files
- The collapsed data set (GCT): After collapsing, the result file is always written in GCT format, even if the input dataset was in a different format.
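To show what a GCT result looks like, here is a minimal writer sketch. It assumes the GCT 1.2 layout: a `#1.2` version line, a line with the row and column counts, a header line starting with `NAME` and `Description`, then one tab-delimited row per gene. The function `write_gct` and the placeholder description `na` are hypothetical and not taken from the module.

```python
import io


def write_gct(handle, sample_names, rows):
    """Write rows of (name, description, values) in a GCT 1.2 style layout."""
    handle.write("#1.2\n")
    # Second line: number of data rows, then number of samples.
    handle.write(f"{len(rows)}\t{len(sample_names)}\n")
    handle.write("\t".join(["NAME", "Description", *sample_names]) + "\n")
    for name, description, values in rows:
        fields = [name, description, *(str(value) for value in values)]
        handle.write("\t".join(fields) + "\n")


# Made-up collapsed rows for two genes and two samples.
buffer = io.StringIO()
write_gct(
    buffer,
    ["sample_1", "sample_2"],
    [("TP53", "na", [5.0, 9.0]), ("BRCA1", "na", [4.0, 4.0])],
)
gct_text = buffer.getvalue()
print(gct_text)
```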
Known Issues
Input files with spaces or special characters in their file names may cause errors.
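One way to avoid this issue is to check and rename files before uploading them. The sketch below is only a suggestion: the document does not say which characters are problematic, so the allowed set used here (letters, digits, dot, underscore, hyphen) is an assumption, and both function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: only these characters are treated as safe in a file name.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def has_risky_name(path):
    """Return True if the file name contains spaces or other unlisted characters."""
    return SAFE_NAME.fullmatch(Path(path).name) is None


def safe_name(path):
    """Replace each run of unlisted characters in the file name with one underscore."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", Path(path).name)


for candidate in ["expression_data.gct", "my data (v2).gct"]:
    if has_risky_name(candidate):
        print(f"{candidate!r} may cause errors; consider {safe_name(candidate)!r}")
    else:
        print(f"{candidate!r} looks fine")
```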
Task Type: Gene List Selection
CPU Type: any
Operating System: any
Language: Java
- 2.2.0 (2022-10-02): Updated to Human MSigDB v2022.1.Hs and Mouse MSigDB 2022.1.Mm.
- 2.1.5 (2022-09-15): Updated to Human MSigDB v2022.1.Hs. Direct support for Mouse MSigDB 2022.1.Mm is not yet available.
- 2.1.4 (2022-03-22): Removed Log4J entirely from the code base. Fixed weighted_p1.5 computation. Added min dataset size warnings.
- 2.1.3 (2022-01-20): Updated to Log4J 2.17.1.
- 2.1.2 (2022-01-12): Fixed a typo in the command line.
- 2.1.1 (2021-12-23): Updated with the GSEA Desktop 4.2.1 code base. Updated to Log4J 2.17.0. TXT file parser bug fix.
- 2.1.0 (2021-12-17): Updated with the GSEA Desktop 4.2.0 code base with numerous bug fixes. Adds the Abs_max_of_probes collapse mode. Fixes some issues handling datasets with missing values. Improved warnings and logging. Adds an output file name parameter. Fixed bugs in weighted_p1.5 scoring.
- 2.0.2 (2021-03-22): Fixed minor typo.
- 2.0.1 (2021-03-22): Minor doc updates.
- 2.0.0 (2021-01-14): Switched to the GSEA code base. Added new collapse.mode options and omit.features.with.no.symbol.match parameter.
Copyright © 2003-2022 Broad Institute, Inc., Massachusetts Institute of
Technology, and Regents of the University of California. All rights
reserved.