Lack of sufficiently strong informative features limits the potential of gene expression analysis as predictive tool for many clinical classification problems

Kenneth R. Hess, Caimiao Wei, Yuan Qi, Takayuki Iwamoto, W. Fraser Symmans, Lajos Pusztai

Research output: Contribution to journal (Article)

9 Citations (Scopus)

Abstract

Background: Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength) and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation.

Results: Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was > 2, even when the spiked samples represented 10% of all samples. We also assessed the gene signature set size and strength for 9 real clinical prediction problems in six different breast cancer data sets.

Conclusions: We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features. Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.
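The spike-in design described in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes synthetic Gaussian data on a log2 scale (the study perturbed real data sets), a nearest-centroid classifier with a simple mean-difference feature filter (the abstract does not name the models used), and balanced accuracy as the performance measure. All function names, sample counts, and parameter values are hypothetical.

```python
import math
import random

def avg(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def make_data(n_samples, n_probes, rng):
    """Synthetic log2 expression matrix (assumption): rows are samples, columns are probe sets."""
    return [[rng.gauss(8.0, 1.0) for _ in range(n_probes)] for _ in range(n_samples)]

def spike_signature(data, samples, probes, fold):
    """Raise the chosen probe sets in the chosen samples by `fold`, i.e. add log2(fold)."""
    shift = math.log2(fold)
    for i in samples:
        for j in probes:
            data[i][j] += shift

def fit(X, y, k):
    """Keep the k probe sets with the largest class-mean difference and store both centroids."""
    pos = [x for x, lab in zip(X, y) if lab]
    neg = [x for x, lab in zip(X, y) if not lab]
    n_probes = len(X[0])
    mu_pos = [avg([r[j] for r in pos]) for j in range(n_probes)]
    mu_neg = [avg([r[j] for r in neg]) for j in range(n_probes)]
    top = sorted(range(n_probes), key=lambda j: abs(mu_pos[j] - mu_neg[j]), reverse=True)[:k]
    return top, [mu_pos[j] for j in top], [mu_neg[j] for j in top]

def predict(model, x):
    """Return True if x is closer to the 'perturbed' centroid than to the 'unperturbed' one."""
    top, c_pos, c_neg = model
    d_pos = sum((x[j] - c) ** 2 for j, c in zip(top, c_pos))
    d_neg = sum((x[j] - c) ** 2 for j, c in zip(top, c_neg))
    return d_pos < d_neg

def monte_carlo_cv(X, y, k, n_splits, train_frac, rng):
    """Mean balanced accuracy over repeated random, class-stratified train/test splits."""
    groups = ([i for i, lab in enumerate(y) if lab], [i for i, lab in enumerate(y) if not lab])
    scores = []
    for _ in range(n_splits):
        train, test = [], []
        for group in groups:
            g = group[:]
            rng.shuffle(g)
            cut = max(1, int(len(g) * train_frac))
            train += g[:cut]
            test += g[cut:]
        # Feature selection happens inside each split, on training samples only.
        model = fit([X[i] for i in train], [y[i] for i in train], k)
        sens = avg([predict(model, X[i]) for i in test if y[i]])
        spec = avg([not predict(model, X[i]) for i in test if not y[i]])
        scores.append((sens + spec) / 2)
    return avg(scores)

rng = random.Random(0)
base = make_data(100, 200, rng)
spiked = list(range(10))                 # 10% of samples are perturbed
labels = [i in spiked for i in range(100)]
signature = list(range(10))              # signature size: 10 probe sets

strong = [row[:] for row in base]
spike_signature(strong, spiked, signature, fold=4.0)   # strong signature
weak = [row[:] for row in base]
spike_signature(weak, spiked, signature, fold=1.1)     # weak signature

strong_score = monte_carlo_cv(strong, labels, 10, 20, 2 / 3, rng)
weak_score = monte_carlo_cv(weak, labels, 10, 20, 2 / 3, rng)
print(f"strong: {strong_score:.2f}  weak: {weak_score:.2f}")
```

Selecting features inside each split keeps the test samples out of the filter step, which matters when the number of probe sets far exceeds the number of samples. In this toy setting the noise standard deviation is fixed at 1 log2 unit, so how a given fold change performs here says nothing about the thresholds reported for real data.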

Original language: English
Article number: 463
Journal: BMC Bioinformatics
Volume: 12
DOI: 10.1186/1471-2105-12-463
Publication status: Published - Dec 1 2011
Externally published: Yes

Fingerprint

Gene Expression Analysis
Gene expression
Classification Problems
Signature
Gene Expression
Genes
Gene
Fold
Probe
Prediction Model
Prediction
Predictors
Breast Neoplasms
Statistical methods
Gene Expression Data
Breast Cancer
Cross-validation
Statistical method
Datasets
Neoplasms

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics
  • Structural Biology

Cite this

Lack of sufficiently strong informative features limits the potential of gene expression analysis as predictive tool for many clinical classification problems. / Hess, Kenneth R.; Wei, Caimiao; Qi, Yuan; Iwamoto, Takayuki; Symmans, W. Fraser; Pusztai, Lajos.

In: BMC Bioinformatics, Vol. 12, 463, 01.12.2011.

@article{5bbeb6c43ad04149bb59f0854a853c23,
title = "Lack of sufficiently strong informative features limits the potential of gene expression analysis as predictive tool for many clinical classification problems",
abstract = "Background: Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength) and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation. Results: Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was > 2, even when the spiked samples represented 10{\%} of all samples. We also assessed the gene signature set size and strength for 9 real clinical prediction problems in six different breast cancer data sets. Conclusions: We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features. Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.",
author = "Hess, {Kenneth R.} and Caimiao Wei and Yuan Qi and Takayuki Iwamoto and Symmans, {W. Fraser} and Lajos Pusztai",
year = "2011",
month = "12",
day = "1",
doi = "10.1186/1471-2105-12-463",
language = "English",
volume = "12",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Lack of sufficiently strong informative features limits the potential of gene expression analysis as predictive tool for many clinical classification problems

AU - Hess, Kenneth R.

AU - Wei, Caimiao

AU - Qi, Yuan

AU - Iwamoto, Takayuki

AU - Symmans, W. Fraser

AU - Pusztai, Lajos

PY - 2011/12/1

Y1 - 2011/12/1

N2 - Background: Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength) and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation. Results: Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was > 2, even when the spiked samples represented 10% of all samples. We also assessed the gene signature set size and strength for 9 real clinical prediction problems in six different breast cancer data sets. Conclusions: We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features. Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.

AB - Background: Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength) and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation. Results: Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was > 2, even when the spiked samples represented 10% of all samples. We also assessed the gene signature set size and strength for 9 real clinical prediction problems in six different breast cancer data sets. Conclusions: We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features. Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.

UR - http://www.scopus.com/inward/record.url?scp=82355160911&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=82355160911&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-12-463

DO - 10.1186/1471-2105-12-463

M3 - Article

VL - 12

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 463

ER -