Feature selection and nearest centroid classification for protein mass spectrometry
Abstract
Background
The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be solved. Due to the sheer amount of information contained within the mass spectra, most standard machine learning techniques cannot be directly applied. Instead, feature selection techniques are used to first reduce the dimensionality of the input space and thus enable the subsequent use of classification algorithms. This paper examines feature selection techniques for proteomic mass spectrometry.
Results
This study examines the performance of the nearest centroid classifier coupled with the following feature selection algorithms. Student's t-test, the Kolmogorov-Smirnov test, and the P-test are univariate statistics used for filter-based feature ranking. From the wrapper approaches we tested sequential forward selection and a modified version of sequential backward selection. Embedded approaches included shrunken nearest centroid and a novel version of boosting-based feature selection we developed. In addition, we tested several dimensionality reduction approaches, namely principal component analysis and principal component analysis coupled with linear discriminant analysis. To fairly assess each algorithm, evaluation was done using stratified cross-validation with an internal leave-one-out cross-validation loop for automated feature selection. Comprehensive experiments, conducted on five popular cancer data sets, revealed that the less-advocated sequential forward selection and boosted feature selection algorithms produce the most consistent results across all data sets. In contrast, the state-of-the-art performance reported on isolated data sets for several of the studied algorithms does not hold across all data sets.
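As an illustration only, the following Python sketch combines one of the filter methods named above (t-statistic ranking) with a nearest centroid classifier, and picks the number of retained features with a leave-one-out loop. It is not the authors' implementation: the function names, the use of the Welch form of the t-statistic, the squared Euclidean distance, and the 0/1 label encoding are all assumptions made for the example. It shows only the inner selection loop, not the outer stratified cross-validation. The ranking is recomputed inside every leave-one-out fold so that the held-out spectrum never influences which features are chosen.

```python
import math

def t_scores(X, y):
    """Absolute Welch t-statistic per feature; labels are assumed to be 0/1
    and each class is assumed to have at least two samples."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
        vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
        # Guard against a zero denominator when a feature is constant.
        denom = math.sqrt(va / len(a) + vb / len(b)) or 1e-12
        scores.append(abs(ma - mb) / denom)
    return scores

def top_k(X, y, k):
    """Indices of the k highest-scoring features."""
    s = t_scores(X, y)
    return sorted(range(len(s)), key=lambda j: -s[j])[:k]

def fit_centroids(X, y, feats):
    """Per-class mean vector restricted to the selected features."""
    cents = {}
    for lab in set(y):
        rows = [row for row, l in zip(X, y) if l == lab]
        cents[lab] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    return cents

def predict(cents, x, feats):
    """Label of the centroid closest to x (squared Euclidean distance)."""
    def dist(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(feats, c))
    return min(cents, key=lambda lab: dist(cents[lab]))

def loo_error(X, y, k):
    """Leave-one-out error rate when keeping the top-k ranked features."""
    errors = 0
    for i in range(len(X)):
        Xtr, ytr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_k(Xtr, ytr, k)  # ranking uses training samples only
        cents = fit_centroids(Xtr, ytr, feats)
        errors += predict(cents, X[i], feats) != y[i]
    return errors / len(X)

def select_k(X, y, candidates):
    """Smallest candidate k with the lowest leave-one-out error."""
    return min(candidates, key=lambda k: (loo_error(X, y, k), k))

# Toy data (hypothetical): feature 0 separates the classes, the rest are noise.
X = [[0.0, 5.0, 1.0], [0.2, 1.0, 3.0], [0.1, 4.0, 2.0],
     [1.0, 2.0, 2.5], [1.1, 5.0, 1.5], [0.9, 3.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
best_k = select_k(X, y, [1, 2, 3])
```

In a full evaluation along the lines described above, `select_k` would be called on the training portion of each outer stratified fold, and the resulting classifier scored on that fold's held-out portion.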
Conclusion
This study tested a number of popular feature selection methods using the nearest centroid classifier and found that several reportedly state-of-the-art algorithms in fact perform rather poorly when tested via stratified cross-validation. The revealed inconsistencies provide clear evidence that algorithm evaluation should be performed on several data sets using a consistent (i.e., non-randomized, stratified) cross-validation procedure in order for the conclusions to be statistically sound.
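One possible reading of a "non-randomized, stratified" procedure is a deterministic fold assignment that preserves class proportions, so that every algorithm is scored on identical splits. The Python sketch below is an assumption about how such a split could be built, not the procedure used in the study; the function name `stratified_folds` is hypothetical.

```python
from collections import defaultdict

def stratified_folds(y, n_folds):
    """Deterministically assign sample indices to folds, class by class,
    in round-robin order so each fold keeps roughly the class proportions."""
    by_class = defaultdict(list)
    for i, lab in enumerate(y):
        by_class[lab].append(i)
    folds = [[] for _ in range(n_folds)]
    pos = 0  # running counter across classes keeps fold sizes balanced
    for lab in sorted(by_class):
        for i in by_class[lab]:
            folds[pos % n_folds].append(i)
            pos += 1
    return folds

# Four samples of class 0 and two of class 1, split into two folds.
folds = stratified_folds([0, 0, 0, 0, 1, 1], 2)
```

Because no random number generator is involved, repeated runs and different algorithms see exactly the same folds, which makes their error rates directly comparable.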


