Feature selection and nearest centroid classification for protein mass spectrometry

Abstract

Background: The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be addressed: because of the sheer amount of information contained in the mass spectra, most standard machine learning techniques cannot be applied directly. Instead, feature selection techniques are first used to reduce the dimensionality of the input space, which enables the subsequent use of classification algorithms. This paper examines feature selection techniques for proteomic mass spectrometry.

Results: This study examines the performance of the nearest centroid classifier coupled with the following feature selection algorithms. The Student t-test, the Kolmogorov-Smirnov test, and the P-test are univariate statistics used for filter-based feature ranking. From the wrapper approaches, we tested sequential forward selection and a modified version of sequential backward selection. Embedded approaches included shrunken nearest centroid and a novel version of boosting-based feature selection that we developed. In addition, we tested several dimensionality reduction approaches, namely principal component analysis and principal component analysis coupled with linear discriminant analysis. To assess each algorithm fairly, evaluation was done using stratified cross-validation with an internal leave-one-out cross-validation loop for automated feature selection. Comprehensive experiments, conducted on five popular cancer data sets, revealed that the less-advocated sequential forward selection and boosted feature selection algorithms produce the most consistent results across all data sets. In contrast, the state-of-the-art performance reported on isolated data sets for several of the studied algorithms does not hold across all data sets.
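As a rough illustration of the filter-plus-classifier pipeline named in the Results, the sketch below ranks features by a two-sample t-statistic and then applies a nearest centroid classifier to the top-ranked ones. It is a minimal sketch under stated assumptions, not the paper's implementation: the data are synthetic, the Welch form of the t-statistic is chosen for convenience, and the function names (`t_scores`, `fit_centroids`, `predict`) are hypothetical.

```python
import numpy as np

def t_scores(X, y):
    """Absolute Welch t-statistic per feature for binary labels 0/1."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)

def fit_centroids(X, y):
    """Return the class labels and the mean vector (centroid) of each class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X, classes, centroids):
    """Assign each row to the class whose centroid is nearest (squared Euclidean)."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

# Synthetic stand-in for spectra (an assumption): 40 samples, 200 features,
# of which only the first three differ between the two classes.
rng = np.random.default_rng(0)
n, p = 40, 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :3] += 3.0

# Filter step: keep the three highest-ranked features.
top = np.argsort(t_scores(X, y))[::-1][:3]

# Classification step on the reduced feature set.
classes, centroids = fit_centroids(X[:, top], y)
train_acc = (predict(X[:, top], classes, centroids) == y).mean()
print(sorted(top.tolist()), train_acc)
```

Note that `train_acc` here is a resubstitution figure: features are ranked and the classifier is scored on the same samples, so it is optimistically biased. An unbiased estimate requires repeating the ranking inside each cross-validation fold.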
Conclusion: This study tested a number of popular feature selection methods using the nearest centroid classifier and found that several reportedly state-of-the-art algorithms in fact perform rather poorly when tested via stratified cross-validation. The revealed inconsistencies provide clear evidence that algorithm evaluation should be performed on several data sets using a consistent (i.e., non-randomized, stratified) cross-validation procedure in order for the conclusions to be statistically sound.
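The evaluation protocol described in the abstract, stratified cross-validation with an internal leave-one-out loop that selects the features automatically, might be sketched as follows. This is an assumption-laden illustration rather than the paper's procedure: it uses t-statistic ranking as the only selection method, synthetic data, a hypothetical candidate list of feature counts, and a simple deterministic round-robin split to stand in for "non-randomized, stratified" folds.

```python
import numpy as np

def t_scores(X, y):
    """Absolute Welch t-statistic per feature for binary labels 0/1."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)

def nc_predict(Xtr, ytr, Xte):
    """Nearest centroid prediction using squared Euclidean distance."""
    classes = np.unique(ytr)
    cents = np.stack([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = ((Xte[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def stratified_folds(y, k):
    """Deterministic stratified split: deal each class's indices round-robin."""
    folds = [[] for _ in range(k)]
    for c in np.unique(y):
        for i, idx in enumerate(np.flatnonzero(y == c)):
            folds[i % k].append(idx)
    return [np.array(f) for f in folds]

def loo_accuracy(X, y, n_feat):
    """Inner leave-one-out accuracy; features are re-ranked without the held-out sample."""
    hits = 0
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        top = np.argsort(t_scores(X[tr], y[tr]))[::-1][:n_feat]
        hits += int(nc_predict(X[tr][:, top], y[tr], X[i:i + 1, top])[0] == y[i])
    return hits / len(y)

def nested_cv(X, y, k=5, candidates=(1, 2, 5, 10)):
    """Outer stratified k-fold; the inner loop picks the number of features."""
    accs = []
    for test in stratified_folds(y, k):
        train = np.setdiff1d(np.arange(len(y)), test)
        Xtr, ytr = X[train], y[train]
        # Choose the feature count using the training fold only.
        best = max(candidates, key=lambda m: loo_accuracy(Xtr, ytr, m))
        top = np.argsort(t_scores(Xtr, ytr))[::-1][:best]
        pred = nc_predict(Xtr[:, top], ytr, X[test][:, top])
        accs.append((pred == y[test]).mean())
    return float(np.mean(accs))

# Synthetic data (an assumption): three informative features out of 50.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 50))
X[y == 1, :3] += 3.0

acc = nested_cv(X, y)
print(acc)
```

The design point is that the test fold never influences either the feature ranking or the choice of how many features to keep, which is what makes results comparable across algorithms and data sets.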

Springer Science and Business Media LLC

Ilya Levner

BMC Bioinformatics

2005
