Confound-leakage: confound removal in machine learning leads to leakage

Abstract

Background: Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance, in personalized medicine, as they are capable of modeling complex relationships between features and the target. Problematically, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, possible pitfalls of using CR in ML pipelines are not fully understood.

Results: We provide new evidence that, contrary to general expectations, linear confound regression can increase the risk of confounding when combined with nonlinear ML approaches. Using a simple framework that uses the target as a confound, we show that information leaked via CR can increase null or moderate effects to near-perfect prediction. By shuffling the features, we provide evidence that this increase is indeed due to confound-leakage and not due to revealing of information. We then demonstrate the danger of confound-leakage in a real-world clinical application where the accuracy of predicting attention-deficit/hyperactivity disorder is overestimated using speech-derived features when using depression as a confound.

Conclusions: Mishandling or even amplifying confounding effects when building ML models due to confound-leakage, as shown, can lead to untrustworthy, biased, and unfair predictions. Our exposé of the confound-leakage pitfall and provided guidelines for dealing with it can help create more robust and trustworthy ML models.
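The mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or experiment: the toy data, the helper names (`fit_cr`, `apply_cr`, `fit_lookup`, `accuracy`), and the use of a value-lookup table as a stand-in for a nonlinear learner are all assumptions made for illustration. Following the abstract's framework, the target itself is used as the confound. The binary feature carries almost no information about the target, but after featurewise linear CR the residuals take class-specific values that a nonlinear model can memorize.

```python
from collections import Counter, defaultdict
from statistics import mean


def fit_cr(feature, confound):
    """Fit a one-confound linear regression; return (intercept, slope)."""
    f_mean, c_mean = mean(feature), mean(confound)
    cov = sum((f - f_mean) * (c - c_mean) for f, c in zip(feature, confound))
    var = sum((c - c_mean) ** 2 for c in confound)
    slope = cov / var
    return f_mean - slope * c_mean, slope


def apply_cr(feature, confound, intercept, slope):
    """Return residuals: the feature with the fitted confound effect removed."""
    return [round(f - (intercept + slope * c), 10)
            for f, c in zip(feature, confound)]


def fit_lookup(values, labels):
    """Hypothetical stand-in for a nonlinear learner: majority class per value."""
    counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        counts[v][y] += 1
    return {v: c.most_common(1)[0][0] for v, c in counts.items()}


def accuracy(table, values, labels, default=0):
    """Fraction of samples whose looked-up class matches the label."""
    hits = sum(table.get(v, default) == y for v, y in zip(values, labels))
    return hits / len(labels)


# Hypothetical toy data: a binary feature that is only weakly related to
# the target in training and unrelated to it in the test set.
y_train = [0] * 8 + [1] * 8
x_train = [0, 0, 0, 0, 0, 1, 1, 1] + [0, 0, 0, 0, 1, 1, 1, 1]
y_test = [0] * 4 + [1] * 4
x_test = [0, 0, 1, 1] + [0, 0, 1, 1]

# Baseline: learn from the raw feature.
raw_acc = accuracy(fit_lookup(x_train, y_train), x_test, y_test)

# CR fitted on training data only, with the target used as the confound.
intercept, slope = fit_cr(x_train, y_train)
r_train = apply_cr(x_train, y_train, intercept, slope)
r_test = apply_cr(x_test, y_test, intercept, slope)
cr_acc = accuracy(fit_lookup(r_train, y_train), r_test, y_test)

print("accuracy on raw feature:", raw_acc)
print("accuracy after confound regression:", cr_acc)
```

In this sketch the CR model is fitted on the training data only, yet the test residuals still encode the class, because removing the confound from a test sample requires that sample's confound value. A linear model on the residuals would not benefit in the same way; the jump depends on a learner that can separate the class-specific residual values.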

Related Results

Hydatid Disease of The Brain Parenchyma: A Systematic Review
Abstract Introduction Isolated brain hydatid disease (BHD) is an extremely rare form of echinococcosis. A prompt and timely diagnosis is a crucial step in disease management. This ...
Selection of Injectable Drug Product Composition using Machine Learning Models (Preprint)
BACKGROUND As of July 2020, a Web of Science search of “machine learning (ML)” nested within the search of “pharmacokinetics or pharmacodynamics” yielded over 100...
CREATING LEARNING MEDIA IN TEACHING ENGLISH AT SMP MUHAMMADIYAH 2 PAGELARAN ACADEMIC YEAR 2020/2021
The Covid-19 pandemic currently demands that teachers be able to use technology in the teaching and learning process. In reality, however, there are still many teachers who have not been able ...
Multiple Water and Sand Leakage Model Tests for Shield Tunnels
Water and sand leakage in shield tunnels has become more of a research interest in recent years. On the other hand, accidents involving underground engineering can take many forms ...
Early circulating biomarker signatures of dengue-associated plasma leakage uncovered by proteomics
Abstract Background Plasma leakage, a defining feature of severe dengue, lacks early predictive biomarkers critical for timely clinical d...
Radiological appearances of Anastomotic Leakage after Radical Gastrectomy
Abstract Background Anastomotic leakage is a critical postoperative complication after gastric cancer surgery. Previous studies...
Continuous Leakage-Amplified Public-Key Encryption With CCA Security
Abstract Secret key leakage has become a security threat in computer systems, and it is crucial that cryptographic schemes should resist various leakage attacks, inc...
