Confound-leakage: confound removal in machine learning leads to leakage
Abstract
Background
Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance, in personalized medicine, as they are capable of modeling complex relationships between features and the target. Problematically, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, possible pitfalls of using CR in ML pipelines are not fully understood.
Results
We provide new evidence that, contrary to general expectations, linear confound regression can increase the risk of confounding when combined with nonlinear ML approaches. Using a simple framework that uses the target as a confound, we show that information leaked via CR can increase null or moderate effects to near-perfect prediction. By shuffling the features, we provide evidence that this increase is indeed due to confound-leakage and not due to revealing of information. We then demonstrate the danger of confound-leakage in a real-world clinical application where the accuracy of predicting attention-deficit/hyperactivity disorder is overestimated using speech-derived features when using depression as a confound.
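The leakage mechanism described above can be illustrated with a minimal sketch. It is a toy construction under stated assumptions, not the authors' experiments: the data are hand-built, the feature is binary and nearly unrelated to the target, the target itself is used as the confound, and a 1-nearest-neighbour classifier stands in for a generic nonlinear learner. All function and variable names (`make_split`, `fit_linear_cr`, `residualize`, `one_nn_accuracy`) are hypothetical.

```python
# Toy illustration (assumed setup, not the paper's data): a binary feature
# that is almost independent of a binary target, with the target used as
# the confound in featurewise linear confound regression (CR).


def make_split(n_per_class, ones_in_class0, ones_in_class1):
    """Build (x, y) lists with a fixed number of ones per class."""
    x, y = [], []
    for label, n_ones in ((0, ones_in_class0), (1, ones_in_class1)):
        x += [1.0] * n_ones + [0.0] * (n_per_class - n_ones)
        y += [label] * n_per_class
    return x, y


def fit_linear_cr(x, c):
    """Ordinary least squares of feature x on confound c: x ~ a + b * c."""
    n = len(x)
    mean_x, mean_c = sum(x) / n, sum(c) / n
    cov = sum((xi - mean_x) * (ci - mean_c) for xi, ci in zip(x, c))
    var = sum((ci - mean_c) ** 2 for ci in c)
    b = cov / var
    a = mean_x - b * mean_c
    return a, b


def residualize(x, c, a, b):
    """Subtract the fitted linear confound effect from the feature."""
    return [xi - (a + b * ci) for xi, ci in zip(x, c)]


def one_nn_accuracy(train_f, train_y, test_f, test_y):
    """Accuracy of a 1-nearest-neighbour classifier on a single feature."""
    hits = 0
    for f, label in zip(test_f, test_y):
        nearest = min(range(len(train_f)), key=lambda i: abs(train_f[i] - f))
        hits += train_y[nearest] == label
    return hits / len(test_y)


# Training set: 50% ones in class 0 and 52% ones in class 1 (a tiny effect).
x_train, y_train = make_split(100, 50, 52)
# Test set: the feature is exactly balanced across classes (no effect).
x_test, y_test = make_split(20, 10, 10)

# CR is fitted on the training data only, then applied to both splits.
a, b = fit_linear_cr(x_train, y_train)
r_train = residualize(x_train, y_train, a, b)
r_test = residualize(x_test, y_test, a, b)

raw_acc = one_nn_accuracy(x_train, y_train, x_test, y_test)
leaked_acc = one_nn_accuracy(r_train, y_train, r_test, y_test)
print(f"slope b = {b:.4f}")
print(f"accuracy on raw feature:          {raw_acc:.2f}")
print(f"accuracy on residualized feature: {leaked_acc:.2f}")
```

In this sketch the raw feature supports only chance-level prediction, whereas the residualized feature is predicted perfectly. Because the feature takes only two values and the fitted slope is small but nonzero, each residual value is shifted by an amount that depends on the confound, so a nonlinear model can read the confound (here, the target) directly off the residuals.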
Conclusions
Mishandling or even amplifying confounding effects when building ML models due to confound-leakage, as shown, can lead to untrustworthy, biased, and unfair predictions. Our exposé of the confound-leakage pitfall and the guidelines we provide for dealing with it can help create more robust and trustworthy ML models.
Oxford University Press (OUP)