Robust Random Forests for Genomic Prediction: Challenges and Remedies
Abstract
Data contamination—from recording errors to extreme outliers—can compromise statistical models by biasing predictions, inflating prediction errors, and, in severe cases, destabilizing performance in high-dimensional settings. Although contamination can affect responses and covariates, we focus on response contamination and evaluate Random Forests through simulation. Using a synthetic animal-breeding dataset, we assess robust Random Forests across several contamination scenarios and validate them on plant and animal datasets. We thereby clarify the consequences of contamination for prediction, develop a robust Random Forest framework, and evaluate its performance. We examine preprocessing or data-transformation strategies, algorithmic modifications, and hybrid approaches for robustifying Random Forests. Across these approaches, data transformation emerges as the most effective strategy, delivering the strongest performance under contamination. This strategy is simple, general, and transferable to other Machine Learning methods, offering a remedy for robust genomic prediction. In real breeding data, robust Random Forests are useful when substantial contamination, phenotypic corruption, misrecording, or train–deployment mismatch is plausible and the goal is to recover a latent signal for genomic prediction and selection; ranking-based robust Random Forests are the dependable first option, whereas weighting-based Random Forests should be used only when their weighting scheme preserves rank structure and improves prediction. Robustification is not universally necessary, but it becomes important when contamination distorts the link between observed responses and the predictive target; standard Random Forests remain the default for clean data, whereas robust Random Forests should be fitted alongside them whenever contamination is plausible, with the final choice guided by data, trait, and breeding objective.
Author summary
Machine learning (ML) methods are widely used for prediction with high-dimensional, complex data, and supervised approaches such as Random Forests (RF) have proved effective for genomic prediction (GP) and selection. Yet their performance can be severely compromised by data contamination if the algorithms rely on classical data-driven procedures that are sensitive to atypical observations. Robustifying ML methods is therefore important both for improving predictive performance under contamination and for guiding their practical use in high-dimensional prediction problems. To address this need, we develop robust preprocessing, algorithm-level, and hybrid strategies for improving RF performance with contaminated data. Using simulated animal data, we show that ranking- and weighting-based robust RF provide the strongest overall compromise for genomic prediction and selection under contamination. Validation on several plant and animal breeding datasets further shows that the benefits of robustification are not universal, but depend on the dataset, trait, and breeding objective. Although motivated by RF, the framework we propose is general, practical, and readily transferable to other ML methods. It also offers a basis for deciding when robustness should complement standard RF rather than replace it outright.
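To illustrate what a ranking-based transformation of the response can look like, the sketch below maps phenotypes to normal scores computed from their ranks, so that an extreme value keeps its position in the ordering but loses its leverage. This summary does not specify the exact transformation used, so the choice of a rank-based inverse normal transform, the Blom-type offset (0.375), and the function name `rank_transform` are all assumptions made for illustration only.

```python
from statistics import NormalDist


def rank_transform(y):
    """Map responses to normal scores based on their ranks.

    Hypothetical helper: a rank-based inverse normal transform with a
    Blom-type offset. Tied values receive the average of their ranks.
    """
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])  # indices sorted by response
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied values.
        j = i
        while j + 1 < n and y[order[j + 1]] == y[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # The offset keeps every probability strictly inside (0, 1).
    return [nd.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


# Toy phenotypes with one grossly misrecorded value (250.0).
phenotypes = [1.2, 0.8, 1.0, 250.0, 0.9]
scores = rank_transform(phenotypes)
print(scores)
```

The transformed scores would then replace the raw phenotypes as the response when fitting a standard Random Forest. Predictions are returned on the transformed scale, which preserves the ordering of candidates and is therefore still usable for ranking and selection.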