Robust Random Forests for Genomic Prediction: Challenges and Remedies

Abstract

Data contamination, from recording errors to extreme outliers, can compromise statistical models by biasing predictions, inflating prediction errors, and, in severe cases, destabilizing performance in high-dimensional settings. Although contamination can affect both responses and covariates, we focus on response contamination and evaluate Random Forests through simulation. Using a synthetic animal-breeding dataset, we assess robust Random Forests across several contamination scenarios and validate them on plant and animal datasets. We thereby clarify the consequences of contamination for prediction, develop a robust Random Forest framework, and evaluate its performance. We examine preprocessing (data-transformation) strategies, algorithmic modifications, and hybrid approaches for robustifying Random Forests. Across these approaches, data transformation emerges as the most effective strategy, delivering the strongest performance under contamination. This strategy is simple, general, and transferable to other Machine Learning methods, offering a remedy for robust genomic prediction. In real breeding data, robust Random Forests are useful when substantial contamination, phenotypic corruption, misrecording, or train–deployment mismatch is plausible and the goal is to recover a latent signal for genomic prediction and selection; ranking-based robust Random Forests are the dependable first option, whereas weighting-based Random Forests should be used only when their weighting scheme preserves rank structure and improves prediction. Robustification is not universally necessary, but it becomes important when contamination distorts the link between observed responses and the predictive target; standard Random Forests remain the default for clean data, whereas robust Random Forests should be fitted alongside them whenever contamination is plausible, with the final choice guided by data, trait, and breeding objective.

Author summary

Machine learning (ML) methods are widely used for prediction with high-dimensional, complex data, and supervised approaches such as Random Forests (RF) have proved effective for genomic prediction (GP) and selection. Yet their performance can be severely compromised by data contamination if the algorithms rely on classical data-driven procedures that are sensitive to atypical observations. Robustifying ML methods is therefore important both for improving predictive performance under contamination and for guiding their practical use in high-dimensional prediction problems. To address this need, we develop robust preprocessing, algorithm-level, and hybrid strategies for improving RF performance with contaminated data. Using simulated animal data, we show that ranking- and weighting-based robust RF provide the strongest overall compromise for genomic prediction and selection under contamination. Validation on several plant and animal breeding datasets further shows that the benefits of robustification are not universal, but depend on the dataset, trait, and breeding objective. Although motivated by RF, the framework we propose is general, practical, and readily transferable to other ML methods. It also offers a basis for deciding when robustness should complement standard RF rather than replace it outright.
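To make the idea of a ranking-based response transformation concrete, the following is a minimal sketch and not the authors' implementation. It assumes the transformation is a rank-based inverse normal (normal-scores) transform with the Blom offset `c = 0.375`; the function name `rank_inverse_normal` and the toy phenotype values are hypothetical. The point it illustrates is that a grossly misrecorded response keeps its rank but loses its extreme magnitude, so the transformed target passed to a Random Forest is bounded.

```python
from statistics import NormalDist


def rank_inverse_normal(y, c=0.375):
    """Map responses to normal scores computed from their ranks.

    Ties receive the average rank. The offset c = 0.375 (Blom) is an
    assumption made for this sketch, not a value taken from the paper.
    """
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to the end of the current group of tied values.
        j = i
        while j + 1 < n and y[order[j + 1]] == y[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # Convert each rank to a probability in (0, 1), then to a normal quantile.
    return [nd.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


# Hypothetical phenotypes; the third record is misrecorded in the second list.
clean = [1.2, 0.7, 1.9, 1.5, 0.9]
contaminated = clean[:]
contaminated[2] = 190.0

z_clean = rank_inverse_normal(clean)
z_contaminated = rank_inverse_normal(contaminated)
```

In this toy case the contaminated record was already the largest value, so its rank is unchanged and the two transformed vectors coincide. The transformed responses would then replace the raw responses as the training target of whichever Random Forest implementation is in use; predictions are on the normal-score scale, which is adequate when the goal is ranking candidates for selection.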

openRxiv

Vanda M. Lourenço, Joseph O. Ogutu, Hans-Peter Piepho

2026
