
REBALANCING DATA FOR CANCER-ASSOCIATED THROMBOSIS: COMPARISON OF DIFFERENT RESAMPLING APPROACH

Objective: Cancer-associated thrombosis (CAT) presents a complex challenge in oncology, and class imbalance in CAT datasets often leads to suboptimal machine learning (ML) classification. Because many ML algorithms were originally designed for balanced datasets, this study evaluated how logistic regression (LR) and eXtreme Gradient Boosting (XGBoost) interact with data resampling techniques when predicting CAT from imbalanced Malaysian data on CAT (MDCAT).
Methods: Random oversampling (ROS), random undersampling (RUS), and a combined oversampling and undersampling approach (BOTH) were applied to the MDCAT dataset. Classification was performed with LR and XGBoost in R version 4.3.1. Classifier performance was assessed using accuracy, sensitivity, specificity, and the area under the ROC curve (AUROC, reported below as AUC) to evaluate the impact of the different resampling techniques.
Results: Applying LR and XGBoost to the imbalanced data yielded high specificity but low sensitivity in the testing samples. XGBoost performance declined substantially, with the AUC decreasing from 0.794 in training to 0.381 in testing. Metastasis, surgery, and Indian ethnicity were statistically significantly associated with the CAT event across all resampling techniques. Among XGBoost models, oversampling (XO) exhibited excellent training performance (accuracy 0.99; AUC 0.98) but showed a large performance drop on the test set (accuracy 0.82; AUC 0.72). Among LR models, undersampling yielded the highest training accuracy (0.83), with an AUC of 0.82. Tuning amplified the differences between resampling strategies and highlighted clear classifier–resampling interactions. XGBoost benefited most from tuning, particularly when trained on mixed and oversampled datasets, while LR remained comparatively stable.
Conclusion: This study demonstrated that the effectiveness of prediction models on the imbalanced MDCAT dataset is strongly influenced by the interaction between classifier characteristics and resampling strategies. The performance gains of a tuned XGBoost model with mixed resampling outweighed the benefits of LR’s simplicity and stability, making tuned XGBoost with mixed resampling our recommended approach given the primary importance of AUC.
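
The three resampling schemes and the four evaluation metrics named in the abstract can be illustrated with a minimal sketch. The study itself was carried out in R; the Python below is not the authors' code, and every function name in it (`split_by_class`, `resample`, `confusion_metrics`, `auroc`) is hypothetical. It assumes binary labels with 1 for a CAT event and 0 otherwise, and it assumes that the BOTH scheme keeps the original sample size and targets an even class split, which the abstract does not specify.

```python
import random
from collections import Counter


def split_by_class(labels):
    """Return (minority_indices, majority_indices) for binary 0/1 labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def resample(labels, method, rng):
    """Return row indices of a rebalanced sample.

    'ros'  : keep all rows, add minority rows drawn with replacement
             until both classes have the same size.
    'rus'  : keep all minority rows, draw majority rows without
             replacement down to the minority size.
    'both' : keep the total size n; oversample the minority and
             undersample the majority to about n / 2 each
             (the even split is an assumption of this sketch).
    """
    minority, majority = split_by_class(labels)
    if method == "ros":
        extra = rng.choices(minority, k=len(majority) - len(minority))
        return majority + minority + extra
    if method == "rus":
        return minority + rng.sample(majority, k=len(minority))
    if method == "both":
        half = len(labels) // 2
        return (rng.choices(minority, k=half)
                + rng.sample(majority, k=len(labels) - half))
    raise ValueError(f"unknown method: {method}")


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auroc(y_true, scores):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [1] * 10 + [0] * 90  # synthetic 10:90 imbalance, not MDCAT data
    for method in ("ros", "rus", "both"):
        idx = resample(labels, method, rng)
        print(method, Counter(labels[i] for i in idx))
```

In a sketch like this, resampling would normally be applied to the training split only, with the test split left at its original class ratio, so that test-set sensitivity and AUC reflect the imbalance the model would face in practice. A pattern of high specificity with low sensitivity, as reported for the unbalanced data, is what `confusion_metrics` returns when a classifier predicts the majority class for nearly every case.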

