A robust and generalized framework in diabetes classification across heterogeneous environments

Abstract

Diabetes mellitus (DM) is a major global health challenge that affects people of all ages and demographic backgrounds, with particular implications for women during pregnancy and the postpartum period. Sedentary lifestyles and poor dietary habits have contributed substantially to the rising incidence of this metabolic disorder. Timely identification of DM in women is crucial for preventing related complications and enabling effective therapeutic interventions. However, conventional predictive models frequently show limited external validity when applied across heterogeneous datasets, which can compromise their clinical utility. This study proposes a robust machine learning (ML) framework for diabetes prediction across diverse populations using two distinct datasets: the PIMA and BD datasets. The framework employs intra-dataset, inter-dataset, and partial fusion dataset validation to assess the generalizability and performance of various models. In intra-dataset validation, the Extreme Gradient Boosting (XGBoost) model achieved the highest accuracy on the PIMA dataset (79%), while the Random Forest (RF) and Gradient Boosting (GB) models reached accuracy close to 99% on the BD dataset. In inter-dataset validation, where models were trained on one dataset and tested on the other, the ensemble model outperformed the others with 88% accuracy when trained on PIMA and tested on BD. However, accuracy declined to 74% when training on BD and testing on PIMA, reflecting the difficulty of inter-dataset generalization. Finally, in partial fusion data validation, the deep learning (DL) model achieved 74% accuracy when trained on the BD dataset augmented with 30% of the PIMA dataset, and 98% when trained on the PIMA dataset combined with 30% of the BD data. These findings indicate that dataset diversity and partial dataset fusion can significantly enhance model robustness and generalizability. The framework offers valuable insights into the complexities of diabetes prediction across heterogeneous environments.
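The three validation protocols named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic stand-ins rather than the PIMA or BD datasets, the nearest-centroid classifier is a placeholder for the XGBoost, RF, GB, ensemble, and DL models of the study, and all function names are hypothetical. The abstract does not say which samples the partial fusion models were tested on; the sketch assumes the held-out remainder of the partially added dataset.

```python
import random

def make_dataset(n, shift, seed):
    """Synthetic stand-in for a cohort: two features, binary label (not real data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 1 is centred 2 units above class 0; `shift` mimics a cohort offset.
        centre = shift + 2.0 * label
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows

def fit_centroids(rows):
    """Placeholder model: mean feature vector per class (nearest centroid)."""
    sums, counts = {}, {}
    for x, y in rows:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)

def split(rows, frac, seed):
    """Shuffle a copy and return (first `frac` of the rows, the remainder)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    k = round(len(shuffled) * frac)
    return shuffled[:k], shuffled[k:]

def intra_dataset(ds, seed=0):
    """Train and test on disjoint parts of the same dataset (80/20 split assumed)."""
    train, test = split(ds, 0.8, seed)
    return accuracy(fit_centroids(train), test)

def inter_dataset(train_ds, test_ds):
    """Train on one dataset, test on the other."""
    return accuracy(fit_centroids(train_ds), test_ds)

def partial_fusion(base_ds, other_ds, frac=0.3, seed=0):
    """Train on base_ds plus `frac` of other_ds; test on the rest of other_ds
    (the choice of test set is an assumption, see the lead-in)."""
    borrowed, held_out = split(other_ds, frac, seed)
    return accuracy(fit_centroids(base_ds + borrowed), held_out)

# Hypothetical cohorts with different feature offsets.
cohort_a = make_dataset(200, shift=0.0, seed=1)
cohort_b = make_dataset(200, shift=1.0, seed=2)

print("intra (A):", intra_dataset(cohort_a))
print("inter (A -> B):", inter_dataset(cohort_a, cohort_b))
print("partial fusion (A + 30% B):", partial_fusion(cohort_a, cohort_b))
```

The accuracies printed by this sketch say nothing about the reported results; the point is only how the training and test sets differ between the three protocols.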

Publication DOI: https://doi.org/10.1016/j.compbiomed.2025.109720
Divisions: College of Engineering & Physical Sciences > Aston Digital Futures Institute
College of Engineering & Physical Sciences
Aston University (General)
Additional Information: Copyright © 2025 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/).
Uncontrolled Keywords: Machine learning, Deep learning, Diabetes, Partial fusion data, PIMA dataset, BD dataset, Gestational, Postpartum, Ensemble learning
Publication ISSN: 1879-0534
Last Modified: 11 Mar 2025 18:02
Date Deposited: 10 Feb 2025 17:51
Related URLs: https://www.sci ... 0708?via%3Dihub (Publisher URL)
http://www.scop ... tnerID=8YFLogxK (Scopus URL)
PURE Output Type: Article
Published Date: 2025-03
Published Online Date: 2025-01-25
Accepted Date: 2025-01-17
Authors: Zhou, Hejia
Rahman, Saifur
Angelova, Maia (ORCID Profile 0000-0002-0931-0916)
Bruce, Clinton R
Karmakar, Chandan
