Enhancing disease clustering through symptom-based analysis and large language model interpretations
Journal article   Open access   Peer reviewed

Efe Onojete, Ebuka Ibeke, Chinedu Pascal Ezenkwu, Professor Celestine Iwendi and Imed Ben Dhaou
Scientific Reports, Vol.15, 36651
21/10/2025

Abstract

Keywords: Diseases; Large language model; Symptoms; Interpretability; Clustering; Unsupervised learning; Machine learning
Humans face various diseases that are mainly caused by environmental conditions and living habits. These diseases exhibit several symptoms and can be related to one another through the symptoms they share. Identifying and interpreting these symptom-based groups of diseases can aid in developing treatment plans for a new disease outbreak. This research explores the intersection of machine learning and healthcare, focusing on the enhancement of disease classification through symptom-based cluster analysis. By leveraging unsupervised machine learning algorithms, patterns and relationships within diverse symptom datasets were identified, revealing novel associations and subtypes in disease manifestation. The integration of a Large Language Model (LLM), specifically OpenAI's Generative Pretrained Transformer (GPT), played a pivotal role in interpreting and communicating the complex outputs of the machine learning process. The results indicated a significant improvement in defining distinct clusters based on the relationship between diseases and symptoms, with GPT-4o providing simplified explanations that bridge the gap between machine-generated insights and healthcare professionals' understanding. The study's findings offer a more profound understanding of the distinctive features characterising the different clusters of diseases generated by the machine learning models.

The healthcare field produces extensive and varied data, which machine learning (ML) algorithms can leverage to detect new illnesses and optimize treatment plans [1]. Deep learning (DL), when trained on high-quality data, has significantly advanced clinical diagnostics and facilitated disease clustering [2]. One example is symptom-based clustering, which can enhance diagnostic accuracy and support personalized patient care [3].
Diseases with overlapping symptoms pose significant challenges for accurate clinical diagnosis, a problem that can be mitigated through coordinated care and collaboration between multidisciplinary teams [4]. Traditionally, physical exams or laboratory tests are used to identify diseases. This process can be complicated and sometimes inaccurate, as many diseases share similar symptoms [5]. ML techniques help to discover new disease subtypes and to understand the diversity of the patient population by uncovering hidden patterns within complex datasets [6]. Symptom-based cluster analysis is an effective technique for providing precise and targeted medical information [7]. However, interpreting the output of these complex models poses a unique challenge. Watson [8] argued that while clustering algorithms efficiently reveal connections, converting these clusters and patterns into meaningful medical insights is difficult.
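To make the idea of symptom-based clustering followed by LLM interpretation concrete, the following is a minimal sketch and not the authors' pipeline. This record does not state which clustering algorithms, features, or prompts the study used, so everything here is an assumption: the toy disease-to-symptom table is hypothetical, average-linkage agglomerative clustering over Jaccard distance is only a stand-in for "unsupervised machine learning", the 0.7 threshold is arbitrary, and `build_interpretation_prompt` merely assembles text that could be passed to a model such as GPT-4o (no API call is made).

```python
from itertools import combinations

# Hypothetical toy data (illustrative only, not taken from the paper).
DISEASE_SYMPTOMS = {
    "influenza": {"fever", "cough", "fatigue", "muscle ache"},
    "common cold": {"cough", "sore throat", "runny nose", "fatigue"},
    "covid-19": {"fever", "cough", "fatigue", "loss of smell"},
    "gastroenteritis": {"nausea", "vomiting", "diarrhoea", "fever"},
    "food poisoning": {"nausea", "vomiting", "diarrhoea", "abdominal pain"},
}


def jaccard_distance(a, b):
    """1 minus (shared symptoms / all symptoms) for two symptom sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_diseases(disease_symptoms, threshold=0.7):
    """Average-linkage agglomerative clustering (assumed stand-in method).

    Starts with one cluster per disease and repeatedly merges the closest
    pair until the closest pair is farther apart than `threshold`.
    """
    clusters = [[name] for name in sorted(disease_symptoms)]

    def linkage(c1, c2):
        # Mean pairwise Jaccard distance between members of two clusters.
        dists = [
            jaccard_distance(disease_symptoms[x], disease_symptoms[y])
            for x in c1
            for y in c2
        ]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


def build_interpretation_prompt(cluster, disease_symptoms):
    """Assemble a plain-text prompt asking an LLM to explain one cluster."""
    shared = set.intersection(*(disease_symptoms[d] for d in cluster))
    lines = [
        f"- {d}: {', '.join(sorted(disease_symptoms[d]))}" for d in cluster
    ]
    return (
        "Explain in plain language why these diseases were grouped together.\n"
        + "\n".join(lines)
        + f"\nShared symptoms: {', '.join(sorted(shared)) or 'none'}"
    )


if __name__ == "__main__":
    for cluster in cluster_diseases(DISEASE_SYMPTOMS):
        print(build_interpretation_prompt(cluster, DISEASE_SYMPTOMS))
        print()
```

On this toy table the respiratory illnesses end up in one group and the gastrointestinal ones in another. The prompt lists each disease's symptoms alongside the symptoms common to the whole cluster, which is one plausible way to give a language model enough context to produce the kind of simplified explanation the abstract describes.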
Published (Version of record), Open Access, CC BY-NC-ND V4.0

Details

UN Sustainable Development Goals (SDGs)

This output has contributed to the advancement of the following goals:

#3 Good Health and Well-Being