Clustering in multiple long-term conditions: methodological and translational challenges and solutions

Abstract

Clustering, the process of grouping long-term conditions (LTCs) or people based on shared characteristics, disease patterns, risk factors or clinical trajectories, has become a prominent focus of research on multiple long-term conditions (MLTC).^1,2 Its potential for delivering personalised care tailored to specific patient phenotypes is significant, particularly as the burden of MLTC continues to rise globally.^3,4 However, this promise remains largely unfulfilled in clinical practice because of methodological and translational challenges.

How clustering works and its targets in MLTC research

Clustering refers to a set of analytical methods that automatically group similar items together based on shared features. In the context of MLTC, these items can be either people or conditions. Algorithms can compare hundreds of characteristics – such as age, diagnoses, test results or healthcare use – to identify groupings that occur naturally in the data, without researchers specifying them in advance. For example, one cluster might contain people with diabetes, obesity and hypertension who frequently attend hospital, while another might include younger adults with mental health conditions and chronic pain but lower healthcare use. Each cluster therefore represents a group of people with broadly similar health profiles and care needs. The usefulness of clustering lies in its ability to simplify complexity. Rather than treating every patient with multiple conditions as unique, clustering helps clinicians and policymakers understand which patterns commonly occur together, how these groups differ in prognosis and what types of interventions may work best for each group. In practice, this can support risk stratification, inform targeted prevention programmes and guide service design – such as developing integrated clinics for people who fall within high-risk clusters.
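To make the grouping step concrete, the following is a minimal sketch rather than a recommended method. It applies Lloyd's k-means algorithm to six hypothetical people described by five binary condition indicators. The data, the choice of two clusters, the starting centroids and the function names are all illustrative assumptions; real MLTC studies use far richer inputs, and methods designed for binary or mixed data may be more appropriate than k-means.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [list(c) for c in initial_centroids]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each person joins the nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: each centroid becomes the mean profile of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical data. Columns: diabetes, obesity, hypertension, depression, chronic pain.
people = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 1, 1],
]

# Fixed starting centroids keep the example deterministic (an assumption for illustration).
labels, centroids = kmeans(people, initial_centroids=[people[0], people[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each centroid can be read as the prevalence of each condition within its cluster, which is one simple way of describing what a cluster "looks like".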

In MLTC research, clustering usually aims to identify either groups of LTCs, based on their co-occurrence, or clusters of people, based on similarity in their diseases, care needs or healthcare outcomes (also called patient segmentation).^2,5 Of the studies that cluster people, some do so directly based on a measure of similarity,^5–7 while others first cluster LTCs and subsequently assign individuals to clusters according to their conditions.^8,9 The target of clustering should reflect its intended application. For example, LTC clusters can shed light on shared biological mechanisms and inform therapeutic drug discovery,^1,10 whereas clusters of people provide more direct clinical insights, such as for risk stratification or for designing services around the care of specific clusters.^5,9
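The two-step route described above (cluster conditions first, then assign people) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the cited studies: conditions are linked when the Jaccard similarity of their co-occurrence meets an arbitrary threshold of 0.5, linked conditions form a cluster, and each person is assigned to the cluster containing most of their conditions. The records, threshold and function names are hypothetical.

```python
from itertools import combinations

CONDITIONS = ["diabetes", "hypertension", "obesity", "depression", "chronic_pain"]

# Hypothetical records: person identifier -> set of recorded conditions.
records = {
    "p1": {"diabetes", "hypertension", "obesity"},
    "p2": {"diabetes", "hypertension"},
    "p3": {"depression", "chronic_pain"},
    "p4": {"depression", "chronic_pain", "hypertension"},
    "p5": {"obesity", "diabetes"},
}


def jaccard(cond_a, cond_b, records):
    """Share of people with either condition who have both."""
    has_a = {pid for pid, conds in records.items() if cond_a in conds}
    has_b = {pid for pid, conds in records.items() if cond_b in conds}
    union = has_a | has_b
    return len(has_a & has_b) / len(union) if union else 0.0


def cluster_conditions(conditions, records, threshold):
    """Link conditions whose co-occurrence meets the threshold; merge linked groups."""
    cluster_of = {c: {c} for c in conditions}
    for a, b in combinations(conditions, 2):
        if jaccard(a, b, records) >= threshold and cluster_of[a] is not cluster_of[b]:
            merged = cluster_of[a] | cluster_of[b]
            for c in merged:
                cluster_of[c] = merged
    clusters = []
    for c in conditions:
        if cluster_of[c] not in clusters:
            clusters.append(cluster_of[c])
    return clusters


def assign_person(conds, clusters):
    """Assign a person to the condition cluster with the largest overlap (ties: first)."""
    return max(range(len(clusters)), key=lambda i: len(conds & clusters[i]))


condition_clusters = cluster_conditions(CONDITIONS, records, threshold=0.5)
assignments = {pid: assign_person(conds, condition_clusters) for pid, conds in records.items()}
print(condition_clusters)
print(assignments)
```

Person p4 has hypertension alongside depression and chronic pain, yet is placed wholly in the second cluster. This shows how assigning people to a single condition cluster can discard information about conditions that fall outside it.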

Challenges in applying clustering to MLTC

While clustering holds great promise for advancing the understanding and management of MLTC, its translation into clinical practice faces several challenges. These include methodological limitations – such as narrow condition selection and restricted data inputs – as well as issues related to data quality, bias and the absence of meaningful patient and carer involvement. Addressing these challenges requires not only technical advances, including the use of artificial intelligence and longitudinal data, but also improvements in data linkage, inclusivity and research co-design. The following sections outline key challenges and opportunities to overcome them.

Choice of LTCs

A weakness of many clustering approaches in MLTC research is the focus on common conditions only.¹¹ Although individually uncommon, rare diseases are collectively common, affecting 1 in 17 people in the United Kingdom.¹² Rare conditions can have a disproportionate impact on people living with them, often compounded by a lack of knowledge of the disorder by health professionals, which in turn can negatively affect mental health.¹³ People with rare conditions may also experience complex interactions between their rare and more prevalent diseases that require differing treatment approaches. For example, effective management of diabetes or cardiovascular disease in cystic fibrosis must consider the unique challenges posed by the combination.¹⁴ To improve clinical applicability, clustering approaches should include rare conditions, to better capture the diversity of people’s experiences.

Broadening data inputs

Another weakness is a focus on the presence or absence of LTCs as binary health states. Disease severity (such as HbA1c or blood pressure in diabetes), symptoms and impact (such as functional impairment after stroke) are rarely accounted for but substantially influence care needs. Beyond clinical data, clustering often overlooks other crucial factors affecting health. Most studies focus on demographics, biomarkers, genetic predispositions and disease-specific characteristics, but MLTC is influenced by a complex array of social, psychological and lifestyle factors (such as diet, physical activity, smoking and alcohol consumption) that affect disease progression, care needs and health outcomes.¹⁵ For example, socioeconomic status, physical and mental well-being and access to healthcare play a major role in the management of conditions such as diabetes and chronic obstructive pulmonary disease.^16,17 The predominant focus on biological phenotypical factors fails to capture the real-world complex care needs of people with MLTC, limiting the practical value of clusters for personalised patient care.
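As an illustration of why broader inputs matter, the sketch below computes a Gower-style distance, which averages a range-scaled difference for numeric variables and a simple mismatch for categorical ones. It is only one of several ways to combine mixed data types, and the three people, the variable names and the values are hypothetical.

```python
def numeric_ranges(people, numeric_keys):
    """Observed range (max minus min) of each numeric variable."""
    return {
        key: max(p[key] for p in people) - min(p[key] for p in people)
        for key in numeric_keys
    }


def gower_distance(a, b, ranges):
    """Average per-variable dissimilarity, each scaled to lie between 0 and 1."""
    total = 0.0
    for key in a:
        if key in ranges:
            # Numeric variable: absolute difference scaled by the observed range.
            total += abs(a[key] - b[key]) / ranges[key] if ranges[key] else 0.0
        else:
            # Binary or categorical variable: 0 if equal, 1 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical people who all share the same binary diagnosis.
anna = {"diabetes": True, "hba1c_mmol_mol": 48, "smoking": "never", "lives_alone": False}
ben = {"diabetes": True, "hba1c_mmol_mol": 90, "smoking": "current", "lives_alone": True}
cara = {"diabetes": True, "hba1c_mmol_mol": 50, "smoking": "never", "lives_alone": False}

ranges = numeric_ranges([anna, ben, cara], ["hba1c_mmol_mol"])

d_anna_cara = gower_distance(anna, cara, ranges)
d_anna_ben = gower_distance(anna, ben, ranges)
print(round(d_anna_cara, 3), round(d_anna_ben, 3))
```

On the diagnosis alone the three people are indistinguishable. Once severity, lifestyle and social circumstances are included, Anna and Cara remain close while Ben is clearly separated.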

The limited availability of such data in routinely collected health records compounds this challenge, and numerous technical and governance barriers make it harder to address. Despite some examples of regional success, national mechanisms for linkage of NHS, voluntary sector and local authority social care records are lacking. Nor is there yet a single coding or classification system capable of facilitating streamlined data analysis across all the record systems (e.g. GP, hospital, social care), conceptual domains (e.g. diagnosis, treatment, functioning) and patient-reported data involved. Realising the potential of clustering requires greater linkages with non-health-related data, including social care and patient-reported information.¹⁸

Opportunities using AI

Artificial intelligence (AI) could help address some of these limitations. Machine learning algorithms enable the analysis of large, complex datasets, including many data inputs, allowing for the creation of clusters that better reflect the range of factors affecting people’s health.¹⁹ Similarly, data limitations may in future be partly addressed by natural language processing (NLP) methods capable of extracting information from the richer unstructured data which make up the bulk of information entered during clinical encounters.^20,21 However, several challenges are associated with AI. First, incomplete or biased data can affect the reliability and fairness of AI algorithms, and it remains unclear whether the use of NLP with unstructured data can address this or may exacerbate bias.^22,23 Second, the transparency of AI models and the risk of bias remain a concern. Given the potential for AI-based clustering to perpetuate existing biases in healthcare, assessment of fairness must be considered in algorithm design, including assessment of equitable performance across population groups. Clinicians and patients also require clear explanations of how AI algorithms reach their conclusions to promote trust and support their safe and effective implementation in clinical practice, an area in which research is evolving.
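One simple form that an assessment of equitable performance could take is sketched below. It assumes a "high-risk" cluster is used to flag people for intervention and compares, across two population groups, the proportion of people later admitted to hospital who had been flagged (sensitivity). The groups, counts and field names are hypothetical, and sensitivity is only one of many fairness-relevant measures.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Among people who had the outcome, the share flagged as high risk, per group."""
    events = defaultdict(int)
    flagged_events = defaultdict(int)
    for r in records:
        if r["admitted"]:
            events[r["group"]] += 1
            if r["high_risk_cluster"]:
                flagged_events[r["group"]] += 1
    return {group: flagged_events[group] / events[group] for group in events}


def make_records(group, high_risk_cluster, admitted, n):
    """Helper to build n identical hypothetical records."""
    return [
        {"group": group, "high_risk_cluster": high_risk_cluster, "admitted": admitted}
        for _ in range(n)
    ]


# Hypothetical data: both groups have four admissions, but flagging differs.
records = (
    make_records("A", True, True, 3)
    + make_records("A", False, True, 1)
    + make_records("A", False, False, 4)
    + make_records("B", True, True, 1)
    + make_records("B", False, True, 3)
    + make_records("B", False, False, 4)
)

sensitivity = sensitivity_by_group(records)
gap = max(sensitivity.values()) - min(sensitivity.values())
print(sensitivity, gap)
```

A large gap, as in this constructed example, would suggest that the cluster identifies future need less well in one group and would warrant investigation before any clinical use.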

Incorporating patient and informal carer priorities

Another weakness limiting the clinical impact of MLTC clustering is insufficient consideration of what matters to patients and informal carers regarding disease burden, interactions and associations with adverse outcomes. This is vital given that the experiences of people living with MLTC are often misunderstood by healthcare staff,²⁴ and patient-centred care often falls short due to this mismatch.²⁵ Furthermore, the experiences of informal caregivers (such as family, friends or neighbours) are often overlooked, and carer priorities may differ from those of patients.²⁶ Patients and their carers must be involved throughout the research process, from selecting relevant inputs and outcomes to interpreting clusters and co-designing subsequent interventions, to ensure that clustering genuinely informs patient-centred care.

Adopting new methodologies

Advances in clustering methods may enable greater clinical impact. Unsupervised clustering typically relies on either patient characteristics (such as demographics and clinical conditions) or health outcomes (such as healthcare utilisation or incidence of a new condition). In the first case, people within the same cluster may appear similar, but their outcomes may diverge over time, limiting practical relevance to understanding future care needs. In the second case, people with similar outcomes may have disparate conditions, making it challenging to understand the mechanisms underlying the clustering and to tailor management to a given cluster. Semi-supervised or outcome-aware clustering methods, which are informed by future clinical events such as hospital admissions or mortality,^27,28 are promising approaches that could enhance the clinical relevance of MLTC clusters by balancing current characteristics and future outcomes.^19,29
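The balance between current characteristics and future outcomes can be illustrated with a weighted objective. The sketch below is a deliberate simplification and not the method of any cited study: it scores a candidate partition by combining within-cluster variance of the characteristics with within-cluster variance of the outcome, using a hypothetical weight between 0 and 1. The six people, their features and their outcomes are invented for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def within_cluster_variance(values, labels):
    """Mean squared deviation of each value from the mean of its own cluster."""
    total = 0.0
    for k in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == k]
        centre = mean(members)
        total += sum((v - centre) ** 2 for v in members)
    return total / len(values)


def combined_score(features, outcomes, labels, weight):
    """Lower is better. weight=0 uses characteristics only; weight=1 uses outcomes only."""
    feature_term = mean([within_cluster_variance(col, labels) for col in zip(*features)])
    outcome_term = within_cluster_variance(outcomes, labels)
    return (1 - weight) * feature_term + weight * outcome_term


# Hypothetical data. Feature columns: diabetes, depression. Outcome: admitted within a year.
features = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
outcomes = [1, 1, 0, 0, 0, 1]

by_characteristics = [0, 0, 0, 1, 1, 1]  # groups people with identical conditions
by_outcome = [0, 0, 1, 1, 1, 0]          # groups people with identical outcomes

for weight in (0.2, 0.8):
    print(
        weight,
        round(combined_score(features, outcomes, by_characteristics, weight), 4),
        round(combined_score(features, outcomes, by_outcome, weight), 4),
    )
```

With a low weight the partition based on characteristics scores better, and with a high weight the partition based on outcomes does. An outcome-aware method would search for partitions that perform acceptably on both terms, and the choice of weight is itself a design decision that should reflect the intended application.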

MLTC is a dynamic process that evolves over time, but most clustering studies rely on cross-sectional data, which provide a snapshot of people’s characteristics at a single point in time, without considering the order in which conditions developed.³⁰ Longitudinal clustering using data sources which track people over time, either prospectively or retrospectively, can provide a deeper understanding of how people transition across clusters as they age, and how their experiences and needs change.⁶ AI algorithms that take the order of recorded events as input, such as transformers, can predict patient outcomes more accurately than methods relying only on static information.^31–33 They can also be integrated into clustering pipelines to produce clusters that better anticipate future care needs and inform long-term care.³⁴
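A simple way to describe how people move between clusters over time is a transition table. The sketch below assumes each person has already been assigned a cluster label at successive time points and estimates the proportion moving from one cluster to another between consecutive time points. The cluster names and trajectories are hypothetical, and this is a descriptive summary rather than a longitudinal clustering method in itself.

```python
from collections import Counter


def transition_counts(trajectories):
    """Count moves between cluster labels at consecutive time points."""
    counts = Counter()
    for sequence in trajectories.values():
        for current, following in zip(sequence, sequence[1:]):
            counts[(current, following)] += 1
    return counts


def transition_probabilities(counts):
    """Convert counts into the proportion leaving each cluster for each destination."""
    totals = Counter()
    for (current, _), n in counts.items():
        totals[current] += n
    return {(current, following): n / totals[current] for (current, following), n in counts.items()}


# Hypothetical cluster membership at three successive time points.
trajectories = {
    "p1": ["cardiometabolic", "cardiometabolic", "cardiometabolic+renal"],
    "p2": ["cardiometabolic", "cardiometabolic+renal", "cardiometabolic+renal"],
    "p3": ["mental health", "mental health", "mental health"],
    "p4": ["mental health", "cardiometabolic", "cardiometabolic"],
}

counts = transition_counts(trajectories)
probabilities = transition_probabilities(counts)
for pair, p in sorted(probabilities.items()):
    print(pair, round(p, 2))
```

In this constructed example, half of the observed steps out of the cardiometabolic cluster lead to the cardiometabolic and renal cluster, the kind of pattern that cross-sectional clustering cannot reveal.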

Implementation: embedding translation from the start

While AI offers promise, a translational gap remains between clustering research and its implementation into clinical practice, without clear examples of data-driven clustering informing changes to models of care. One explanation is the methodological focus on the clusters themselves, rather than the practical challenges of implementing cluster-based interventions. Current clustering research is often exploratory, without a clear path to clinical integration, which limits actionable findings. Addressing this requires the objectives to be explicitly defined from the outset, guiding the selection of data inputs, choice of conditions and validation. While validation strategies such as assessing stability and clinical plausibility are important,³⁵ demonstrating the utility of clusters in the real world provides the most impactful validation. Without such evidence of clinical effectiveness, healthcare providers may be reluctant to adopt clustering-informed care models.
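Stability, one of the validation strategies mentioned above, is often assessed by re-running the clustering (for example on a resample) and measuring how far the two partitions agree. The sketch below computes the plain Rand index, the proportion of pairs of people on which two partitions agree about being together or apart. The label vectors are hypothetical stand-ins for two runs on the same six people; in practice a chance-corrected variant such as the adjusted Rand index is commonly preferred.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Proportion of pairs that both partitions treat the same way (together or apart)."""
    agreements = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        agreements += together_a == together_b
        pairs += 1
    return agreements / pairs


# Hypothetical cluster labels for the same six people from three runs.
original = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]  # same grouping, different label names
perturbed = [0, 0, 1, 1, 1, 1]   # one person has moved cluster

print(rand_index(original, relabelled))           # 1.0
print(round(rand_index(original, perturbed), 3))  # 0.667
```

The index depends only on who is grouped with whom, so renaming clusters does not affect it. Agreement between runs is, however, a statement about reproducibility and not about clinical usefulness.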

Co-designing implementation and evaluation strategies with patients, clinicians and service managers should be considered from the outset. Practical considerations include the feasibility of changes to clinical workflows, such as integrating new tools, developing training for healthcare professionals and implementing different models of care. Rigorous evaluation also requires access to high-quality clinical data, which may be hindered by organisational barriers to data sharing and linkage. Without appropriate resources and infrastructure to pilot and evaluate cluster-based interventions, clustering will remain a theoretical construct rather than a practical tool for improving care.

Recommendations

Expand clustering data inputs, integrating clinical, lifestyle, psycho-social and patient-reported factors.

Enable integration and linkage of large population-level data sets.

Include rare conditions in clustering analyses.

Embed patient and informal carer perspectives throughout all stages of the research.

Develop semi-supervised, outcome-aware clustering algorithms that produce clusters with similar characteristics and prognostic information.

Advance methodologies for generating clusters incorporating longitudinal information.

Plan implementation and evaluation pathways from the outset of research design.

Conclusion

Clustering has considerable potential to improve MLTC management through personalised and targeted care. Realising this potential requires overcoming methodological limitations by reducing reliance on biological factors alone, including rare conditions in analyses and actively incorporating patient and carer perspectives. While AI offers opportunities to address some of these challenges, including generation of outcome-aware and longitudinal clusters, issues surrounding data quality, algorithm transparency and clinical implementation must be addressed. To bridge the gap between research and clinical application, future research should prioritise developing more inclusive, longitudinal and person-centred clustering models and overcoming the real-world barriers to their effective use.

Footnotes

Acknowledgements

TB is supported by the National Institute for Health and Care Research Imperial Biomedical Research Centre. HDM has received funding from the National Institute for Health and Care Research under Artificial Intelligence for Multiple Long-Term Conditions (‘AIM’) for the study ‘The development and validation of population clusters for integrating health and social care: A mixed-methods study on multiple long-term conditions’ (NIHR202637). HDM and KK are supported by the National Institute for Health and Care Research ‘Multiple Long-Term Conditions (MLTC) Cross NIHR Collaboration (CNC)’ (NIHR207000). The views expressed in this publication are those of the author(s) and not necessarily those of the NHS, the National Institute for Health Research or the Department of Health and Social Care.

Declarations

ORCID iDs:

Thomas Beaney

Alex Dregan

Myer Glickman

Kamlesh Khunti

Data availability:

Not applicable.

Use of generative AI:

No generative AI was used during the preparation of this manuscript.

References

1. Whitty CJM, Watt. Map clusters of diseases to tackle multimorbidity. Nature 2020; 579: 494–496.

2. Busija, Lim, Szoeke, Sanders, McCabe MP. Do replicable profiles of multimorbidity exist? Systematic review and synthesis. Eur J Epidemiol 2019; 34: 1025–1053.

3. The Academy of Medical Sciences. Multimorbidity: a priority for global health research. https://acmedsci.ac.uk/file-download/82222577 (2018, accessed 17 July 2025).

4. Chowdhury, Chandra Das, Sunna, Beyene, Hossain. Global and regional prevalence of multimorbidity in the adult population in community settings: a systematic review and meta-analysis. EClinicalMedicine 2023; 57: 101860.

5. Yan, Kwan, Tan, Thumboo, Low LL. A systematic review of the clinical application of data-driven population segmentation analysis. BMC Med Res Methodol 2018; 18: 121.

6. Smith, Beaney, Hockham, Elliott, Downey, et al. Identifying clusters of people with multiple long-term conditions using large language models: a population-based study. NPJ Digit Med 2025; 8: 453. DOI: 10.1101/2025.02.14.25322277.

7. Robertson, Vieira, Butler, Johnston, Sawhney, Black. Identifying multimorbidity clusters in an unselected population of hospitalised patients. Sci Rep 2022; 12: 5134.

8. Fagbamigbe, Agrawal, Azcoaga-Lorenzo, MacKerron, Özyiğit, Alexander, et al. Clustering long-term health conditions among 67728 people with multimorbidity using electronic health records in Scotland. PLOS One 2023; 18: e0294666.

9. Beaney, Clarke, Salman, Woodcock, Majeed, Barahona, Aylin. Assigning disease clusters to people: a cohort study of the implications for understanding health outcomes in people with multiple long-term conditions. J Multimorb Comorbidity 2024; 14: 26335565241247430.

10. Hameed, Verspoor, Kusljic, Halgamuge. A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration. BMC Bioinform 2018; 19: 129.

11. IS-S, Azcoaga-Lorenzo, Akbari, Black, Davies, Hodgins, et al. Examining variation in the measurement of multimorbidity in research: a systematic review of 566 studies. Lancet Public Health 2021; 6: e587–e597.

12. Department of Health & Social Care. The UK Rare Diseases Framework. https://www.gov.uk/government/publications/uk-rare-diseases-framework/the-uk-rare-diseases-framework (2021, accessed 17 July 2025).

13. Spencer-Tansley, Meade, Ali, Simpson, Hunter. Mental health care for rare disease in the UK – recommendations from a quantitative survey and multi-stakeholder workshop. BMC Health Serv Res 2022; 22: 648.

14. Ode, Chan, Granados, Moheet, Moran, Brennan, et al. Cystic fibrosis related diabetes: medical management. J Cyst Fibros 2019; 18: S10–S18.

15. Fortin, Haggerty, Almirall, Bouhali, Sasseville, Lemieux. Lifestyle factors and multimorbidity: a cross sectional study. BMC Public Health 2014; 14: 686.

16. Kilvert, Fox. Health inequalities and diabetes. Pract Diabetes 2023; 40: 19–24a.

17. Gershon, Dolmage, Stephenson, Jackson. Chronic obstructive pulmonary disease and socioeconomic status: a systematic review. COPD J Chronic Obstr Pulm Dis 2012; 9: 216–226.

18. Edwards. UK research data resources based on primary care electronic health records: review and summary for potential users. BJGP Open 7(3).

19. Gao CX, Filia, Bayer, Bergmeir. An overview of clustering methods with guidelines for application in mental health research. Psychiatry Res 2023; 327: 115265.

20. Shemtob, Beaney, Norton, Majeed. How can we improve the quality of data collected in general practice? BMJ 2023; 380: e071950.

21. Yang, Chen, PourNejatian, Shin, Smith, Parisien, et al. A large language model for electronic health records. NPJ Digit Med 2022; 5: 1–9.

22. Beaney, Clarke, Salman, Woodcock, Majeed, Barahona, Aylin, et al. Identifying potential biases in code sequences in primary care electronic healthcare records: a retrospective cohort study of the determinants of code frequency. BMJ Open 2023; 13: e072884.

23. Ali, Lawson, Wood, Khunti. Addressing ethnic and global health inequalities in the era of artificial intelligence healthcare models: a call for responsible implementation. J R Soc Med 2023; 116: 260–262.

24. Holland, Matthews, Macdonald, Ashworth, Laidlaw, Cheung KSY, et al. The impact of living with multiple long-term conditions (multimorbidity) on everyday life – a qualitative evidence synthesis. BMC Public Health 2024; 24: 3446.

25. Bellass, Scharf, Errington, Davies, Robinson, Runacres, et al. Experiences of hospital care for people with multiple long-term conditions: a scoping review of qualitative research. BMC Med 2024; 22: 25.

26. Kuluski, Peckham, Gill, Gagnon, Wong-Cornall, McKillop, et al. What is important to older people with multimorbidity and their caregivers? Identifying attributes of person centered care from the user perspective. Int J Integr Care 2019; 19: 4.

27. Ghasemi, Khorshidi, Aickelin. Multi-objective semi-supervised clustering for finding predictive clusters. Expert Syst Appl 2022; 195: 116551.

28. Bair. Semi-supervised clustering methods. Wiley Interdiscip Rev Comput Stat 2013; 5: 349–361.

29. Huang, Liu, Steel PAD, Axsom, Lee, Tummalapalli, et al. Deep significance clustering: a novel approach for identifying risk-stratified and predictive patient subgroups. J Am Med Inform Assoc 2021; 28: 2641–2653.

30. Cezard, McHale, Sullivan, Bowles JKF, Keenan. Studying trajectories of multimorbidity: a systematic scoping review of longitudinal approaches and evidence. BMJ Open 2021; 11: e048485.

31. Rao, Solares JRA, Hassaine, Ramakrishnan, Canoy, et al. BEHRT: transformer for electronic health records. Sci Rep 2020; 10: 7155.

32. Rasmy, Xiang, Xie, Tao, Zhi. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit Med 2021; 4: 1–13.

33. Beaney, Jha, Alaa, Smith, Clarke, Woodcock, et al. Comparing natural language processing representations of coded disease sequences for prediction in electronic health records. J Am Med Inform Assoc 2024; 31: 1451–1462.

34. Qiu, Erzurumluoglu, Braenne, Whitehurst, et al. Deep representation learning for clustering longitudinal survival data from electronic health records. Nat Commun 2025; 16: 2534.

35. Dhafari, Pate, Azadbakht, Bailey, Rafferty, Jalali-Najafabadi, et al. A scoping review finds a growing trend in studies validating multimorbidity patterns and identifies five broad types of validation methods. J Clin Epidemiol 2024; 165: 111214.