Abstract
In this opinion piece, we argue that sex- and gender-based equity must become a foundational criterion in the design and implementation of digital patient twins. Digital patient twins offer a promising avenue for precision medicine by simulating individual health states and treatment responses. However, their clinical utility and fairness depend on whether diverse patient populations are adequately represented and accounted for in the data and devices on which these models are built. Drawing on evidence from cardiology, endocrinology, mental health, and medical device research, this article shows how current digital patient twin initiatives often overrepresent male, white, and socioeconomically privileged populations, while women, gender-diverse individuals, and people of color remain underrepresented. These imbalances can lead to systematic misdiagnoses, misinterpretation of physiological variation, and measurement inaccuracies. Documented examples include under-recognition of heart failure with preserved ejection fraction in women, omission of menstrual cycle-related changes in glycemic control, underdiagnosis of depression in women by speech-based AI models, and oxygen saturation overestimation in patients with darker skin tones. We argue that these disparities are rooted in structural biases in clinical research and are perpetuated when sex- and gender-specific variables, intersectional factors, and subgroup validation are absent from model design. Addressing these limitations requires balanced data representation, integration of sex- and gender-informed knowledge, participatory design with diverse patient groups, subgroup performance testing, transparent reporting, and mitigation of device-related bias. We contend that these are not optional refinements but prerequisites for realizing the promise of personalized care without reproducing or deepening existing health inequities.
Introduction
Digital patient twins (DPTs) are virtual models that integrate multi-modal data—from imaging and genomics to continuous sensor streams—to simulate an individual's physiological state and disease trajectory. They support precision medicine by allowing clinicians and patients to test “what-if” scenarios and tailor treatments to the person rather than the population average. However, the clinical value and ethical legitimacy of DPTs depend on the quality of the data and assumptions upon which they are built. Machine learning algorithms, often integral to DPT architectures, can replicate and intensify sex, gender, racial, and socioeconomic biases when trained on unbalanced datasets. 1 Historically, medicine has used male bodies as the default model, and merely including women without analyzing sex-specific differences does not correct this bias. 2 Clinical AI systems risk perpetuating such structural legacies when their design overlooks social and cultural dynamics embedded in the data. 1 When training data lack demographic breadth, algorithmic outputs may lead to misclassification, inappropriate treatment decisions, and reduced clinical safety.
Several studies of DPT initiatives point to a persistent gender data gap. Many models rely on datasets that overrepresent male, white, and affluent populations, while women, trans and non-binary persons, and other marginalized groups remain underrepresented. For example, a philosophical and bioethical examination of trans-specific digital twins highlights that underlying medical datasets predominantly reflect the “male, cis-gendered body,” leaving female and queer individuals largely invisible. 3 Likewise, a scoping review of digital twins warns that biased health datasets can be heavily skewed (by gender or race) and generate suboptimal treatment recommendations. 4 These concerns echo findings from broader studies on AI in medicine, which show that algorithms trained predominantly on male populations tend to misidentify heart attacks in women and that 81% of genome-wide association data derive from people of European ancestry, limiting their predictive accuracy for more diverse populations. 5 These patterns point not only to demographic imbalance but also to a deeper misalignment between precision medicine's promise of individualized care and the andro- and Eurocentric design of its computational foundations.
The root of the problem often lies in upstream data sources. Clinical trials and biomedical datasets have historically excluded women and gender-diverse individuals. A systematic review reports persistent underrepresentation of women across multiple research areas and a lack of sex-stratified analyses, despite higher rates of adverse drug reactions in female patients. 6 Early-phase trials continue to exclude women due to perceived reproductive risks and the lingering assumption that male physiology represents the human standard. 7 A meta-analysis of machine-learning studies in rheumatoid arthritis found that most papers failed to acknowledge sex bias; only three assessed model performance by sex, and none implemented any corrections. 8 Because DPTs often draw on these sources, their clinical performance may be selectively reliable: optimized for some, but not for all.
Addressing these limitations requires intersectional and inclusive frameworks. A lexicon for “digital health diversity” stresses that categories such as gender, ethnicity, age, and sexual identity must be considered jointly to tackle digital inequity. It also calls for greater attention to social determinants of health, such as access to technology and digital literacy. 9 Empirical work demonstrates that sex-specific models yield more accurate predictions: for instance, classifiers trained for women outperformed generic models in predicting antiarrhythmic drug response. Katsoulakis et al. emphasize that fair and representative datasets, transparent governance, and protections against socioeconomic inequalities are essential to prevent DPTs from widening health disparities. 4 Similarly, research on digital twins for female pelvic floor conditions confirms their utility, provided that privacy and access are adequately addressed. 10 These examples illustrate how inclusion, when operationalized, can enhance the precision and safety of DPTs.
In sum, many current DPT systems are limited by narrow population assumptions and inherited data structures. Realizing the promise of truly personalized medicine requires the systematic integration of sex- and gender-specific data, socioeconomic context, intersectional perspectives, and co-creative development processes. This entails active involvement of diverse patient groups in data acquisition, model training, and validation to ensure fairness and inclusivity in future DPT systems.
As an opinion piece, this article draws on evidence from cardiology, endocrinology, mental health, and medical device research to show how demographic biases manifest in DPT-relevant domains. We argue that equity must be embedded from the outset, across design, validation, and governance, if DPTs are to fulfill their promise for all patients.
Evidence across clinical domains
Sex differences in cardiovascular disease
Heart failure is not a single disease but comprises multiple phenotypes. Heart failure with preserved ejection fraction (HFpEF) is more prevalent in women than in men and is associated with distinct risk factors such as obesity, hypertension, pregnancy-related complications, and hormonal changes. 11 Women represent approximately 55% of HFpEF patients, but only 29% of those with heart failure with reduced ejection fraction. 11 Historically, clinical trials and registries have under-enrolled women, leading to predictive models and therapeutic guidelines that are primarily based on male-dominant data. If DPTs for heart failure rely on these datasets, they may underrecognize HFpEF and misclassify women's symptoms.
Hormonal cycles and glycemic control
In women with type 1 diabetes, insulin dosing is typically adjusted based on glucose levels, dietary intake, and physical activity. However, hormonal fluctuations over the menstrual cycle also affect how the body responds to insulin and how stable glucose levels remain, factors that are often not considered in either clinical care or model development. In a prospective study using continuous glucose monitoring, glucose levels were higher during the luteal phase of the cycle. At the same time, participants' bodies responded less effectively to insulin, that is, their insulin sensitivity decreased, even though total daily insulin doses remained stable. 12 Further evidence from a simulation-based analysis confirmed this pattern. Researchers used the euglycemic clamp, a gold-standard method for measuring how much insulin is needed to keep blood glucose stable. The results showed that during the luteal phase, the glucose infusion rate, a proxy for insulin sensitivity, was significantly lower. Accordingly, participants spent more time with glucose levels over 180 mg/dl, whereas the recommended postprandial target is typically below 140 mg/dl at 2 h after a meal. When algorithms were informed by cycle-related hormonal variations in insulin sensitivity, the model's performance improved, and glucose stability across the cycle increased. 13 A systematic review supports these findings, noting that a subset of women with type 1 diabetes consistently experiences higher blood glucose levels and reduced insulin sensitivity in the luteal phase compared to the first half of the cycle (the follicular phase). 14
These results imply that DPTs for diabetes that do not include menstrual cycle information may misinterpret hormone-driven blood sugar fluctuations as poor adherence or lifestyle-related failure. This can lead to unnecessary insulin increases and a higher risk of treatment-related complications, such as a reduction in blood glucose concentration (<60 mg/dl; hypoglycemia). Incorporating cycle-phase data and sex-specific physiology into model inputs, along with sex-stratified validation, can ensure that recommendations are personalized and help prevent iatrogenic hypoglycemia. 12,13
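To make this concrete, the following is a minimal sketch of how cycle-phase information could enter a dosing calculation. The phase labels, sensitivity multipliers, and function names are our own illustrative assumptions, not parameters from the cited studies; a real model would estimate such effects per patient from monitoring and dosing data.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical relative insulin-sensitivity multipliers by cycle phase.
# Placeholder values: in practice these would be estimated per patient.
PHASE_SENSITIVITY = {
    "follicular": 1.00,  # reference phase
    "luteal": 0.85,      # assumed reduced insulin sensitivity
}

@dataclass
class PatientState:
    glucose_mg_dl: float        # current sensor glucose reading
    carbs_g: float              # planned carbohydrate intake
    cycle_phase: Optional[str]  # "follicular", "luteal", or None if untracked

def suggested_bolus(state: PatientState, isf: float = 50.0, icr: float = 10.0,
                    target_mg_dl: float = 120.0) -> float:
    """Toy bolus calculator that scales insulin sensitivity by cycle phase.

    isf: insulin sensitivity factor (mg/dl lowered per unit of insulin);
    icr: insulin-to-carbohydrate ratio (grams covered per unit).
    If the phase is untracked, the model silently assumes the reference
    phase, which is exactly the failure mode discussed in the text.
    """
    multiplier = PHASE_SENSITIVITY.get(state.cycle_phase, 1.0)
    correction = max(0.0, (state.glucose_mg_dl - target_mg_dl) / (isf * multiplier))
    meal = state.carbs_g / (icr * multiplier)
    return round(correction + meal, 1)

# The same reading yields a larger suggested dose in the luteal phase
# than under the default, phase-blind assumption.
print(suggested_bolus(PatientState(180.0, 40.0, "luteal")))  # 6.1
print(suggested_bolus(PatientState(180.0, 40.0, None)))      # 5.2
```

The point of the sketch is the silent fallback: a phase-blind model under-corrects luteal-phase readings, and if clinicians respond by permanently raising doses, the same settings over-correct in the follicular phase, risking precisely the iatrogenic hypoglycemia described above.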
Mental health AI and gendered patterns of detection
Recent work in computational psychiatry has highlighted that AI systems used to assess mental health often rely on data such as voice recordings, written text, facial expressions, and/or physiological signals. These inputs, however, are shaped by gender, language use, cultural norms, and social context. 15 A recent review of bias in mental health AI warns that emotional expression varies significantly across gender, race, and ethnicity. Tools based on natural language processing (NLP), which aim to detect mood or distress from spoken or written language, often perform differently across demographic groups. 15 For instance, automated speech transcription methods, which are often used as input features, produce more word-detection errors for women and minorities than for non-Hispanic white men. 15 A study found that speech-based recognition systems flagged equal numbers of men and women as being at risk of depression, even though the women in the sample had higher depression rates. This suggests that the system underdiagnosed distress in women, likely due to misaligned language features or model assumptions. Recognizing these limitations, researchers in mental health AI have argued that a one-size-fits-all approach is unrealistic. Instead, models may need to be tailored to specific population groups and trained on datasets that include relevant demographic variables. This allows the system to learn population-specific associations between behavior and mood 16 and to detect risk more equitably across groups.
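The failure pattern just described can be made visible with a simple audit. The sketch below, using hypothetical labels chosen only to mirror the reported pattern, compares each group's flag rate with its observed prevalence and recall: equal flag rates across groups with unequal prevalence imply unequal recall.

```python
import numpy as np

def subgroup_audit(y_true, y_flag, groups):
    """Report prevalence, flag rate, and recall per demographic group."""
    y_true, y_flag, groups = map(np.asarray, (y_true, y_flag, groups))
    for g in np.unique(groups):
        m = groups == g
        prevalence = y_true[m].mean()              # observed rate of the condition
        flag_rate = y_flag[m].mean()               # how often the model raises a flag
        recall = y_flag[m & (y_true == 1)].mean()  # flagged among true cases
        print(f"{g}: prevalence={prevalence:.2f}  "
              f"flag_rate={flag_rate:.2f}  recall={recall:.2f}")

# Hypothetical labels mirroring the pattern described above: the model
# flags the same share of each group despite higher prevalence among women.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_flag = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
groups = ["women"] * 10 + ["men"] * 10
subgroup_audit(y_true, y_flag, groups)
# men:   prevalence=0.20  flag_rate=0.20  recall=1.00
# women: prevalence=0.40  flag_rate=0.20  recall=0.25
```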
These insights are directly relevant for DPTs designed to simulate mental health trajectories. If such models do not account for differences in gendered communication styles, symptom presentation, and social context, they may fail to recognize clinically significant patterns. Ensuring that training datasets are balanced by sex and gender, that they include behavioral and physiological markers specific to different groups, and that model outputs are calibrated accordingly can help prevent under- or over-diagnosis. 15 Without such safeguards, divergence from model expectations may be wrongly attributed to patient non-compliance or lead to incorrect diagnostic labeling, rather than being understood as a limitation of the system's demographic sensitivity.
Device and measurement biases across skin tones
Digital twins depend not only on clinical and observational data but also on accurate continuous inputs from sensor-based devices. These measurements are often treated as objective, yet the devices themselves can introduce demographic bias. One widely discussed example is the pulse oximeter, a device used to estimate blood oxygen saturation (SpO₂). Studies have shown that pulse-oximeter readings tend to be systematically higher in individuals with darker skin compared to direct arterial blood gas measurements. A review of 22 studies found that this overestimation was consistent and that the variability was greater in Black patients than in white patients. 17 As a result, some Black patients with dangerously low oxygen levels, below 88% arterial saturation, still displayed normal or near-normal SpO₂ values (between 92% and 96%) on the device. 17
If DPTs integrate sensor data without accounting for such measurement bias, they may misjudge a patient's oxygenation status. This could lead to inappropriate clinical recommendations, such as insufficient oxygen therapy or missed escalation of care. Moreover, since pulse oximeters are often used in continuous monitoring and as triggers for clinical interventions, these biases can become embedded in DPT logic without ever being questioned. Ensuring that DPTs function equitably across populations therefore requires more than diverse training data. It also means scrutinizing the hardware inputs themselves. Device calibration, validation across skin tones, and the use of alternative measurement technologies should be considered part of demographic robustness: not afterthoughts, but foundational design requirements.
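One way such scrutiny could be operationalized is to stop treating a pulse-oximeter reading as a point value and instead propagate group-dependent bias bounds into DPT logic. The sketch below is illustrative only: the bias bounds, group labels, and threshold are placeholder assumptions, and real bounds would have to come from device validation studies stratified by skin tone.

```python
# Hypothetical bias bounds (reading minus true arterial saturation, in
# percentage points) per patient group. Illustrative placeholders only:
# real bounds must come from device validation studies.
BIAS_BOUNDS = {
    "group_A": (-1.0, 1.0),  # device validated well for this group
    "group_B": (-1.0, 5.0),  # documented tendency to overestimate
}

def spo2_interval(reading: float, bias_bounds: tuple) -> tuple:
    """Turn a point SpO2 reading into a conservative interval of
    plausible true arterial saturations."""
    low_bias, high_bias = bias_bounds
    return (reading - high_bias, reading - low_bias)

def occult_hypoxemia_possible(reading: float, bias_bounds: tuple,
                              threshold: float = 88.0) -> bool:
    """Alert on the *lower* bound of the interval instead of trusting
    the point reading, so documented overestimation cannot hide
    dangerously low saturation."""
    lower, _ = spo2_interval(reading, bias_bounds)
    return lower < threshold

reading = 92.0  # looks normal at face value
for group, bounds in BIAS_BOUNDS.items():
    print(group, spo2_interval(reading, bounds),
          occult_hypoxemia_possible(reading, bounds))
# group_A (91.0, 93.0) False -> reading can be taken at face value
# group_B (87.0, 93.0) True  -> occult hypoxemia cannot be ruled out
```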
Equity by design: Strategies for demographically robust DPTs
These examples demonstrate that demographic variation is not a marginal concern in DPT development. When models are trained on narrow datasets, or when measurement tools systematically misrepresent specific groups, the resulting systems may fail entire patient populations. Addressing these limitations requires more than technical refinement. We therefore argue for a systematic shift in how data are collected, how models are developed and validated, and how fairness is defined and operationalized across the DPT lifecycle. To move from diagnostic critique to actionable reform, we outline in this section six strategies for designing DPTs that are not only technically sophisticated but also socially responsive, ethically responsible, and clinically equitable:

1. Balanced data representation: assemble training datasets in which women, gender-diverse individuals, people of color, and socioeconomically disadvantaged groups are adequately represented.
2. Integration of sex- and gender-informed knowledge: encode known physiological differences, such as cycle-related variation in insulin sensitivity or sex-specific heart failure phenotypes, as explicit model inputs.
3. Participatory design: involve diverse patient groups in data acquisition, model training, and validation.
4. Subgroup performance testing: evaluate and report model performance separately for each demographic group (a sketch of such a check follows this list).
5. Transparent reporting: document dataset composition, known limitations, and subgroup results so that clinicians can judge for whom a model is reliable.
6. Mitigation of device-related bias: calibrate and validate sensor inputs, such as pulse oximetry, across skin tones and other patient characteristics.
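As an illustration of strategy 4 (and of the reporting in strategy 5), the sketch below evaluates a model separately per subgroup and fails loudly when any group's discrimination falls below a floor or the best-worst gap exceeds a tolerance. The function name and thresholds are our own illustrative choices, not an established standard.

```python
from sklearn.metrics import roc_auc_score

def subgroup_performance_gate(y_true, y_score, groups,
                              min_auc=0.75, max_gap=0.05):
    """Compute AUC per subgroup; fail if any group falls below min_auc or
    the best-worst gap exceeds max_gap. Thresholds are illustrative and
    would need to be set per clinical use case."""
    aucs = {}
    for g in sorted(set(groups)):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        aucs[g] = roc_auc_score([y_true[i] for i in idx],
                                [y_score[i] for i in idx])
    report = {g: round(a, 3) for g, a in aucs.items()}
    print("subgroup AUCs:", report)  # transparent reporting (strategy 5)
    worst, best = min(aucs.values()), max(aucs.values())
    if worst < min_auc or best - worst > max_gap:
        raise ValueError(f"subgroup performance gate failed: {report}")
    return report

# Hypothetical scores: the model separates cases well for group X but
# barely for group Y, so the gate raises an error instead of passing.
y_true  = [0, 0, 1, 1,  0, 0, 1, 1]
y_score = [0.1, 0.2, 0.8, 0.9,  0.4, 0.6, 0.5, 0.7]
groups  = ["X", "X", "X", "X",  "Y", "Y", "Y", "Y"]
subgroup_performance_gate(y_true, y_score, groups)
# prints: subgroup AUCs: {'X': 1.0, 'Y': 0.75}, then raises ValueError
```

Embedding such a gate in the DPT validation pipeline makes subgroup failure a blocking event rather than a footnote, which is the operational meaning of equity by design.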
Beyond DPT design: Toward structural change
DPTs hold immense promise for advancing personalized medicine. Yet this promise is contingent on the representativeness and quality of the data, assumptions, and measurement tools they rely on. Evidence from cardiology, endocrinology, mental health, and medical devices shows that when women and racialized populations are underrepresented, the consequences are not hypothetical: they include misdiagnoses, misinterpretation of physiological variation, and clinically significant measurement errors. 2,11,17
But bias is not an isolated design flaw. It reflects broader systematic patterns of exclusion across biomedical research, clinical trials, and regulatory frameworks, patterns that become operational through model architectures and decision rules. Design is not the origin of these structures, but it is where they are enacted. And because it is a site of enactment, it is also a site of intervention. Nevertheless, more equitable design alone will not suffice. Technical corrections cannot compensate for upstream exclusions in data generation, research prioritization, or health system access. Efforts to build fair and inclusive DPTs must therefore be accompanied, if not preceded, by structural reforms that challenge the conditions under which exclusion is produced and legitimized. Only when design and system change are pursued in tandem can personalization become truly inclusive.
If precision medicine aspires to live up to its name, it must be precise for everyone, not only for those best represented in its data.
Opportunities and future directions
Encouragingly, the DPT paradigm itself can be part of the solution. When these models are deliberately designed for inclusion, they can highlight and even help correct biases in healthcare. For example, incorporating female-specific physiological variations into a diabetes DPT has been shown to improve model accuracy and patient glucose control. 13 Likewise, digital twin projects focusing on women's health have demonstrated the technology's potential: a recent female pelvic-floor digital twin confirmed that such tailored models are both feasible and clinically valuable. 10 These cases exemplify how paying attention to sex and gender can directly enhance the performance and fairness of digital twin models.
More broadly, new initiatives and research efforts are laying the groundwork for more equitable DPTs. Large-scale programs like the NIH's All of Us are enriching datasets with diverse participants, which will, in turn, improve the robustness of future DPT models. Researchers are also exploring participatory, co-creative development processes to ensure digital twins serve the needs of marginalized communities. 2 Each strategy outlined in this paper represents an opportunity for innovation: by embracing diverse data, interdisciplinary collaboration, and structural change, DPT developers can build systems that actively reduce disparities. With growing awareness and commitment, the next generation of DPTs could become powerful tools for closing the gender data gap and fostering health equity, ultimately fulfilling the promise of precision medicine for all.
Acknowledgements
The authors would like to thank Dr Theresa Ahrens and the Digital Health Engineering team at Fraunhofer IESE for their valuable input and collaboration. We also acknowledge the Research Group “Life, Innovation, Health, and Technology (LIGHT)” at the Institute for Technology Assessment and Systems Analysis (ITAS), Karlsruhe Institute of Technology (KIT), for their constructive feedback and support.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the European Union under
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
