Heart rate circadian phase and hyperarousal as wearable digital phenotyping of insomnia: An interpretable machine learning study

Abstract

Objective

This study evaluates ML approaches for insomnia classification using physiological and behavioral data from wearable devices. SHAP analysis identifies key predictors, highlighting the relationship between sleep disturbances and digital phenotypes and emphasizing clinical plausibility as a criterion for model selection.

Methods

Three hundred thirty-eight participants (249 with insomnia and 89 controls) aged 19–70 years were instructed to wear Fitbit Inspire 3 devices for 4 weeks to record heart rate, activity, and sleep metrics. Insomnia classification was based on Insomnia Severity Index scores (≥8 insomnia and ≤7 controls). Filter- and wrapper-based feature-selection methods were applied to the 120 extracted features. Multiple ML algorithms were evaluated using five-fold cross-validation, with the clinical plausibility of the feature relationships explicitly considered in the final model selection.

Results

The LightGBM model trained on 60 ANOVA-selected features achieved the highest performance (F1 score = 0.868 ± 0.027). The key predictive features identified by SHAP analysis included delayed acrophase of the heart rate cosinor rhythm, higher self-reported stress, and maximum heart rate, all of which aligned with sleep-wake physiology. However, several features exhibited patterns that contradicted established clinical expectations, highlighting the disconnect between statistical optimization and clinical utility.

Conclusion

Machine learning models trained on wearable data can effectively classify insomnia. SHAP analysis suggested that altered activity patterns reflect sleep disturbance, while also highlighting the necessity for further clinical validation. Clinical plausibility must be integrated as a fundamental criterion in model development to ensure clinically trustworthy ML applications in sleep medicine.

Keywords

insomnia, wearable devices, circadian rhythm, hyperarousal, machine learning, explainable AI, digital phenotyping

Introduction

Insomnia affects approximately 10–30% of the population; however, clinical diagnosis relies primarily on subjective self-report instruments that are prone to recall bias.^1–3 Although wearable technologies and digital phenotyping offer objective approaches for sleep assessment, the translation of high-performing machine learning (ML) models into clinical practice remains challenging owing to interpretability issues.^4,5

Recent advances in wearable artificial intelligence (AI)-powered solutions have demonstrated promising results in the detection of sleep disorders.^6–8 Although growing emphasis on explainable AI in healthcare has focused primarily on technical interpretability methods, few studies have examined whether ML-identified patterns align with clinical knowledge or how contradictory findings influence model selection.^9–11 This gap is particularly concerning in sleep medicine, where complex interactions between behavioral, physiological, and psychological factors require nuanced clinical interpretations.

While predictive performance remains a central benchmark in machine learning research, the clinical deployment of AI systems requires evaluation criteria that extend beyond statistical accuracy. Prior work has emphasized that machine learning systems trained on large observational datasets may generate patterns that appear statistically valid yet lack clinical validity. When model optimization targets a specific performance metric, models may exploit dataset-specific artifacts or latent confounders that improve predictive performance but do not reflect meaningful physiological relationships.¹² Moreover, in the context of healthcare machine learning, excessive optimization may encourage models to capture spurious associations that fail to generalize across populations or clinical contexts.¹³

When models provide recommendations based on counterintuitive patterns, clinicians may lose confidence in AI-assisted decision making, potentially leading to inappropriate interventions or limiting the adoption of beneficial technologies.^14,15 In high-stakes healthcare environments, comprehending the rationale behind AI-driven decisions is essential for maintaining clinical trust and ensuring patient safety.¹⁶ In addition, prior studies have highlighted the risk of a “false sense of certainty” in AI-assisted clinical decision-making, where clinicians may inadvertently rely on model outputs despite underlying inaccuracies or spurious patterns, underscoring the importance of incorporating clinically grounded validation criteria beyond predictive performance alone.¹⁷ These concerns have been increasingly reflected in emerging regulatory and methodological frameworks for trustworthy medical AI. Regulatory guidance from the U.S. Food and Drug Administration for AI/Machine learning-based Software as a Medical Device emphasizes transparency, including the logic or explainability of model outputs.¹⁸ Similarly, the World Health Organization has highlighted explainability as a core ethical principle for AI in health.¹⁹ Reporting guidelines such as TRIPOD+AI recommend transparent reporting of interpretability methods and their validation, along with conventional performance measures.²⁰

In this study, we adopt this perspective by explicitly considering clinical plausibility as an additional model-selection criterion when developing machine learning models for insomnia classification using wearable-derived digital phenotypes. Rather than selecting models solely based on predictive performance, we examine whether the relationships identified by each model align with established clinical and physiological knowledge about sleep, circadian rhythms, and behavioral activity patterns.

This study explored ML models for insomnia classification based on wearable-derived digital phenotypes, with explicit consideration of clinical plausibility. Our objectives were (1) to develop and evaluate ML models for insomnia classification using wearable-derived digital phenotypes and (2) to demonstrate the necessity of integrating clinical plausibility as a fundamental criterion in model selection, beyond predictive performance alone, to ensure clinically trustworthy applications.

Methods

Participants and data collection

Study design

This single-center prospective observational study was conducted at Korea University Anam Hospital, Seoul, Republic of Korea, between January 2023 and July 2024. This study is part of an ongoing research program employing a deep phenotyping approach to comprehensively characterize insomnia, conducted from March 2023 to October 2024 (registered at the Clinical Research Information Service: KCT0009175; protocol published in Lee et al., Front. Psychiatry 2025).²¹

Study population

A total of 338 participants aged 19–70 years were recruited from Korea University Anam Hospital between January 2023 and July 2024 as part of the ongoing CRIS-registered study (KCT0009175). Based on Insomnia Severity Index (ISI) scores, 249 participants (73.67%) were classified as the insomnia group (ISI ≥8) and 89 participants (26.33%) as controls (ISI ≤7). In this study, using a lower ISI cutoff score of ≥8 to define the insomnia group allows identification of individuals with clinically relevant insomnia symptoms including subthreshold insomnia that may precede the development of chronic insomnia.² Individuals with intellectual disabilities, organic brain damage, schizophrenia spectrum disorder, ongoing sleep disorder treatment, or those without a smartphone were excluded.

Data collection

Data collection was conducted over 4 weeks using three sources: wearable device monitoring, smartphone-based ecological momentary assessment, and self-reported case report forms (CRFs). Upon enrollment, the participants completed structured CRFs and provided demographic data, family history, current illnesses, and sleep characteristics, and ISI scores were calculated.

Participants continuously wore a wearable device (Fitbit Inspire 3, Fitbit Inc., USA), which passively recorded heart rate every 5 seconds and step count, moving distance, and exercise time every 5 minutes. Daily sleep metrics included total sleep time, REM/light/deep sleep time, sleep onset/offset times, wake after sleep onset episodes, and sleep efficiency. Data were segmented into weekdays/holidays, daytime (8:00 AM to 6:00 PM), and bedtime (6:00 PM to 8:00 AM) periods.
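
The segmentation rule can be illustrated with a short Python sketch. It is not the study code: the function name label_segment is hypothetical, Saturday and Sunday stand in for "holidays" (public holidays are not handled), and the day type is taken from the calendar date of the timestamp, which is an assumption for bedtime periods that cross midnight.

```python
from datetime import datetime


def label_segment(ts: datetime) -> tuple[str, str]:
    """Return (day_type, period) for a single timestamp."""
    # Assumption: weekends approximate "holidays"; public holidays are ignored.
    day_type = "holiday" if ts.weekday() >= 5 else "weekday"
    # Daytime runs from 08:00 (inclusive) to 18:00 (exclusive); the rest is bedtime.
    period = "daytime" if 8 <= ts.hour < 18 else "bedtime"
    return day_type, period


monday_morning = label_segment(datetime(2023, 3, 6, 9, 30))
saturday_night = label_segment(datetime(2023, 3, 11, 23, 0))
print(monday_morning, saturday_night)
```

Descriptive metrics such as the maximum or mean heart rate can then be aggregated per (day_type, period) group.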

In addition, participants installed a custom smartphone application ‘SOMDAY’ (Lumanlab Inc, Seoul, Republic of Korea) developed specifically for this study to capture subjective daily lifestyle data, complementing objective wearable data. Daily lifestyle factors, including caffeine and alcohol intake, stress levels, and total nap time, were reported using SOMDAY.^8,22,23

This study was conducted in accordance with the principles of the Declaration of Helsinki. All procedures were reviewed and approved by the Institutional Review Board (IRB) of Korea University Anam Hospital (No. 2022AN0587). All participants provided written informed consent at the beginning of the study, following a clear explanation of the study’s purpose, procedures, potential risks and benefits, data-handling procedures, and the voluntary nature of participation.

Data preprocessing

The heart rate data were used to compute descriptive metrics (maximum, minimum, mean, variance, and standard deviation) for each time segment. Cosinor analysis within 72-h intervals estimated the following circadian rhythm parameters: amplitude, acrophase, MESOR, and goodness of fit.²⁴ Exercise intensity was calculated based on the heart rate relative to the maximum heart rate and classified according to established criteria.
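
As an illustration of how the cosinor parameters can be obtained, the following Python sketch fits a single-component 24-h cosinor by ordinary least squares. It is a minimal example under stated assumptions, not the study implementation: the model is written as y = M + A·cos(ωt − φ), so the acrophase φ is the phase of the fitted peak in radians from the time origin (the sign convention used in the study is not stated), goodness of fit is taken to be R², and the data are synthetic.

```python
import numpy as np


def fit_cosinor(t_hours, y, period=24.0):
    """Least-squares fit of y = M + A*cos(w*t - phi), with w = 2*pi/period."""
    t = np.asarray(t_hours, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 2.0 * np.pi / period
    # Linearized form: y = M + beta*cos(w*t) + gamma*sin(w*t)
    X = np.column_stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    mesor, beta, gamma = coef
    amplitude = np.hypot(beta, gamma)
    acrophase = np.arctan2(gamma, beta) % (2.0 * np.pi)
    resid = y - X @ coef
    r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)
    return {
        "mesor": float(mesor),
        "amplitude": float(amplitude),
        "acrophase": float(acrophase),
        "r2": float(r2),
    }


# Synthetic 72-h heart rate series sampled every 5 minutes (hypothetical values).
rng = np.random.default_rng(0)
t = np.arange(0, 72, 1 / 12)
hr = 70 + 10 * np.cos(2 * np.pi * t / 24 - 2.5) + rng.normal(0, 2, t.size)
fit = fit_cosinor(t, hr)
print(fit)
```

Under this convention, the peak time in hours is acrophase × 24 / (2π), so a larger acrophase corresponds to a later peak.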

Activity levels were analyzed using step count and moving distance with basic descriptive metrics (maximum, minimum, mean, variance, and standard deviation). Nonparametric circadian features including interdaily stability (IS), intradaily variability (IV), relative amplitude (RA), and mean activity during the least active 5 h (L5) and most active 10 h (M10) captured patterns of daily rest-activity rhythms.¹¹ Cosinor analysis was also performed using the step count data.¹¹
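
The nonparametric features can be sketched in Python as follows. This example assumes hourly binned step counts, the commonly used formulas for IS and IV, and circular 5-h and 10-h windows on the average 24-h profile; the function name and the toy data are hypothetical, and the binning used in the study may differ.

```python
import numpy as np


def rest_activity_features(hourly, hours_per_day=24):
    """Compute IS, IV, L5, M10, and RA from an hourly activity series."""
    x = np.asarray(hourly, dtype=float)
    n_days = x.size // hours_per_day
    x = x[: n_days * hours_per_day]  # keep whole days only
    n = x.size
    total_ss = np.sum((x - x.mean()) ** 2)
    profile = x.reshape(n_days, hours_per_day).mean(axis=0)  # average 24-h profile

    # Interdaily stability: variance of the average profile relative to total variance.
    is_value = (n * np.sum((profile - x.mean()) ** 2)) / (hours_per_day * total_ss)
    # Intradaily variability: hour-to-hour changes relative to total variance.
    iv_value = (n * np.sum(np.diff(x) ** 2)) / ((n - 1) * total_ss)

    wrapped = np.concatenate([profile, profile])  # allow windows to cross midnight

    def window_means(width):
        return np.array([wrapped[s : s + width].mean() for s in range(hours_per_day)])

    l5 = window_means(5).min()
    m10 = window_means(10).max()
    ra = (m10 - l5) / (m10 + l5)  # undefined if both are zero
    return {"IS": float(is_value), "IV": float(iv_value),
            "L5": float(l5), "M10": float(m10), "RA": float(ra)}


# Toy week: 8 inactive hours followed by 16 active hours, repeated identically.
one_day = [0] * 8 + [100] * 16
features = rest_activity_features(one_day * 7)
print(features)
```

Because every day in the toy series is identical, IS equals 1; irregular day-to-day timing lowers IS, and fragmented activity within days raises IV.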

This process yielded a total of 120 features (for a complete list, see Supplementary Material 1). Missing values were primarily due to device non-wear or signal disconnection, and the feature-wise missingness rates are provided in Supplementary Material 2.

To address missing values, a sensitivity analysis was performed comparing four imputation strategies: groupwise mean imputation, MissForest, multivariate imputation by chained equations (MICE), and k-nearest neighbors (KNN). The MissForest algorithm was selected as the primary method because it captures complex, non-linear interactions between variables without requiring prior distributional assumptions. Furthermore, we avoided mean imputation to prevent potential data leakage. Imputation was performed separately within strata defined by ISI score to preserve potential differences in activity patterns across sleep disturbance severity. The performance of different imputation methods was evaluated using a baseline logistic regression model.

Model construction

Feature selection

Python version 3.12.3 was used for all analyses. To examine the impact of different feature selection strategies on model performance and interpretability, both filter and wrapper methods were applied. Three statistical filter methods (mutual information, ANOVA, and chi-squared statistics) were used to score and rank the features. The models were trained on feature subset sizes ranging from k=1 to k=120, based on the top-k ranked features from each filter method. A model-specific wrapper approach was implemented using the Optuna framework. This method used categorical suggestions to determine the inclusion or exclusion of each feature, directly optimizing the model’s objective function based on the resulting feature subsets.²⁵
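
The top-k filter strategy can be sketched with scikit-learn as follows. This is an illustrative example rather than the study pipeline: the data are synthetic, logistic regression stands in for the five algorithms, only four values of k are scanned, and the ANOVA filter is placed inside the cross-validation pipeline, which is an assumption made here to keep feature ranking within each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 338 x 120 feature matrix (about 74% positive class).
X, y = make_classification(
    n_samples=338, n_features=120, n_informative=10,
    weights=[0.26, 0.74], random_state=0,
)


def mean_f1_for_top_k(k):
    """Mean five-fold F1 using the top-k ANOVA-ranked features."""
    model = make_pipeline(
        StandardScaler(),
        SelectKBest(score_func=f_classif, k=k),  # ANOVA F-test filter
        LogisticRegression(max_iter=1000),
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="f1").mean()


scores = {k: mean_f1_for_top_k(k) for k in (1, 10, 60, 120)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Swapping f_classif for mutual_info_classif or chi2 (the latter requires non-negative inputs) gives the other two filter rankings.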

Machine learning models

Five supervised ML algorithms were compared: logistic regression, Random Forest, XGBoost, LightGBM, and Support Vector Machine.

Model training and validation

Model training was performed using Python scikit-learn v.1.6.1, XGBoost v.2.1.4, LightGBM v.4.6.0, and SHAP v.0.48.0. To perform a binary classification of the insomnia and normal sleep groups, the dataset was split into a training set (70%) and an external validation set (30%). Hyperparameter tuning and feature selection parameters were optimized simultaneously using the Optuna optimization framework. The Tree-structured Parzen Estimator (TPE) sampler was used to explore the search space for 300 trials per model.

Statistical analysis

Performance was assessed using accuracy, precision, recall, and F1 score, with the F1 score serving as the primary selection metric owing to class imbalance. The models were evaluated using stratified five-fold cross-validation repeated ten times, with metrics averaged across folds and repetitions. A Wilcoxon signed-rank test was used to compare the F1 score of the best-performing model against those of the alternative models.²⁶ Discrimination was further assessed using the area under the receiver operating characteristic curve (AUC-ROC). Model calibration was evaluated using the Brier score and calibration curves.
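
A minimal sketch of this evaluation scheme is given below. It assumes that the Wilcoxon signed-rank test is applied to fold-level F1 scores paired across identical splits, which the text does not specify; the data are synthetic and the two classifiers are placeholders for the compared models.

```python
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (about 74% positive class).
X, y = make_classification(
    n_samples=338, n_features=20, weights=[0.26, 0.74], random_state=0
)

# Five folds repeated ten times gives 50 F1 scores per model.
# A fixed random_state makes both models see exactly the same splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

f1_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
f1_b = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=cv, scoring="f1",
)

# Paired, nonparametric comparison of the fold-level F1 scores.
statistic, p_value = wilcoxon(f1_a, f1_b)
print(f1_a.mean(), f1_b.mean(), p_value)
```

Scores from repeated cross-validation share training data and are therefore not fully independent, so p values obtained this way are best read as approximate.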

Model interpretability assessment

Model interpretability was assessed using SHapley Additive exPlanations (SHAP) values and summary plots. Clinical plausibility of the feature relationships was explicitly evaluated and integrated into the final model selection decisions, prioritizing alignment with established clinical knowledge alongside statistical performance (Figure 1).

Figure 1.

Schematic diagram of the study method.

Results

Sensitivity analysis of imputation methods

The sensitivity analysis revealed that mean imputation and MissForest yielded comparable predictive performance, with MissForest showing a slight advantage in overall accuracy. Although the F1 score for mean imputation (0.763) was marginally higher than that for MissForest (0.761), the difference was negligible. We prioritized MissForest for the final model development because of its superior ability to maintain the multivariate distribution and data integrity. The full results are presented in Supplementary Material 3.

Model performance

The insomnia group had significantly higher ISI scores (14.8±4.5 vs. 3.7±2.3, p < 0.001) and a delayed heart rate circadian acrophase (2.63±0.41 vs. 2.45±0.41 radians, p = 0.001). Among the filter-based models, LightGBM trained on 60 ANOVA-selected features achieved the best performance (F1 score = 0.868 ± 0.027). The full list of selected features is provided in Supplementary Material 4. Table 1 presents the performance metrics of the five machine learning models with the optimal numbers of selected features. The mean AUC-ROC was 0.643 ± 0.085, and the curve is depicted in Figure 2(a). The gap between the F1 score (0.868 ± 0.027) and the AUC-ROC is consistent with the marked class imbalance (73.7% insomnia vs. 26.3% controls) and the use of a default 0.5 probability threshold, under which a recall-driven F1 does not directly translate into strong rank-based discrimination. The mean Brier score was 0.188 ± 0.018. The calibration curve showed reasonable overall agreement between predicted probabilities and observed frequencies, with mild deviation in the low-probability region (Figure 2(b)). The Brier score is close to the no-skill reference value (∼0.194) implied by the marginal class distribution, indicating that probability calibration is reasonable in shape but leaves room for refinement.
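
The reference values implied by the class distribution can be reproduced with simple arithmetic. The sketch below uses only the reported group sizes; the all-positive classifier is a hypothetical baseline introduced here for comparison and is not a model from the study.

```python
n_insomnia, n_control = 249, 89
prevalence = n_insomnia / (n_insomnia + n_control)

# A constant forecast equal to the prevalence p has Brier score p * (1 - p).
brier_no_skill = prevalence * (1 - prevalence)

# Hypothetical baseline that labels every participant as insomnia:
# recall is 1 and precision equals the prevalence.
precision, recall = prevalence, 1.0
f1_all_positive = 2 * precision * recall / (precision + recall)

print(round(brier_no_skill, 3), round(f1_all_positive, 3))  # 0.194 0.848
```

With this class distribution, an F1 score of about 0.848 is attainable without any discrimination, which offers a useful reference point when reading the F1 scores in Table 1 and helps explain why a high F1 can coexist with a moderate AUC-ROC.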

Table 1.

Optimal number of selected features and model performance metrics for each model trained with different feature selection methods.

Algorithm	Feature selection	k	Accuracy	Precision	Recall	F1 score	p value
Logistic Regression	ANOVA	4	0.743 ± 0.022	0.747 ± 0.013	0.984 ± 0.020	0.849 ± 0.012	0.032
	Chi square	92	0.737 ± 0.007	0.737 ± 0.007	1.000 ± 0.000	0.848 ± 0.005	0.060
	Mutual information	66	0.743 ± 0.022	0.746 ± 0.014	0.988 ± 0.019	0.850 ± 0.013	0.064
	Wrapper	64	0.740 ± 0.017	0.740 ± 0.012	0.996 ± 0.012	0.849 ± 0.010	0.082
Random Forest	ANOVA	3	0.728 ± 0.028	0.743 ± 0.016	0.964 ± 0.028	0.839 ± 0.017	0.001
	Chi square	11	0.752 ± 0.026	0.751 ± 0.018	0.992 ± 0.016	0.855 ± 0.015	0.156
	Mutual information	118	0.743 ± 0.034	0.752 ± 0.017	0.972 ± 0.040	0.847 ± 0.023	0.049
	Wrapper	110	0.746 ± 0.032	0.749 ± 0.018	0.984 ± 0.037	0.851 ± 0.021	0.108
XGBoost	ANOVA	30	0.725 ± 0.054	0.753 ± 0.023	0.932 ± 0.069	0.832 ± 0.038	0.005
	Chi square	20	0.722 ± 0.029	0.740 ± 0.015	0.960 ± 0.044	0.835 ± 0.020	0.003
	Mutual information	95	0.728 ± 0.031	0.737 ± 0.015	0.980 ± 0.032	0.841 ± 0.020	0.011
	Wrapper	54	0.728 ± 0.021	0.734 ± 0.009	0.988 ± 0.026	0.842 ± 0.014	0.027
LightGBM	ANOVA	60	0.784 ± 0.045	0.792 ± 0.029	0.960 ± 0.036	0.868 ± 0.027	-
	Chi square	1	0.760 ± 0.027	0.772 ± 0.023	0.960 ± 0.040	0.855 ± 0.017	0.162
	Mutual information	80	0.740 ± 0.011	0.739 ± 0.010	1.000 ± 0.000	0.850 ± 0.006	0.086
	Wrapper	108	0.737 ± 0.007	0.737 ± 0.007	1.000 ± 0.000	0.848 ± 0.005	0.060
Support Vector Machine	ANOVA	46	0.737 ± 0.007	0.737 ± 0.007	1.000 ± 0.000	0.848 ± 0.005	0.060
	Chi square	8	0.734 ± 0.012	0.736 ± 0.008	0.996 ± 0.012	0.846 ± 0.008	0.048
	Mutual information	73	0.752 ± 0.022	0.750 ± 0.014	0.996 ± 0.012	0.855 ± 0.012	0.207
	Wrapper	82	0.752 ± 0.018	0.748 ± 0.014	1.000 ± 0.000	0.856 ± 0.009	0.130

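k, number of selected features; ANOVA, analysis of variance; p value, Wilcoxon signed-rank test comparing the F1 score with that of the best-performing model (LightGBM with ANOVA).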

Figure 2.

(a) The AUC-ROC and (b) the calibration curve of the best performing ANOVA-filtered LightGBM model.

Clinical interpretability assessment

The SHAP summary plot of the best-performing LightGBM model (Figure 3) showed that the most influential feature was HR_CR_acrophase: a delayed acrophase of the heart rate cosinor rhythm increased the predicted probability of insomnia. Higher self-reported stress also contributed to the prediction of insomnia, whereas lower stress did not show a clear association. Higher maximum daytime heart rates on both weekdays and holidays were associated with a lower predicted risk of insomnia.

Figure 3.

SHAP summary plot of the ANOVA-filtered LightGBM model (k=60), which achieved the best F1 score. The top 20 features with the highest SHAP values are shown. Details of the interpretation of SHAP analysis are available in Supplementary Material 5.

While most heart rate related features aligned with clinical knowledge and were clinically interpretable, several features directly related to sleep contradicted previously known patterns. Higher intradaily variability (IV_week, IV_holiday) was associated with reduced insomnia risk, opposing the established understanding that elevated IV reflects circadian disruption linked to greater sleep disturbances.²⁷ The sleep efficiency paradox, in which insomnia patients showed better efficiency than controls, raises concerns about relying on wearable-derived metrics without clinical validation.²⁸

These counterintuitive patterns suggest that the model may be capturing systematic biases or artifacts inherent in wearable-derived features rather than true physiological mechanisms. Although the model demonstrated acceptable predictive performance, its outputs may be misleading if interpreted without caution.

To further validate the stability of feature importance, we conducted a sensitivity analysis by comparing the top 5 SHAP values across the 20 best-performing models. Feature consistency analysis revealed that HR_CR_acrophase and self-reported stress were consistently identified as top predictors in 15 (75%) and 16 (80%) models, respectively. This high level of consensus underscores that circadian rhythm phase shifts and psychological stress are robust predictors of insomnia, independent of model architecture or hyperparameter configurations. In contrast, sleep_efficiency_week appeared in only 9 (45%) models. This relatively low consistency, combined with the observed “sleep efficiency paradox,” suggests that while sleep efficiency is a relevant factor, its predictive value may be more susceptible to model-specific biases or inherent measurement noise in wearable-derived sleep metrics. The SHAP summary plots of all models are presented in Supplementary Material 6.
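
The consistency count can be expressed as a short Python sketch. The function name, the feature name "stress", and the toy rankings are hypothetical; each ranking is assumed to list features in descending order of mean absolute SHAP value for one model.

```python
from collections import Counter


def top_k_consistency(rankings, k=5):
    """Fraction of models in which each feature appears among the top k."""
    counts = Counter(
        feature for ranked in rankings for feature in set(ranked[:k])
    )
    return {feature: count / len(rankings) for feature, count in counts.items()}


# Toy example with four models and k = 2 (illustrative values only).
rankings = [
    ["HR_CR_acrophase", "stress", "IV_week"],
    ["stress", "HR_CR_acrophase", "sleep_efficiency_week"],
    ["stress", "IV_week", "HR_CR_acrophase"],
    ["IV_week", "stress", "sleep_efficiency_week"],
]
consistency = top_k_consistency(rankings, k=2)
print(consistency)
```

Applied to 20 models with k = 5, a feature present in 15 top-five lists would receive a consistency of 0.75.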

Final model selection: Integration of clinical reasoning

The final selection of the best performing model integrated the following clinical reasoning: (1) adequate performance (F1 score = 0.868 ± 0.027), (2) interpretable relationships allowing clinical evaluation, and (3) transparency enabling informed clinician decision-making when predictions contradict expectations.

Discussion

The evaluation of symptoms and diagnosis in clinical psychiatry largely depends on patients’ self-reported symptoms. By utilizing digital phenotyping in psychiatry, many limitations of traditional clinical evaluation and diagnosis can potentially be addressed. In this study, we developed insomnia severity classification algorithms using automatically recorded passive data. This underscores the potential of wearable device data, digital phenotyping, and machine learning in providing a more reliable and scalable solution for insomnia classification.

The interpretable predictors recovered by our model can be situated within the two canonical pathophysiological frameworks of insomnia. The two-process model of sleep regulation describes sleep as the alignment of a homeostatic sleep drive (Process S), which accumulates during wakefulness, with a circadian arousal–sleep propensity rhythm (Process C).²⁹ Insomnia is increasingly understood as a disorder in which Process S–C alignment is disrupted and superimposed on a state of chronic 24-hour hyperarousal involving cognitive, emotional, and autonomic dimensions.³⁰ Notably, the most robust predictors identified in this study by SHAP analysis — delayed heart rate circadian acrophase, self-reported stress, and elevated tonic heart rate — map directly onto these two axes. This convergence suggests that the model, despite being trained without any mechanistic constraint, has recapitulated the principal physiological signatures of insomnia in a fully data-driven manner.

The leading predictor in our model, delayed acrophase of the heart rate cosinor rhythm, is best interpreted as a marker of circadian phase delay in autonomic nervous system activity. Heart rate exhibits a strong circadian rhythm driven by sympathetic–parasympathetic balance, with peak sympathetic activity typically occurring during the active phase and progressive sympathetic withdrawal during the rest phase.³¹ A delay in this acrophase implies that sympathetic dominance persists into the late evening and early sleep period — a state physiologically incompatible with the rapid sleep onset and consolidated nocturnal sleep expected under intact Process C. This pattern is consistent with longstanding observations of evening-type chronotype, weakened nocturnal parasympathetic tone, and circadian misalignment in insomnia, and it provides physiological grounding for the predictive value of HR_CR_acrophase.^32,33 Importantly, this interpretation positions the feature not as a statistical correlate but as a wearable-accessible proxy of Process C dysregulation, a property that is particularly attractive for longitudinal monitoring in real-world settings.

A second cluster of robust predictors — higher self-reported stress, elevated daytime minimum heart rate, and blunted maximum daytime heart rate — coheres tightly with the hyperarousal model of insomnia. Hyperarousal is characterized by sustained activation of the hypothalamic–pituitary–adrenal axis and the sympathetic nervous system across the 24-hour cycle, manifesting as elevated tonic heart rate, reduced heart rate variability, heightened emotional reactivity to daily stressors, and impaired downregulation of arousal in response to environmental cues.³⁴ The combination of an elevated minimum heart rate (indicating tonic sympathetic dominance) and a relatively suppressed maximum heart rate (indicating a diminished autonomic dynamic range and blunted reactivity) is a recognizable signature of this state. Coupled with the strong predictive contribution of subjective stress, these features suggest that the model is sensitive to both the autonomic and the psychological correlates of hyperarousal — factors that are widely regarded as precipitating and perpetuating in chronic insomnia.³⁵ The simultaneous identification of circadian phase delay and hyperarousal-related autonomic features therefore positions the model not as a black-box classifier of subjective sleep complaint, but as a tool whose decisions are anchored in the canonical pathophysiology of insomnia.

Several features, in contrast, exhibited associations that contradict established clinical expectations and warrant a more cautious interpretation. The “sleep efficiency paradox,” in which higher wearable-derived sleep efficiency was associated with greater predicted insomnia risk, is unlikely to reflect a genuine physiological phenomenon. It is most plausibly explained by two convergent factors. First, consumer-grade wearable devices systematically over-detect sleep during periods of motionless quiet wakefulness, a well-documented limitation of actigraphy-based sleep estimation³⁶; this leads to inflated sleep efficiency in patients whose hyperarousal manifests as cognitively active but motorically still wakefulness. Second, this pattern is reminiscent of paradoxical insomnia, a recognized clinical subtype in which patients report severe sleep disturbance despite relatively preserved objective sleep parameters^37,38; if such cases are present in our cohort, the model may be detecting a phenotype in which subjective complaint and objective wearable measurement diverge. The paradoxical inverse association between intradaily variability and insomnia risk admits a complementary interpretation grounded in Process S: healthy individuals may engage in a more diverse mix of weekend and holiday activities (social interaction, exercise, outdoor activity), naturally producing high intradaily variability, whereas individuals with insomnia, who often experience daytime fatigue and reduced motivation, may show monotonous, sedentary patterns that depress intradaily variability despite worse sleep at night. The paradoxical relative amplitude finding likely extends the same dynamic. None of these patterns dismiss the underlying clinical concern; rather, they highlight that wearable-derived sleep and activity metrics are noisy proxies of the constructs they are intended to capture, and that their predictive use is most defensible when paired with mechanistic interpretation rather than treated as direct clinical readouts.

Taken together, these mechanistic and paradoxical findings inform the clinical translation of wearable-based machine learning for insomnia. The discrimination–calibration profile of our model — strong recall-driven F1, moderate AUC-ROC, and a Brier score only modestly above the no-skill reference implied by the marked class imbalance — is consistent with a screening-stage decision-support role rather than autonomous diagnosis. Such a role is well aligned with the mechanistically interpretable predictors we have identified: features that explicitly index Process C dysregulation and autonomic hyperarousal can be communicated to clinicians, cross-checked against subjective report and clinical history, and integrated into existing diagnostic workflows, whereas features that behave paradoxically (sleep efficiency, intradaily variability, relative amplitude) should be flagged for clinical scrutiny rather than acted upon directly. This stratified handling of model outputs — privileging predictors with established mechanistic anchoring while contextualizing wearable artifacts as artifacts — exemplifies the clinical plausibility verification stage of a structured validation framework for healthcare AI and offers a concrete template for translating wearable-derived ML into sleep-medicine practice.

Our findings argue for a validation framework that extends beyond conventional performance metrics. Building on recent clinical AI implementation guidelines,³⁹ we frame healthcare AI evaluation as a three-stage process: (i) data integrity, meaning representative, unbiased datasets reflecting the target clinical population; (ii) statistical performance, meaning conventional metrics including accuracy, sensitivity, specificity, discrimination, and calibration across patient subgroups; and (iii) clinical plausibility verification, meaning an explicit assessment of whether identified patterns align with established physiological knowledge. This approach directly addresses the “implementation gap” that has limited real-world deployment of healthcare AI,⁴⁰ and is consistent with the multidimensional criteria advocated by the British Standard BS30440 and the European ITFoC consortium’s seven-step framework.^39,41,42 Our SHAP-based plausibility assessment operationalizes the third stage in a sleep-medicine context: it simultaneously surfaced biologically coherent predictors (delayed heart rate circadian acrophase, self-reported stress, daytime maximum heart rate) and paradoxical patterns (intradaily variability, sleep efficiency, relative amplitude). This dual outcome illustrates that high statistical performance, however quantified, is an incomplete signal of clinical readiness, and that the third stage is operational rather than aspirational.

This study has certain limitations that warrant consideration. The single-site Korean cohort used in the study may limit the generalizability to diverse populations with different sleep patterns and cultural contexts. The 4-week observation period provides only a cross-sectional snapshot of behavioral patterns and potentially misses seasonal variations or long-term trends relevant to chronic insomnia assessment. Our focus on clinical plausibility, while essential for trust and safety of the findings, may undervalue novel biological patterns that contradict the current understanding but may represent genuine discoveries. The challenge lies in distinguishing between statistical artifacts and previously unrecognized biological relationships, a distinction that requires careful clinical validation and mechanistic understanding. Future studies should employ comprehensive sensitivity analyses that compare multiple imputation strategies to ensure robust findings across different missing data scenarios.

Future research should prioritize prospective multisite validation studies employing randomized controlled trial designs with clinical outcomes as the primary endpoints. These studies must evaluate model generalizability across diverse populations and healthcare settings to ensure equitable AI deployment and identify potential algorithmic biases. Methodological advances should focus on developing interpretability-aware modeling frameworks that integrate automated feature selection with clinical knowledge constraints. The development of standardized evaluation metrics able to quantify clinical plausibility along with statistical performance represents a crucial next step in incorporating expert clinical judgment and alignment with established pathophysiological mechanisms. Finally, collaboration with regulatory agencies is essential to establish standardized clinical validation requirements for the application of AI in sleep medicine.

Conclusion

In this study, we developed insomnia prediction algorithms using automatically recorded passive data. This underscores the potential of wearable device data, digital phenotyping, and machine learning in providing a more reliable and scalable solution. This study contributes to the digital health field by emphasizing clinical trustworthiness in machine learning models. Models can achieve high F1 scores while exhibiting patterns that contradict clinical knowledge, potentially compromising physician trust and patient safety. Our approach recognizes that healthcare AI must support clinical decision-making and patient care and not merely achieve statistical benchmarks. Future ML applications in sleep medicine must prioritize the integration of clinical knowledge and interpretability to ensure that advances in AI translate into safe, effective, and widely adopted clinical applications.

Supplemental material

Supplemental material for Original research article heart rate circadian phase and hyperarousal as wearable digital phenotyping of insomnia: An interpretable machine learning study by Minji Kim, Seojin Yun, Hyungju Kim, Emma Matsushita, Ji Won Yeom, Sujin Kim, Seung Pil Pack, Heon-Jeong Lee, Taesu Cheong, Chul-Hyun Cho in DIGITAL HEALTH

Footnotes

ORCID iDs

Minji Kim

Seojin Yun

Heon-Jeong Lee

Chul-Hyun Cho

Authors' contributions

Conceptualization: MK, SY, CHC.

Data curation: HK, SPP, CHC.

Formal analysis: MK, SY, HK, EM, CHC.

Funding acquisition: SPP, CHC.

Investigation: MK, SY, HK, EM, JWY, SK, SPP, HJL, TC, CHC.

Methodology: MK, SY, HK, EM, TC, CHC.

Project administration: SPP, HJL, TC, CHC.

Resources: CHC.

Software: MK, SY, HK, TC, CHC.

Supervision: SK, SPP, HJL, TC, CHC.

Validation: MK, SY, HK, CHC.

Visualization: MK, SY, HK, CHC.

Writing – original draft: MK, SY, HK, CHC.

Writing – review & editing: MK, SY, HK, CHC.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by grants from the National Research Foundation (NRF) of Korea (grant numbers: NRF-2021R1A5A8032895, NRF-2022M3C1B6080866, and RS-2026-25471696).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets generated and/or analyzed in this study are available from the corresponding author upon reasonable request.

Trial Registration

Clinical Research Information Service (CRIS) KCT0009175 (Registration date: Feb-15-2024) (https://cris.nih.go.kr/cris/search/detailSearch.do?search_lang=E&focus=reset_12&search_page=M&pageSize=10&page=undefned&seq=26133&status=5&seq_group=26133).

Declaration of AI use

Generative AI (Google Gemini, OpenAI ChatGPT) was utilized solely for linguistic polishing and code refinement to enhance the presentation of the results. The final manuscript and all computational outputs were critically reviewed and validated by the authors, who maintain complete accountability for the integrity of the work.

Supplemental material

Supplemental material for this article is available online.

