Abstract
Background
Pressure injuries (PIs) remain a major global healthcare challenge, increasing morbidity and costs. While Machine Learning (ML) models effectively predict PI risk, their lack of uncertainty quantification limits clinical trust. Conformal Prediction (CP) addresses this issue by providing statistically valid confidence estimates, enhancing model transparency and reliability in clinical practice.
Objective
This study aims to improve the reliability and clinical applicability of PI prediction models by integrating CP with traditional ML algorithms, enabling uncertainty-aware predictions for safer and more transparent clinical decisions.
Methods
A methodological study was conducted to evaluate ML classifiers for PI risk classification using routinely collected clinical data and alternative data-processing strategies. Model performance was assessed through paired statistical comparisons across resampling folds. Models showing superior performance were subsequently calibrated using a CP framework to generate uncertainty-aware prediction sets, which were evaluated using coverage and efficiency metrics across different confidence levels.
Results
The overall incidence of PIs was 27%, with significant predictors including hospital ward, risk level, mattress use, containment, dermatitis, and age (p < 0.05). Among all models, XGBoost (XGB) and Random Forest (RF) showed the best predictive performance. After calibration with CP, both produced well-aligned uncertainty estimates. At α = 0.10, XGB achieved coverage = 0.949 and efficiency = 1.34, outperforming RF with tighter, more informative prediction sets: about 5% of cases fell outside the prediction set, and the average set size was closer to 1.0 than for the RF model. This framework enhanced model interpretability in clinical settings without compromising accuracy.
Conclusions
Integrating CP into ML models may improve the interpretability and reliability of risk predictions by quantifying uncertainty. Although the findings are promising, they should be interpreted with caution given the modest sample size, event rate, single-center design, and potential variability in clinical documentation. This framework provides a foundation for uncertainty-aware decision support in PI prevention.
Keywords
1. Introduction
Pressure Injuries (PIs), also known as pressure ulcers, pressure sores, bedsores, or decubitus ulcers, are localized damage to the skin and underlying tissues caused by prolonged pressure, often in combination with friction or shear. 1 Their primary mechanism involves the interruption of blood flow, which restricts the supply of oxygen and nutrients, ultimately leading to tissue ischemia, cell death, and ulceration. 2 PIs are widely recognized as indicators of care quality within healthcare institutions and are associated with substantial negative impacts on patients’ health-related quality of life. 3 Furthermore, they represent a significant economic and organizational burden on public and private healthcare systems. 4 Common risk factors include prolonged immobility, shear forces, friction, moisture, diabetes, vascular disease, nutritional deficiencies, sensory impairments, incontinence, and overall high levels of care dependency.5–7
Globally, the prevalence of PIs remains a significant concern. The overall prevalence among hospitalized patients is approximately 13%, with 62% of cases being hospital-acquired. 8 In intensive care units (ICUs), prevalence reaches around 60%, with 53% being hospital-acquired and 23% being device-related. 9 Long-term care facilities report prevalence rates near 20%, while nursing homes report rates close to 12%.10,11 In the United States alone, an estimated 2.5 million individuals develop a PI annually during acute care. 12 Severe PIs are often classified as rare but largely preventable adverse events, commonly linked to avoidable medical errors. 13
The economic burden of PIs is significant. In the United States, each hospital-acquired PI incurs an additional cost of approximately USD 10,700 per patient, contributing to an estimated annual total of USD 26.8 billion. 14 In some health systems, PIs account for up to 7% of total expenditures. 15 Moreover, patients with PIs have longer hospital stays and higher resource utilization. 16
Risk assessment tools, particularly the Braden Scale, 17 remain the standard approach to identifying at-risk patients across hospitals, nursing homes, and general care facilities. 7 However, these tools rely on subjective clinical evaluation and may not fully capture the complexity of patient conditions, leading to a potential underestimation of risk. 6 Consequently, there is a need for complementary, data-driven approaches that support early detection and guide targeted preventive interventions. In particular, predictive models capable of identifying high-risk patients while explicitly communicating prediction uncertainty may help clinicians prioritize preventive actions and improve patient safety. Preventive strategies, such as systematic repositioning and pressure-relieving surfaces, have proven effective in reducing PI incidence and severity.18,19
In this context, advances in computational learning have yielded promising results that address the limitations of current methods in PI assessment. In particular, the wide range of models within the Machine Learning (ML) framework appears to be a natural next step for PI risk assessment. 20 However, traditional ML models provide a single probability and require choosing an arbitrary threshold; as a result, this approach does not convey the reliability of the predictions. In nursing practice, decisions are rarely binary and are made in terms of confidence and uncertainty, while borderline cases are often escalated for additional assessment.
Conformal Prediction (CP), once integrated into ML models, identifies cases for which the model is uncertain, preventing over-reliance on automation. For PI prevention, this appears to be critical, given that patient conditions change rapidly, data quality varies, and clinical heterogeneity is high. CP acknowledges uncertainty and, rather than hiding it, expands prediction sets as uncertainty increases. Furthermore, CP naturally creates a complementary triage system: patients with clearly low or high risk can receive care according to their risk level, while ambiguous cases receive manual reassessment, mirroring real workflows in health systems.
Given the multi-factorial and dynamic nature of PI development, predictive models must account not only for complex interactions among clinical variables but also for the uncertainty inherent in patient care environments. In clinical practice, risk assessments rarely rely on deterministic predictions. Instead, healthcare professionals continuously evaluate degrees of certainty when prioritizing preventive interventions and allocating resources. In this context, CP aligns naturally with clinical reasoning by distinguishing between confident and uncertain predictions, thereby enabling safer human decision processes. Rather than forcing binary risk classifications, CP allows predictive systems to explicitly communicate when model outputs should be interpreted with caution, supporting clinicians in identifying cases that may require additional monitoring or reassessment.
One of the primary barriers to the adoption of computational models in health settings is the lack of transparent uncertainty estimates accompanying model predictions, which can reduce clinicians’ trust in automated systems and limit their usefulness in decision-support workflows. 21 Many existing ML approaches provide only point predictions or probability scores, leaving healthcare professionals without clear guidance regarding the reliability of those predictions under varying clinical conditions. Consequently, there is a critical need for predictive frameworks that combine strong predictive performance with interpretable uncertainty quantification. In response to this gap, the present study aims to enhance the reliability and practical applicability of PI prediction models by integrating CP into an ML framework, an area where uncertainty-aware prediction remains unexplored. By providing statistically valid measures of predictive uncertainty, this approach seeks to improve the transparency, interpretability, and clinical usability of ML-based risk prediction tools, supporting safer and more informed preventive strategies in PI management.
1.1. Machine learning for pressure injury prediction
ML methods have become increasingly relevant in healthcare due to their ability to model complex, nonlinear relationships and support both diagnostic and prognostic tasks across diverse clinical applications.22–24
Evidence across multiple domains, including mental health, 25 multimodal precision health, 26 oncology, 27 and healthcare management, 28 demonstrates the potential of ML to enhance decision-making.
For PI prediction, ML methods have been successfully applied using diverse data sources. Recent advances have shown promising results in early PI prediction, capturing complex relationships among clinical features, and have outperformed traditional risk assessment tools.29–31 Studies have used electronic health records (EHRs) to model PI development in hospital units.32–34 Other approaches leverage unstructured clinical notes through Natural Language Processing, 35 or integrate more complex representations such as graphical models, Bayesian networks, or deep learning models analyzing imaging data.36–38
These ML models often outperform traditional tools by better capturing interactions among clinical variables and providing early identification of high-risk patients. However, most existing approaches provide only point predictions, while clinical data is inherently noisy and uncertain. For clinicians, reliable confidence measures are essential to support safe decision-making.
Following the TRIPOD reporting principles, ML models were developed using baseline clinical variables to evaluate uncertainty-aware prediction sets under different modeling configurations. The integration of CP provides calibrated uncertainty estimates, supporting methodological evaluation of model reliability and robustness under limited sample sizes and potential data variability. 39
1.2. Conformal prediction
CP provides a principled framework for quantifying uncertainty around individual predictions. It generates prediction sets with formal coverage guarantees, making the output easy to interpret and grounded in statistical validity. 40 CP can be applied to any underlying ML model without requiring assumptions about the data distribution or additional model parameterization beyond choosing a confidence level.
Applications of CP span multiple fields, including biometrics, facial recognition, biochemistry, finance, and healthcare. In medical contexts, CP has supported tasks such as disease progression prediction under uncertainty, 41 reliable genomic medicine, 42 depression prediction, 43 lung cancer detection, 44 coronary artery disease diagnosis using multi-objective methods, 45 and data cleaning in biomedical sciences. 46
The standard CP methodology involves three steps: first, selecting a nonconformity measure; second, training an ML model; and third, computing nonconformity scores for the calibration and test sets. These steps enable the generation of valid, efficient uncertainty-aware predictions. 39
The clinical characteristics of PI development make this problem especially well-suited for uncertainty-aware modeling. PIs are rare, multifactorial, and highly sensitive to variations in mobility, hemodynamic stability, moisture exposure, device-related pressure, and the quality of preventive care. Small, clinically plausible changes in these factors can substantially modify a patient’s accurate risk profile. As a result, two patients may receive the same predicted probability while differing markedly in the reliability of that estimate. CP directly addresses this challenge by producing statistically calibrated prediction sets that indicate when the model is confident and when predictions should be interpreted with caution, for instance, during atypical patient presentations, incomplete records, or shifts in clinical practice. Studies in other high-stakes clinical domains demonstrate that CP can effectively flag uncertain or unreliable predictions before they propagate into clinical decision-making, thereby enhancing trustworthiness and reducing diagnostic risk. 47 This property aligns closely with the realities of PI prevention, where false negatives may result in avoidable harm, and false positives increase the preventive workload for nursing teams already operating under high demand. By signaling the degree of uncertainty for each patient-level prediction, CP enables more transparent, risk-aware decision support, thereby strengthening the clinical utility of ML-based PI prediction systems.
2. Methods
The methodological framework applied in this paper is summarized in Figure 1. The process consists of four main stages. First, a data preprocessing stage was conducted, which included formatting the dataset and selecting relevant features based on their statistical properties and clinical plausibility. Second, a model-fitting stage was performed using a set of traditional ML algorithms. Third, a statistical comparison of the models was conducted. Finally, CP was applied to the selected models. This step allowed for the construction of prediction sets with formal coverage guarantees. Performance metrics derived from CP, such as validity and efficiency, were then calculated and compared against traditional point-prediction measures to assess the added value of uncertainty quantification. Model training and evaluation were conducted in Python 3.11.5, while statistical analysis and pairwise comparisons were performed in R 4.4.2.
Figure 1. Summary of the methodological framework integrating supervised machine learning with conformal prediction.
2.1. Study population
The study adopted an observational, cross-sectional design and used data from a tertiary-level healthcare institution in an urban area of the Metropolitan Region of Santiago, Chile. Data collection was conducted through a probabilistic random sampling procedure implemented by trained healthcare professionals across multiple hospital wards to ensure the representativeness of the inpatient population. The sample size was calculated at a 95% confidence level, with a 5% margin of error and an anticipated attrition rate of 20%, yielding an initial recruitment target of approximately 300 patients. Following data cleaning procedures, specifically the exclusion of atypical observations, incomplete records, and duplicated entries, the final analytical sample comprised 245 patients.
Data were collected using standardized forms, and information was extracted from patients’ electronic health records. The instruments were specifically designed to capture comprehensive information on both patient-specific characteristics and variables associated with the risk and development of PIs. The data collection framework was organized into four main domains: (i) sociodemographic characteristics, (ii) clinical parameters, (iii) laboratory and diagnostic test requests documented by attending physicians, and (iv) PI-related variables encompassing occurrence, stage, and anatomical location.
PIs were classified in accordance with the criteria established by the National Pressure Injury Advisory Panel (NPIAP). The assessment followed a dichotomous classification approach. Patients exhibiting no evidence of skin or tissue damage consistent with the NPIAP definition, including those with intact skin displaying only transient, blanchable erythema, were categorized as negative for PI. Conversely, patients were classified as positive when at least one lesion met the diagnostic criteria for a PI, irrespective of its severity or stage (Stages 1-4 or suspected deep tissue injury). Accordingly, the presence of a single lesion at any anatomical site was sufficient for a positive PI classification. 48
2.2. Data cleansing and formatting
The data wrangling and preprocessing stage involved transforming unstructured nursing datasets into a format suitable for analysis. Relevant nursing documentation was extracted from the Hospital Information System (HIS) for each patient record. From these source data, key clinical concepts associated with the development of PIs were identified. This process was conducted in close collaboration with expert clinicians to ensure that text-based information was systematically converted into standardized and structured variables.
Since HIS records contain multiple entries per patient, the preprocessing stage applied a comprehensive aggregation and conflict-resolution strategy. For this purpose, time-window aggregation was used to group entries within clinically meaningful periods. For categorical data, conflicts in entries were resolved by selecting the most critical or informative values. For frequency-based features, values were aggregated to reflect the intensity or frequency of relevant events.
2.3. Feature selection
Feature selection was performed using the Chi-square test for categorical variables and the t-test for numerical features. The Chi-square test assesses whether the observed frequency distribution of two categorical variables deviates significantly from the expected distribution under the assumption of independence. The t-test identifies features with significant mean differences between outcome groups, supporting their inclusion in the modeling stage. This procedure ensured that only variables with meaningful statistical relationships to PI development were included in the training stage, thereby reducing dimensionality, improving model interpretability, and enhancing computational efficiency. 49
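As an illustration of this screening step for a categorical feature, the sketch below computes the Pearson Chi-square statistic for a contingency table and, for the 2 × 2 case only, its p-value. The counts and function names are hypothetical, no continuity correction is applied, and the statistical analysis in this study was run in R, so this is an illustration rather than the code used in the study.

```python
import math

def chi_square_statistic(table):
    """Pearson Chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in cell (i, j) if feature and outcome were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail p-value for 1 degree of freedom (valid for 2 x 2 tables only)."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts: rows = feature present / absent, columns = PI / no PI
table = [[20, 30], [10, 40]]
stat = chi_square_statistic(table)
p_value = p_value_df1(stat)
keep_feature = p_value < 0.05  # retain the feature at the 5% level
```

Tables with more than two rows or columns need the Chi-square distribution with (r − 1)(c − 1) degrees of freedom, which a statistical package provides.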
2.4. Feature encoding
Since the dataset comprised categorical variables, a feature encoding step was required to transform qualitative attributes into numerical representations suitable for ML algorithms. Two complementary methods were applied: One-Hot Encoding (OHE) and Target Encoding (TE).
OHE converts categorical variables into a numeric format without imposing any order or magnitude on the categories. For a variable with K categories, OHE creates K − 1 new binary features, each indicating whether the observation belongs to a specific category; one category is dropped and serves as the reference level, which avoids redundant (perfectly collinear) columns. Each retained category therefore gets its own slot, and each observation activates at most one of them.
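A minimal sketch of this K − 1 encoding is given below. The ward labels, the function names, and the choice of the alphabetically first level as the dropped reference are assumptions made for illustration only.

```python
def fit_one_hot(values):
    """Learn the category levels from training data and drop the first one as reference."""
    levels = sorted(set(values))
    return levels[1:]  # K - 1 retained levels

def transform_one_hot(values, kept_levels):
    """Map each value to K - 1 binary indicators; the reference level maps to all zeros."""
    return [[1 if v == level else 0 for level in kept_levels] for v in values]

# Hypothetical ward labels
wards = ["surgery", "icu", "medicine", "icu"]
kept = fit_one_hot(wards)                 # "icu" becomes the reference level
encoded = transform_one_hot(wards, kept)
```

In this sketch, a level that was not seen during fitting is also encoded as all zeros, which makes it indistinguishable from the reference level; a production pipeline would normally handle that case explicitly.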
Another process for handling categorical features is TE. To understand TE, consider a feature space of the form $x = \{x_1, \dots, x_p\}$ and a target variable $y_i \in \{0, 1\}$ for binary classification. The TE method transforms a categorical variable $x_j \in \{c_1, \dots, c_k\}$ into numerical values based on the conditional expectation of the target $y$. 50 Its application is shown in (1).
As a result, each category is replaced by its empirical mean. To prevent overfitting, a regularization method is applied, as shown in (2).
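For reference, one common way to write the two expressions referred to as (1) and (2) is sketched below. Here $n_c$ is the number of training observations in category $c$, $\bar{y}$ is the overall mean of the target, and $m \ge 0$ is a smoothing weight. The additive-smoothing form of (2) and the symbols $n_c$, $\bar{y}$, and $m$ are assumptions; other regularization schemes exist and the one used in the study may differ.

```latex
% (1) Plain target encoding: empirical mean of y within category c of feature x_j
\[
\mathrm{TE}(c) = \frac{1}{n_c} \sum_{i \,:\, x_{ij} = c} y_i \;\approx\; \mathbb{E}[\, y \mid x_j = c \,]
\]
% (2) Smoothed target encoding (assumed form): shrink rare categories toward the global mean
\[
\mathrm{TE}_m(c) = \frac{n_c \, \mathrm{TE}(c) + m \, \bar{y}}{n_c + m}
\]
```

With $m = 0$ the smoothed version reduces to the plain category mean, and larger $m$ pulls small categories more strongly toward $\bar{y}$.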
The key distinction between OHE and TE lies in their impact on dimensionality. OHE expands the feature space by generating one binary column per category, which may aggravate the curse of dimensionality. Conversely, TE encodes each category into a single continuous value, making it particularly suitable for high-cardinality categorical variables while retaining predictive information. 50
For both methods, the encoding was conducted in an out-of-sample manner to avoid data leakage. Specifically, for each resampling iteration, encoding parameters were estimated using the training subset and subsequently applied to the corresponding testing subset, ensuring that no information from validation outcomes influenced feature construction. In the case of TE, category-specific statistics derived from the outcome variable were computed only from training observations, thereby preventing any validation observation from contributing to its own encoded value. In the case of OHE, the encoder was fitted on the training data to define the set of categorical levels, and the resulting mapping was applied unchanged to the validation data. This fold-wise preprocessing strategy guarantees separation between training and validation information, yielding unbiased performance estimates and mitigating optimistic bias arising from data leakage. 51
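The fold-wise logic can be sketched as follows for TE: the mapping is fitted on training rows only and then applied unchanged to validation rows. The smoothing weight `m`, the fallback to the global mean for unseen categories, the function names, and the toy data are assumptions for illustration, not details reported in this study.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, m=5.0):
    """Estimate smoothed per-category means of a binary target from training rows only."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    # Shrink each category mean toward the global mean (assumed smoothing scheme)
    mapping = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}
    return mapping, global_mean

def apply_target_encoding(categories, mapping, global_mean):
    """Encode rows with the fitted mapping; unseen categories fall back to the global mean."""
    return [mapping.get(c, global_mean) for c in categories]

# Hypothetical toy fold: training rows and validation rows of one categorical feature
train_cat = ["a", "a", "b", "b", "b", "c"]
train_y = [1, 0, 0, 0, 1, 1]
valid_cat = ["a", "c", "d"]

mapping, prior = fit_target_encoding(train_cat, train_y, m=2.0)
valid_encoded = apply_target_encoding(valid_cat, mapping, prior)
```

Because the validation outcomes never enter `fit_target_encoding`, no validation observation contributes to its own encoded value.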
2.5. Cross-validation
Cross-validation (CV) is a resampling technique used in ML and Statistics to evaluate a model’s performance, robustness, and generalization on unseen data. It helps prevent overfitting and ensures that the model’s evaluation is not dependent on randomness due to a specific setting. 52 The basic idea is to repeatedly split the dataset into training and validation sets and compute performance metrics for each split. The results are averaged to obtain a more stable estimation of the performance metrics.
In this paper, a k-fold CV scheme was adopted. The dataset was divided into k approximately equal folds while preserving the distribution of the target variable (stratified CV). For each iteration, k − 1 folds were used for training, and the remaining fold served as the validation set. This process was repeated k times, ensuring that each observation was used once for validation and k − 1 times for training. The averaged results across folds provided reliable estimates of the models’ predictive performance for the PI classification task, along with the results of the CP framework embedded within this approach.
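A minimal sketch of a stratified k-fold split is shown below. The round-robin assignment within each class, the function name, and the toy label vector are assumptions; library implementations differ in detail.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, valid_idx) pairs that keep class proportions similar across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin within each class
    for f in range(k):
        valid = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, valid

# Hypothetical sample of 100 patients with a 27% event rate
labels = [1] * 27 + [0] * 73
splits = list(stratified_kfold(labels, k=5))
```

Each observation appears in exactly one validation fold and in the training set of the remaining k − 1 iterations.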
2.6. Machine learning models
A total of seven ML models for binary classification were evaluated during the analysis. The models included tree-based algorithms: Decision Tree (DT), Random Forest (RF), and Extreme Gradient Boosting (XGB), the latter based on sequential learning; a logit model: Logistic Regression (LR); and margin-based models: Support Vector Machines for classification (SVC) with different kernel types, particularly SVC with a linear kernel, SVC with a polynomial kernel, and SVC with a Radial Basis Function (RBF) kernel. The selected models reflect different learning paradigms suitable for handling both linear and nonlinear patterns present in clinical data.
Although ensemble learning can combine heterogeneous models, in this study, we focused on a tree-based approach, specifically on RF. This selection is mainly due to the complexity of heterogeneous learning, which arises from the combination of different paradigms within an optimization framework. Furthermore, mixing heterogeneous models also increases the risk of overfitting when the sample size is limited, since there is insufficient data to train multiple models and perform CV reliably. 53
2.7. Hyperparameter optimization
Table 1. Set of hyperparameters chosen for the optimization process conducted in the cross-validation setting.
Hyperparameter ranges shown in Table 1 were selected to cover plausible model complexity in a clinical context, aiming to reduce the risk of overfitting in a modest sample and to capture nonlinear interactions. Furthermore, the ranges keep the search computationally feasible within the nested cross-validation procedure.
The hyperparameter search was an exhaustive grid search over all combinations of parameters within the defined set of candidate values. The choice of the specific values was based on common practices in applied ML. For the DT model, given the sample size, we used a bounded, theory-consistent search space that spans from strongly regularized trees to moderately flexible trees. For LR, regularization was addressed with two penalties ({l1, l2}), using a penalization parameter whose range covers both heavily and lightly regularized regimes. For RF and XGB, moderate-complexity models were tested, enforcing strong regularization controls. The number of estimators (trees) was capped at 100 because RF performance typically stabilizes beyond moderate ensemble sizes, especially in small datasets. For XGB, hyperparameters were selected to balance predictive flexibility with strong regularization, which is critical in small datasets: tree depth, learning rate, and the number of boosting rounds jointly control model complexity and overfitting. Finally, for SVC, overly flexible kernels can easily overfit, so the grid explores different functional forms and controls complexity through regularization parameters.
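The exhaustive search can be sketched as an enumeration of the Cartesian product of candidate values. The grid values below are hypothetical placeholders (apart from the cap of 100 trees stated in the text), and `toy_cv_score` is a hypothetical stand-in for the mean cross-validated metric of a fitted model.

```python
from itertools import product

def grid(param_grid):
    """Yield every combination of candidate values as a dict."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[key] for key in keys)):
        yield dict(zip(keys, values))

def exhaustive_search(param_grid, score_fn):
    """Return the combination with the highest score, evaluating all candidates."""
    best_params, best_score = None, float("-inf")
    for params in grid(param_grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space for a boosted-tree model
xgb_grid = {
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
}

def toy_cv_score(params):
    """Hypothetical placeholder for the mean cross-validated score of one configuration."""
    return (-abs(params["max_depth"] - 3)
            - 10 * abs(params["learning_rate"] - 0.1)
            + params["n_estimators"] / 1000)

n_candidates = len(list(grid(xgb_grid)))
best_params, best_score = exhaustive_search(xgb_grid, toy_cv_score)
```

The number of candidates grows multiplicatively with each added hyperparameter, which is why bounded ranges keep the search feasible in small samples.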
The class imbalance issue was addressed by assigning class weights inversely proportional to each category’s frequency. This approach prevents the models from biasing towards the majority class and improves their ability to classify the minority class. 55 This approach was preferred over resampling methods primarily because of the limited sample size. In small datasets, resampling (e.g., SMOTE, undersampling, or oversampling) may increase the risk of overfitting by introducing unrealistic patterns that are not present in the actual data. 56
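A minimal sketch of inverse-frequency class weighting is shown below, using the common heuristic n_samples / (n_classes × class_count). Whether the study used exactly this formula is an assumption, and the label vector is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

# Hypothetical sample of 100 patients with a 27% event rate
labels = [1] * 27 + [0] * 73
weights = balanced_class_weights(labels)
```

Under this heuristic the minority class receives the larger weight, and the total weight carried by each class is equal, so neither class dominates the loss.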
2.8. Performance metrics for binary classification
Confusion matrix for a binary classification task.
From the confusion matrix, a set of metrics is computed to measure binary classification performance. Specifically, Accuracy, Precision, and F1-Score are computed directly from the labels predicted by the classification models. The Area Under the Curve (AUC) is calculated from the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR). 57 In addition, the Brier score is computed to evaluate the predicted probabilities themselves, whereas accuracy and precision only consider the final classification label. Higher values are better for Accuracy, Precision, F1-Score, and AUC, whereas lower values are better for the Brier score. 58 Finally, because each classification metric is computed across k folds, pairwise comparison methods are used to test for significant differences in model performance across the various metrics.
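The metrics above can be sketched in a few lines. The function names, the 0.5 decision threshold, and the toy labels and probabilities are assumptions for illustration; the AUC is computed through its rank interpretation (the probability that a random positive is scored above a random negative), which is equivalent to the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from the confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}

def brier_score(y_true, prob_pos):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, prob_pos)) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of PI
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
prob = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in prob]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
brier = brier_score(y_true, prob)
auc = roc_auc(y_true, prob)
```

Note that the label-based metrics change with the threshold, while the Brier score and AUC are computed from the probabilities and do not depend on it.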
2.9. Statistical comparison of classification performance
The classification metrics are calculated across folds for various models. For this setting, Friedman’s test is employed, as performance metrics are paired within folds, providing evidence of significant differences in performance across models. Once the Friedman test is conducted and significant differences are identified, the Wilcoxon signed-rank post hoc test is used to determine which models exhibit significant differences in classification performance across folds. This test is suitable as it handles the paired structure in the data while controlling for the family-wise error rate. 59
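To make the paired structure concrete, the sketch below computes the Friedman statistic from a folds × models table of scores by ranking the models within each fold. The scores and function names are hypothetical, no tie correction is applied to the statistic, and the p-value (a Chi-square tail with k − 1 degrees of freedom) is left to a statistical package; the tests reported in this study were run in R.

```python
def average_ranks(row):
    """Rank the values in one fold (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def friedman_statistic(scores):
    """Friedman statistic; scores[f][m] is the metric of model m on fold f."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for m, r in enumerate(average_ranks(row)):
            rank_sums[m] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Hypothetical AUC values: 4 folds (rows) x 3 models (columns)
scores = [
    [0.80, 0.75, 0.70],
    [0.82, 0.78, 0.71],
    [0.79, 0.74, 0.72],
    [0.81, 0.77, 0.69],
]
stat = friedman_statistic(scores)
```

In this toy table the models are ordered identically in every fold, so the statistic reaches its maximum value of n(k − 1).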
2.10. Conformal prediction framework
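To understand the basis of CP, consider the input-output space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and a dataset of the form $z_1, z_2, \dots, z_n$, where the pairs $z_i = (x_i, y_i)$ are assumed to be exchangeable. This latter assumption is a weaker form of the i.i.d. assumption used in statistical models. Given a new input $x_{n+1}$ and a significance level $\alpha \in (0, 1)$, the goal is to construct a prediction set $\Gamma_n(x_{n+1}) \subseteq \mathcal{Y}$ such that:
$$P\big(y_{n+1} \in \Gamma_n(x_{n+1})\big) \ge 1 - \alpha.$$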
This setting can be applied to any ML algorithm. Furthermore, no additional parameterization is required to select the non-conformity measure. As stated above, a key advantage of CP is the validity of predictions provided by its application, considering the assumptions of exchangeability and randomness.60,61
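A minimal sketch of a split (inductive) conformal classifier for the binary case is given below. The nonconformity measure (one minus the predicted probability of the candidate label), the finite-sample quantile rule, the function names, and the toy calibration data are assumptions for illustration; they are one common choice, not necessarily the configuration used in this study.

```python
import math

def nonconformity(prob_pos, label):
    """One minus the predicted probability of the given label (assumed measure)."""
    return 1.0 - prob_pos if label == 1 else prob_pos

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Finite-sample quantile of calibration scores for a target coverage of 1 - alpha."""
    scores = sorted(nonconformity(p, y) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every label is kept
    return scores[rank - 1]

def prediction_set(prob_pos, threshold):
    """Keep every label whose nonconformity score does not exceed the threshold."""
    return [label for label in (0, 1) if nonconformity(prob_pos, label) <= threshold]

# Hypothetical calibration data: predicted probability of PI and observed label
cal_probs = [0.9, 0.8, 0.7, 0.35, 0.3, 0.2, 0.1, 0.6, 0.15]
cal_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
confident_high = prediction_set(0.95, threshold)  # only label 1 is kept
ambiguous = prediction_set(0.50, threshold)       # both labels are kept
confident_low = prediction_set(0.05, threshold)   # only label 0 is kept
```

Sets containing both labels correspond to the ambiguous cases that would be routed to manual reassessment, while single-label sets correspond to confident predictions.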
2.11. Conformal prediction metrics
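To contrast the ML metrics and provide an objective measure of the CP results, the coverage and efficiency metrics are employed. Let $n$ be the number of test samples, let $y_i$ be the true label for the $i$-th test point, and let $\Gamma(x_i)$ be the corresponding prediction set. Coverage is the fraction of test points whose true label falls inside the prediction set, and efficiency is the average prediction set size:
$$\text{Coverage} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{y_i \in \Gamma(x_i)\} \tag{7}$$
$$\text{Efficiency} = \frac{1}{n} \sum_{i=1}^{n} |\Gamma(x_i)| \tag{8}$$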
It is worth mentioning that the metrics shown in (7) and (8), along with the CP framework, benefit from high-performing models. In particular, selecting the best-performing models yields the following properties. 42
• Narrower prediction sets: A high-performing model assigns higher scores to the correct class and lower scores to others, allowing CP to generate more informative sets.
• Better adaptability: A weak model tends to give large sets to each instance, losing the nuance that CP can provide.
• Reduction of universal sets: If the model is extremely poor, the CP framework yields instances where the prediction set is empty or includes every possible label.
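The two CP metrics can be computed directly from a list of prediction sets and the true labels, as in the minimal sketch below. The function names and the toy sets are hypothetical.

```python
def coverage(pred_sets, y_true):
    """Fraction of test points whose true label is inside the prediction set."""
    return sum(1 for s, y in zip(pred_sets, y_true) if y in s) / len(y_true)

def efficiency(pred_sets):
    """Average number of labels per prediction set (closer to 1 is more informative)."""
    return sum(len(s) for s in pred_sets) / len(pred_sets)

# Hypothetical prediction sets for eight test patients and their true labels
pred_sets = [[1], [0, 1], [0], [0], [1], [0, 1], [], [0]]
y_true = [1, 0, 0, 1, 1, 1, 0, 0]

cov = coverage(pred_sets, y_true)
eff = efficiency(pred_sets)
```

An empty set always counts as a miss for coverage, and a set with both labels always counts as a hit, which is why the two metrics need to be read together.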
The methodological framework was implemented in two independent stages to avoid information leakage between model optimization and uncertainty quantification. In the first stage, multiple ML models were trained and tuned using a k-fold cross-validation setting to identify the best-performing models and their optimal configurations. Model selection was based on predictive performance from out-of-fold validation. In the second stage, CP was applied as a post-hoc uncertainty quantification framework. The hyperparameters selected in the first stage were fixed, and the selected models were retrained within a new k-fold cross-validation procedure. Out-of-fold predictions generated in this stage were used to compute CP sets and evaluate the coverage and efficiency metrics shown in (7) and (8).
3. Results
3.1. Occurrence of pressure injuries
The total incidence of PIs was approximately 27% among patients. About 20% of the injuries occurred in the Surgery ward, followed by the MQ IND and the critical patient wards, with 18% and 17%, respectively. About 62% of the patients were male. The average age of the patients was 59 years. Females were older, with an average age of 65 years, while males had an average age of 57 years. In terms of risk, about 39% of patients were categorized as high-risk, 22% as medium-risk, and 33% as low-risk, all during their admission to the health center units.
3.2. Univariate analysis of features for PI prediction
Comparison of demographic, clinical, and care-related characteristics between patients with PI and without PI (NPI). Continuous variables are expressed as mean ± standard deviation, and categorical variables as counts. Statistics and p-values correspond to Chi-square test results for categorical data and t-test results for numerical variables.
3.3. Model performance
Figure 2 compares how the choice of encoding technique impacts model performance across different algorithms. The effect of encoding is model-dependent. Some algorithms benefit from TE, while others perform better with OHE. The DT model shows slightly higher variability with OHE across all metrics. TE yields more stable results, as highlighted by the AUC and F1-score metrics. The LR model shows similar results across encoding methods. Since this model is linear, OHE often works better, as it preserves feature independence. The RF model performs better with OHE across all metrics, particularly in precision and accuracy. The performance of the SVC (linear and polynomial) fluctuates, but TE tends to yield more consistent scores, especially in AUC and F1-score. The SVC RBF displays high variability with both encodings but shows slightly better median metrics with TE. The XGB consistently achieves the highest performance with TE, thanks to the compact numerical representations that TE provides.
Figure 2. Performance of various ML models in the classification task under one-hot encoding and target encoding.
Overall, the AUC and accuracy tend to be higher with TE for nonlinear models (SVCs, XGB). The RF with OHE shows outstanding precision. In terms of the F1-score, the models achieve a more balanced performance under TE, indicating a better tradeoff between precision and recall. In summary, TE offers advantages for complex nonlinear models, possibly by reducing dimensionality and introducing numerical features. Details of the improvements in the hyperparameter optimization process with cross-validation are shown in Table S11 in the Supplementary Material.
The results of the pairwise comparisons depend on the feature encoding methods used during fitting. The Accuracy shows significant differences with both encoding methods (p-values
Pairwise comparisons using the Wilcoxon signed-rank test reveal significant differences, primarily for the RF and XGB models. The RF shows significantly better accuracy than DT, LR, SVC Linear, and SVC RBF, using both OHE and TE methods (p-values
Summary of average performance metrics for the RF and XGB models per encoding method. Values in parentheses denote the standard deviation within the cross-validation framework.
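The paired comparisons reported in this subsection can be illustrated with a small, self-contained sketch of an exact two-sided Wilcoxon signed-rank test on per-fold scores. The fold accuracies are hypothetical and given in integer percentage points to avoid floating-point ties; the number of folds, the function names, and the exact (enumeration-based) p-value are assumptions for illustration, not the authors' implementation.

```python
from itertools import product


def signed_ranks(diffs):
    """Drop zero differences and rank |d|, using average ranks for ties."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]])):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return nonzero, ranks


def wilcoxon_exact(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Enumerates all 2^n sign assignments, so it is only meant for the
    small n typical of cross-validation folds.
    """
    nonzero, ranks = signed_ranks([x - y for x, y in zip(a, b)])
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    expected = sum(ranks) / 2
    observed_dev = abs(w_plus - expected)
    extreme = 0
    total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for s, r in zip(signs, ranks) if s)
        total += 1
        if abs(w - expected) >= observed_dev - 1e-12:
            extreme += 1
    return w_plus, extreme / total


# Hypothetical per-fold accuracies (percentage points) for two models.
rf_acc = [86, 84, 88, 85, 87, 83, 89, 86]
dt_acc = [80, 82, 81, 83, 78, 84, 80, 79]

w_stat, p_value = wilcoxon_exact(rf_acc, dt_acc)
```

Because the same folds are used for both models, the test operates on within-fold differences, which is what makes the comparison paired.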
3.4. Reliability of proposed models
Different confidence levels are tested, as shown in Figure 3. The upper panel shows the target coverage as a red dashed line; the closer the empirical curve is to this line, the better the calibration of the conformal predictor. The blue (XGB) and green (RF) curves show the empirical coverage achieved by each model. For some values of α, the models fall slightly short of the target coverage. The lower panel shows the average prediction set size for different values of α. As α increases, confidence decreases and the prediction sets become smaller. XGB tends to yield slightly smaller prediction sets than RF for the same α, that is, fewer predicted labels on average, without compromising coverage.
Empirical coverage and average prediction set size of conformal predictors applied to XGB and RF across different significance levels (α).
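The two quantities discussed here, empirical coverage and average set size, can be sketched for a split conformal classifier as follows. The nonconformity score (one minus the predicted probability of the true class), the toy probabilities, and all function names are assumptions chosen for illustration; this section does not state which score or CP variant the authors used.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Threshold from calibration scores s_i = 1 - p_i(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank; the small tolerance guards against
    # floating-point round-up in (n + 1) * (1 - alpha).
    k = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All labels whose nonconformity score does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= qhat]


def coverage_and_size(test_probs, test_labels, qhat):
    """Empirical coverage and average prediction set size on held-out data."""
    sets = [prediction_set(p, qhat) for p in test_probs]
    coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    avg_size = sum(len(s) for s in sets) / len(sets)
    return coverage, avg_size


# Hypothetical class probabilities [P(no PI), P(PI)] from any classifier.
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8],
             [0.7, 0.3], [0.4, 0.6], [0.85, 0.15], [0.25, 0.75]]
cal_labels = [0, 0, 1, 1, 1, 0, 0, 0, 1]
test_probs = [[0.95, 0.05], [0.55, 0.45], [0.1, 0.9], [0.5, 0.5], [0.7, 0.3]]
test_labels = [0, 1, 1, 0, 1]

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
coverage, avg_size = coverage_and_size(test_probs, test_labels, qhat)
```

Repeating the last two lines over a grid of α values yields curves of the kind plotted in Figure 3: a smaller α raises the threshold, which enlarges the sets and increases coverage.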
The choice of α involves a trade-off between reliability and efficiency. PI prevention has asymmetric costs, since missed high-risk cases lead to increased hospitalizations and serious harm, so the CP framework must achieve high but practical coverage with actionable outputs. If α is too small, prediction sets become large and the model becomes uninformative; the choice of α is therefore analogous to selecting clinical risk cutoffs. Figure 4 shows the coverage and set size for various α values, and their behavior is model-dependent. Regarding coverage, both models yield similar results; however, for α ≥ 0.10, the RF model shows greater variability than XGB. In terms of efficiency, the RF model shows larger set sizes for α values above 0.10. For this setting, α = 0.10 targets a nominal coverage of 90% and represents a pragmatic compromise: it retains the reliability guarantee while keeping prediction sets sufficiently specific to support actionable clinical decisions. Furthermore, this choice aligns with the goal of uncertainty-aware decision support rather than fully automated classification.
Detailed results of coverage and set size for various α values in the conformal prediction setting.
Empirical coverage slightly above the nominal level is expected in CP frameworks because of the discrete nature of nonconformity scores and finite-sample effects, which lead to conservative prediction sets. In practice, mild over-coverage is preferable to under-coverage in clinical decision-support contexts, as it ensures that the true outcome is included in the prediction set more frequently than the nominal guarantee requires.
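For reference, the standard split conformal guarantee makes this conservativeness explicit. The statement below is the textbook result for a calibration set of size n with scores s_1, ..., s_n under exchangeability; it is offered as a sketch, since this section does not specify the exact CP variant used.

```latex
\[
  \hat{q} \;=\; \text{the } \lceil (n+1)(1-\alpha) \rceil\text{-th smallest of } s_1,\dots,s_n,
  \qquad
  C(x) \;=\; \{\, y : s(x,y) \le \hat{q} \,\}
\]
\[
  1-\alpha \;\le\; \mathbb{P}\bigl( Y_{n+1} \in C(X_{n+1}) \bigr) \;\le\; 1-\alpha+\frac{1}{n+1}
\]
```

The lower bound always holds under exchangeability. The upper bound requires the scores to be almost surely distinct, so ties among discrete scores can push coverage further above the nominal level, and the empirical coverage measured on a finite test set fluctuates around these values.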
Furthermore, at α = 0.10, XGB achieves an average set size of 1.34. In a binary classification setting, and provided that no empty sets are produced, this implies that the model produces a single, confident prediction in the majority of cases (around 66%) and assigns two-label prediction sets only when model uncertainty is high (around 34%).
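The arithmetic behind these proportions can be written out explicitly. Here π_k denotes the fraction of patients whose prediction set contains k labels; the symbols are introduced only for this illustration.

```latex
\[
  \pi_0 + \pi_1 + \pi_2 = 1,
  \qquad
  0\cdot\pi_0 + 1\cdot\pi_1 + 2\cdot\pi_2 = 1.34
\]
\[
  \pi_0 = 0 \;\Longrightarrow\; \pi_2 = 1.34 - 1 = 0.34, \quad \pi_1 = 1 - 0.34 = 0.66
\]
\[
  \text{in general:}\quad \pi_2 = 0.34 + \pi_0, \qquad \pi_1 = 0.66 - 2\,\pi_0
\]
```

The 66% and 34% figures therefore hold exactly only when no empty sets occur; any empty sets would raise the share of two-label sets above 34%.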
Figure 5 breaks down the CP results in terms of coverage and set size at α = 10%. The CP framework maintained coverage close to the nominal level across all patient risk groups. Low-risk patients exhibit the highest efficiency, with prediction sets that almost always contain a single label, indicating high confidence in model predictions. In contrast, medium- and high-risk patients produce larger prediction sets, reflecting increased uncertainty in the classification task. Across models, RF tends to produce larger prediction sets, whereas XGB yields more compact ones.
Coverage and set size for patients at low, medium, and high risk in the conformal prediction framework with the significance level fixed at α = 10%.
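A subgroup breakdown of this kind amounts to computing coverage and mean set size separately within each risk stratum. The sketch below assumes that prediction sets have already been produced by a conformal predictor; the group labels, toy outcomes, and function name are hypothetical.

```python
from collections import defaultdict


def stratified_cp_metrics(groups, labels, pred_sets):
    """Coverage and mean prediction set size within each subgroup."""
    counts = defaultdict(int)
    hits = defaultdict(int)
    sizes = defaultdict(int)
    for g, y, s in zip(groups, labels, pred_sets):
        counts[g] += 1
        hits[g] += int(y in s)   # 1 if the true label is inside the set
        sizes[g] += len(s)
    return {
        g: {
            "n": counts[g],
            "coverage": hits[g] / counts[g],
            "avg_set_size": sizes[g] / counts[g],
        }
        for g in counts
    }


# Hypothetical example: risk group, true label (1 = PI), and prediction set.
risk_group = ["low", "low", "medium", "medium", "high", "high"]
true_label = [0, 0, 0, 1, 1, 1]
pred_sets = [[0], [0], [0, 1], [1], [0, 1], [0]]

by_group = stratified_cp_metrics(risk_group, true_label, pred_sets)
```

Note that the marginal CP guarantee applies to the population as a whole, so per-group coverage estimated this way is descriptive and can deviate from the nominal level in small strata.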
4. Discussion
This study shows the potential of integrating CP into ML frameworks for predicting PIs. The findings highlight that conventional ML models, such as RF and XGB, can effectively identify patterns associated with the risk of developing PIs. The models tested in this study outperformed baseline methods, specifically unsupervised explorations and scales such as Braden. The literature on the effectiveness of such scales reports results of about 70% in terms of AUC and predictive performance,62,63 whereas the ML models used in this analysis reached an AUC near 90% and predictive performance higher than 80%. Furthermore, the addition of CP introduces a crucial layer of uncertainty quantification, enhancing the reliability, interpretability, and clinical utility of the predictions.
Our results suggest that CP-enabled models can provide valid confidence intervals or prediction sets, enabling clinicians to make better-informed decisions. In contrast to traditional models that output point predictions, conformal predictors communicate the degree of confidence associated with each prediction, allowing clinical teams to identify when predictions are uncertain and require additional patient assessment. Uncertainty-aware approaches are increasingly recognized as an important component of reliable medical AI systems. 64 This property is particularly valuable in clinical contexts where missed high-risk cases can lead to severe complications, extended hospitalization, or mortality. 65
From an operational standpoint, incorporating CP could improve clinical workflow efficiency. High-confidence predictions can trigger preventive measures (e.g., repositioning schedules and advanced support surfaces), while uncertain predictions can prompt manual reassessments. This stratified approach can optimize resource allocation and potentially reduce the economic burden of hospital-acquired PIs, which remain substantial worldwide.66,67
From a clinical workflow perspective, prediction sets produced by CP may support differentiated decision pathways. When the model produces a confident prediction indicating a high risk of PI development, the patient could be automatically flagged for preventive interventions such as repositioning protocols or pressure-relieving surfaces. In contrast, when the prediction set reflects higher uncertainty, additional clinical assessment by nursing staff may be required before implementing targeted interventions. In this way, uncertainty-aware predictions help allocate clinical attention where algorithmic confidence is lower, supporting safer human-in-the-loop decision-making. In practice, uncertainty-aware predictions could be communicated to clinicians through simple categorical outputs derived from the prediction sets. Singleton prediction sets indicate confident risk classification, whereas larger prediction sets reflect higher uncertainty and may prompt enhanced clinical evaluation.
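The mapping from prediction sets to simple categorical outputs described above could look like the following sketch. The category wording, the label coding (1 = PI, 0 = no PI), and the handling of empty sets are illustrative assumptions, not a validated clinical protocol.

```python
def triage_message(pred_set, positive_label=1):
    """Translate a conformal prediction set into a categorical output.

    The messages are hypothetical placeholders for whatever wording a
    clinical team would agree on.
    """
    labels = set(pred_set)
    if len(labels) == 1:
        if positive_label in labels:
            return "confident: high PI risk, flag for preventive interventions"
        return "confident: low PI risk, continue standard care"
    if not labels:
        # An empty set means no label met the threshold (atypical case).
        return "atypical case: refer for clinical assessment"
    return "uncertain: request additional nursing assessment"


# Hypothetical prediction sets for four patients.
messages = [triage_message(s) for s in ([1], [0], [0, 1], [])]
```

Keeping the output to a handful of categories mirrors the singleton versus larger-set distinction in the text, and leaves the final decision with the clinician whenever the set is not a singleton.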
Nevertheless, several limitations warrant attention. The quality of the calibration set critically influences CP performance. Non-representative calibration data may introduce errors in the prediction regions. Additionally, conformal methods introduce modest computational overhead, which may affect real-time deployment in large-scale monitoring systems. 68 Future research should examine adaptive and online CP variants that update the framework dynamically as new data becomes available. Finally, although the current findings are promising, prospective validation and multi-center studies are needed to confirm generalizability across diverse patient cohorts and care settings.
Regarding computational complexity, the proposed framework relies on ensemble learning models such as RF and XGB, whose computational cost mainly depends on the number of trees and dataset size. Given the relatively small number of predictors and observations in this paper, model training and parameter specification were computationally inexpensive. The CP stage adds only a calibration step based on nonconformity scores, incurring minimal additional computational cost.
5. Clinical interpretation
The integration of CP into ML frameworks for PI prediction represents a significant advancement in clinical risk assessment. Traditional predictive tools rely on subjective criteria and often fail to account for the complex, dynamic interactions among physiological, environmental, and procedural factors that contribute to PI development. The study shows that CP improved model calibration and provided statistically valid uncertainty measures without compromising predictive performance; that is, the framework maintains high predictive accuracy while also producing valid confidence estimates. These calibrated uncertainty estimates enhance clinicians’ ability to interpret model outputs, enabling more transparent and trustworthy diagnostic support for patients at varying risk levels.39,69
From a clinical standpoint, the ability to distinguish high-certainty from uncertain predictions offers an operational advantage in healthcare settings. When models indicate high confidence in risk classification, preventive interventions such as repositioning schedules, advanced mattress use, or early mobilization can be implemented immediately. Conversely, uncertain predictions can trigger additional clinical evaluations or monitoring, thereby optimizing the allocation of limited resources and reducing the likelihood of both false alarms and the oversight of high-risk patients. 70 Such uncertainty-aware predictions contribute to personalized prevention strategies, potentially lowering the incidence of hospital-acquired PIs and associated morbidity, mortality, and economic burden. 71
6. Limitations
Although the methodology and results were promising, several limitations should be considered when interpreting the findings. First, the analysis was conducted at a single tertiary-level healthcare institution with a cross-sectional design, which may limit the generalizability of the results to other clinical settings with different patient typologies, care protocols, or documentation practices. Consequently, the present work should be interpreted primarily as a methodological proof-of-concept rather than a directly deployable clinical prediction tool. External validation using multicenter datasets is necessary to assess the robustness and transportability of both the ML models and the CP framework.
Second, the sample size (n = 245) is relatively modest, which limits the applicability of high-cardinality approaches. Although cross-validation and class weighting were implemented to mitigate overfitting, the limited number of PIs may still compromise the stability of model estimates and reduce the precision of uncertainty quantification under the CP approach. This limitation reduces efficiency, leading to larger prediction sets or more frequent ambiguous outputs. Importantly, this behavior reflects increased uncertainty rather than model failure: CP retains its theoretical coverage guarantees even in small samples. Nevertheless, a larger sample would improve the generalizability of the results and enhance uncertainty quantification with respect to efficiency, stability, and clinical usefulness.
Third, the variability in nursing records, potential inconsistencies in clinical annotations, and missing or incomplete data could introduce bias during preprocessing. Although feature engineering methods such as aggregation, conflict-resolution rules, and regularized target encoding were applied, residual measurement error cannot be entirely excluded.
Fourth, a theoretical requirement of CP is the assumption of exchangeability between calibration and future observations, a weaker condition than the i.i.d. assumption. In real clinical settings, this assumption may be challenged by temporal changes in clinical practice, heterogeneous patient populations, or differences in documentation protocols across institutions. In our study, the exchangeability assumption is plausible because the data originate from a single healthcare center, and the CP framework is applied within cross-validation splits derived from the same underlying data-generating process. Nevertheless, potential issues may arise when applying the framework in different institutions or evolving clinical settings. 72 Future research should evaluate adaptive or online CP approaches capable of maintaining validity under distributional shifts and heterogeneous clinical environments.
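As one illustration of what an online variant might look like, the sketch below implements the update rule of adaptive conformal inference (Gibbs and Candès), in which the working significance level is nudged after every observed outcome. This is a technique from the CP literature and was not evaluated in this study; the step size gamma and the stream of coverage outcomes are hypothetical.

```python
def aci_update(alpha_t, target_alpha, covered, gamma=0.05):
    """One adaptive conformal inference step.

    alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t), where err_t is 1
    when the true label fell outside the prediction set and 0 otherwise.
    Values outside [0, 1] are conventionally read as "predict every label"
    (alpha <= 0) or "predict the empty set" (alpha >= 1).
    """
    err_t = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err_t)


def run_aci(covered_stream, target_alpha=0.10, gamma=0.05):
    """Track the working alpha over a stream of coverage outcomes."""
    alphas = [target_alpha]
    for covered in covered_stream:
        alphas.append(aci_update(alphas[-1], target_alpha, covered, gamma))
    return alphas


# Hypothetical stream: whether each new patient's outcome was covered.
# In a real deployment each outcome would depend on the set built at alpha_t.
trajectory = run_aci([True, True, False, True])
```

A miss lowers the working α, which widens subsequent prediction sets, while a run of covered cases raises it gradually; this feedback is what allows long-run coverage to track the target when the data distribution drifts.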
Fifth, from a computational perspective, the proposed framework is lightweight and feasible for practical implementation. The models evaluated rely on ensemble tree methods, whose computational costs primarily depend on the number of observations and the number of estimators. Given the modest dataset size and the bounded hyperparameter space used in the optimization process, model training was computationally inexpensive and completed on a standard workstation. The CP stage introduces only a calibration stage based on nonconformity scores and quantile calculations, adding negligible computational overhead. In deployment scenarios, model training and calibration can be performed offline, while real-time prediction for new patients only requires evaluating the model and applying the conformal threshold, operations that can be executed in milliseconds.39,40,73 However, while computational demands were manageable, large-scale deployment across hospital-wide monitoring systems may require optimized pipelines or approximate CP methods to maintain adequate real-time performance.
Future research may complement the quantitative evaluation presented in this study with qualitative assessments involving clinicians to explore how uncertainty-aware predictions influence decision-making and workflow integration in real clinical environments.
7. Conclusions
This study highlights the potential value of integrating ML techniques with CP to enhance the prediction and prevention of PIs in healthcare settings. Traditional ML models have shown promise in identifying patient risk, but their limited ability to express uncertainty impedes clinical adoption. By integrating CP, predictive models can generate statistically valid confidence estimates, allowing clinicians to interpret model outputs with greater transparency and trust.
The incorporation of CP adds a critical layer of uncertainty quantification, allowing predictions to be accompanied by statistically grounded confidence estimates. This capability is particularly valuable in the context of PI prevention, where misclassifications can have severe consequences. By providing calibrated prediction sets, CP enables clinicians to distinguish between high-certainty and uncertain predictions, facilitating more informed, risk-adaptive decision-making.
From a broader perspective, integrating CP contributes to the development of safe and reliable AI-driven clinical decision support systems. It enhances the transparency of model outputs, supports compliance with ethical and regulatory frameworks, and strengthens clinicians’ confidence in algorithmic recommendations. Moreover, CP offers adaptability across diverse clinical settings, addressing data heterogeneity and improving the generalization of predictive models.
However, realizing the full potential of CP in clinical workflows will require further validation across larger, multi-institutional datasets and prospective studies. Future research should evaluate the framework in prospective, multi-center settings, assessing its impact on clinical outcomes, such as the incidence of PI, and on the interaction between clinicians and uncertainty-aware predictions. Further analysis should also focus on adaptive conformal prediction methods that maintain validity under evolving data distributions. Additionally, future work should explore the integration of CP models with Electronic Health Record systems to enable real-time decision support and assess the economic impact and cost-effectiveness of CP-based interventions to reduce the prevalence and cost of PIs.
CP represents a promising methodological step toward more uncertainty-aware, data-driven wound care. By providing reliable, uncertainty-aware risk predictions, CP bridges the gap between algorithmic accuracy and clinical applicability, offering a structured methodological framework to support the future development of safer and more effective PI prevention strategies.
Supplemental material
Supplemental material for Enhancing clinical reliability in pressure injury prediction: A conformal prediction approach with machine learning models by Fredy Barriga-Gallegos, Gonzalo Ríos-Vásquez, Hanns de la Fuente-Mella, Karen Ulloa Catalán and Naldy Febré Vergara in Digital Health.
Footnotes
Ethical considerations
This study used anonymized secondary data originally collected as part of routine clinical care. No personally identifiable information was accessed, and all records were de-identified prior to analysis. Formal authorization to access and use the data for research purposes was granted by the corresponding tertiary-level healthcare institution. The study protocol was reviewed and approved by the Scientific Ethics Committee of Nursing (CECENF), Faculty of Nursing, Universidad Andrés Bello (Approval No. L1CECENF 12 2021; approved on April 30, 2021). In accordance with institutional and national ethical guidelines, the requirement for individual informed consent was waived because the study involved only secondary, non-identifiable data and did not involve direct patient interaction. All procedures were conducted in accordance with the principles of the Declaration of Helsinki.
Authors contributions
Conceptualization: F.B.-G., G.R.-V., H.d.l.F.-M., N.F.; Methodology: F.B.-G., G.R.-V., H.d.l.F.-M.; Software: F.B.-G., G.R.-V.; Validation: F.B.-G., G.R.-V.; Formal Analysis: F.B.-G., G.R.-V.; Investigation: F.B.-G., G.R.-V., H.d.l.F.-M., N.F.; Resources: K.U.C.; Data Curation: F.B.-G., G.R.-V.; Writing-original draft: F.B.-G., G.R.-V., H.d.l.F.-M.; Writing-review and editing: F.B.-G., G.R.-V., H.d.l.F.-M., N.F.; Visualization: F.B.-G., G.R.-V.; Supervision: N.F. All authors have read and agreed to the published version of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research work of Gonzalo Ríos-Vásquez is partially supported by the National Agency for Research and Development (ANID). Scholarship program, Subdirectorate of Human Capital - National Doctorate 2024 - code 21240875.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Copyright
Copyright © 2016 SAGE Publications Ltd, 1 Oliver’s Yard, 55 City Road, London, EC1Y 1SP, UK. All rights reserved.
Supplemental material
Supplemental material for this article is available online.
