Enhancing clinical reliability in pressure injury prediction: A conformal prediction approach with machine learning models

Abstract

Background

Pressure injuries (PIs) remain a major global healthcare challenge, increasing morbidity and costs. While Machine Learning (ML) models effectively predict PI risk, their lack of uncertainty quantification limits clinical trust. Conformal Prediction (CP) addresses this issue by providing statistically valid confidence estimates, enhancing model transparency and reliability in clinical practice.

Objective

This study aims to improve the reliability and clinical applicability of PI prediction models by integrating CP with traditional ML algorithms, enabling uncertainty-aware predictions for safer and more transparent clinical decisions.

Methods

A methodological study was conducted to evaluate ML classifiers for PI risk classification using routinely collected clinical data and alternative data-processing strategies. Model performance was assessed through paired statistical comparisons across resampling folds. Models showing superior performance were subsequently calibrated using a CP framework to generate uncertainty-aware prediction sets, which were evaluated using coverage and efficiency metrics across different confidence levels.

Results

The overall incidence of PIs was 27%, with significant predictors including hospital ward, risk level, mattress use, containment, dermatitis, and age (p < 0.05). Among all models, XGBoost (XGB) and Random Forest (RF) showed the best predictive performance. After calibration with CP, both produced well-aligned uncertainty estimates. At α = 0.10, XGB achieved a coverage of 0.949 and an efficiency of 1.34, outperforming RF with tighter, more informative prediction sets: about 5% of cases fell outside the prediction set, and the average prediction set size was closer to 1.0 than for the RF model. This framework enhanced model interpretability in clinical settings without compromising accuracy.

Conclusions

Integrating CP into ML models may improve the interpretability and reliability of risk predictions by quantifying uncertainty. Although the findings are promising, they should be interpreted with caution given the modest sample size, event rate, single-center design, and potential variability in clinical documentation. This framework provides a foundation for uncertainty-aware decision support in PI prevention.

Keywords

pressure ulcers, machine learning, conformal prediction, risk prevention, uncertainty quantification

1. Introduction

Pressure Injuries (PIs), also known as pressure ulcers, pressure sores, bedsores, or decubitus ulcers, are localized damage to the skin and underlying tissues caused by prolonged pressure, often in combination with friction or shear.¹ Their primary mechanism involves the interruption of blood flow, which restricts the supply of oxygen and nutrients, ultimately leading to tissue ischemia, cell death, and ulceration.² PIs are widely recognized as indicators of care quality within healthcare institutions and are associated with substantial negative impacts on patients’ health-related quality of life.³ Furthermore, they represent a significant economic and organizational burden on public and private healthcare systems.⁴ Common risk factors include prolonged immobility, shear forces, friction, moisture, diabetes, vascular disease, nutritional deficiencies, sensory impairments, incontinence, and overall high levels of care dependency.^5–7

Globally, the prevalence of PIs remains a significant concern. The overall prevalence among hospitalized patients is approximately 13%, with 62% of cases being hospital-acquired.⁸ In intensive care units (ICUs), prevalence reaches around 60%, with 53% being hospital-acquired and 23% being device-related.⁹ Long-term care facilities report prevalence rates near 20%, while nursing homes report rates close to 12%.^10,11 In the United States alone, an estimated 2.5 million individuals develop a PI annually during acute care.¹² Severe PIs are often classified as rare but largely preventable adverse events, commonly linked to avoidable medical errors.¹³

The economic burden of PIs is significant. In the United States, each hospital-acquired PI incurs an additional cost of approximately USD 10,700 per patient, contributing to an estimated annual total of USD 26.8 billion.¹⁴ In some health systems, PIs account for up to 7% of total expenditures.¹⁵ Moreover, patients with PIs have longer hospital stays and higher resource utilization.¹⁶

Risk assessment tools, particularly the Braden Scale,¹⁷ remain the standard approach to identifying at-risk patients across hospitals, nursing homes, and general care facilities.⁷ However, these tools rely on subjective clinical evaluation and may not fully capture the complexity of patient conditions, leading to a potential underestimation of risk.⁶ Consequently, there is a need for complementary, data-driven approaches that support early detection and guide targeted preventive interventions. In particular, predictive models capable of identifying high-risk patients while explicitly communicating prediction uncertainty may help clinicians prioritize preventive actions and improve patient safety. Preventive strategies, such as systematic repositioning and pressure-relieving surfaces, have proven effective in reducing PI incidence and severity.^18,19

In this context, advances in computational learning have yielded promising results that address the limitations of current methods in PI assessment. In particular, the wide range of models within the Machine Learning (ML) framework is a natural next step for PI risk assessment.²⁰ However, traditional ML models output a single probability and require an arbitrary decision threshold, so they do not convey how reliable each prediction is. In nursing practice, decisions are rarely binary: they are made in terms of confidence and uncertainty, and borderline cases are often escalated for additional assessment.

Conformal Prediction (CP), when integrated into ML models, identifies cases for which the model is uncertain, preventing over-reliance on automation. For PI prevention, this is critical, given that patient conditions change rapidly, data quality varies, and clinical heterogeneity is high. CP acknowledges uncertainty and, rather than hiding it, expands prediction sets as uncertainty increases. Furthermore, CP naturally creates a complementary triage system: patients with low or high predicted risk can receive care according to their risk level, while ambiguous cases receive manual reassessment, mirroring real workflows in health systems.

Given the multi-factorial and dynamic nature of PI development, predictive models must account not only for complex interactions among clinical variables but also for the uncertainty inherent in patient care environments. In clinical practice, risk assessments rarely rely on deterministic predictions. Instead, healthcare professionals continuously evaluate degrees of certainty when prioritizing preventive interventions and allocating resources. In this context, CP aligns naturally with clinical reasoning by distinguishing between confident and uncertain predictions, thereby enabling safer human decision processes. Rather than forcing binary risk classifications, CP allows predictive systems to explicitly communicate when model outputs should be interpreted with caution, supporting clinicians in identifying cases that may require additional monitoring or reassessment.

One of the primary barriers to the adoption of computational models in health settings is the lack of transparent uncertainty estimates accompanying model predictions, which can reduce clinicians’ trust in automated systems and limit their usefulness in decision-support workflows.²¹ Many existing ML approaches provide only point predictions or probability scores, leaving healthcare professionals without clear guidance regarding the reliability of those predictions under varying clinical conditions. Consequently, there is a critical need for predictive frameworks that combine strong predictive performance with interpretable uncertainty quantification. In response to this gap, the present study aims to enhance the reliability and practical applicability of PI prediction models by integrating CP into an ML framework, an area where uncertainty-aware prediction remains unexplored. By providing statistically valid measures of predictive uncertainty, this approach seeks to improve the transparency, interpretability, and clinical usability of ML-based risk prediction tools, supporting safer and more informed preventive strategies in PI management.

1.1. Machine learning for pressure injury prediction

ML methods have become increasingly relevant in healthcare due to their ability to model complex, nonlinear relationships and support both diagnostic and prognostic tasks across diverse clinical applications.^22–24

Evidence across multiple domains, including mental health,²⁵ multimodal precision health,²⁶ oncology,²⁷ and healthcare management,²⁸ demonstrates the potential of ML to enhance decision-making.

For PI prediction, ML methods have been successfully applied using diverse data sources. Recent advances have shown promising results in early PI prediction, capturing complex relationships among clinical features, and have outperformed traditional risk assessment tools.^29–31 Studies have used electronic health records (EHRs) to model PI development in hospital units.^32–34 Other approaches leverage unstructured clinical notes through Natural Language Processing,³⁵ or integrate more complex representations such as graphical models, Bayesian networks, or deep learning models analyzing imaging data.^36–38

These ML models often outperform traditional tools by better capturing interactions among clinical variables and providing early identification of high-risk patients. However, most existing approaches provide only point predictions, while clinical data are inherently noisy and uncertain. Clinicians therefore need reliable confidence measures to support safe decision-making for their patients.

Following the TRIPOD reporting principles, ML models were developed using baseline clinical variables to evaluate uncertainty-aware prediction sets under different modeling configurations. The integration of CP provides calibrated uncertainty estimates, supporting methodological evaluation of model reliability and robustness under limited sample sizes and potential data variability.³⁹

1.2. Conformal prediction

CP provides a principled framework for quantifying uncertainty around individual predictions. It generates prediction sets with formal coverage guarantees, making the output easy to interpret and grounded in statistical validity.⁴⁰ CP can be applied to any underlying ML model without requiring assumptions about the data distribution or additional model parameterization beyond choosing a confidence level.

Applications of CP span multiple fields, including biometrics, facial recognition, biochemistry, finance, and healthcare. In medical contexts, CP has supported tasks such as disease progression prediction under uncertainty,⁴¹ reliable genomic medicine,⁴² depression prediction,⁴³ lung cancer detection,⁴⁴ coronary artery disease diagnosis using multi-objective methods,⁴⁵ and data cleaning in biomedical sciences.⁴⁶

The standard CP methodology involves three steps: selecting a nonconformity measure, training an ML model, and computing nonconformity scores for the calibration and test sets. These steps enable the generation of valid and efficient uncertainty-aware predictions.³⁹

The clinical characteristics of PI development make this problem especially well-suited for uncertainty-aware modeling. PIs are rare, multifactorial, and highly sensitive to variations in mobility, hemodynamic stability, moisture exposure, device-related pressure, and the quality of preventive care. Small, clinically plausible changes in these factors can substantially modify a patient’s accurate risk profile. As a result, two patients may receive the same predicted probability while differing markedly in the reliability of that estimate. CP directly addresses this challenge by producing statistically calibrated prediction sets that indicate when the model is confident and when predictions should be interpreted with caution, for instance, during atypical patient presentations, incomplete records, or shifts in clinical practice. Studies in other high-stakes clinical domains demonstrate that CP can effectively flag uncertain or unreliable predictions before they propagate into clinical decision-making, thereby enhancing trustworthiness and reducing diagnostic risk.⁴⁷ This property aligns closely with the realities of PI prevention, where false negatives may result in avoidable harm, and false positives increase the preventive workload for nursing teams already operating under high demand. By signaling the degree of uncertainty for each patient-level prediction, CP enables more transparent, risk-aware decision support, thereby strengthening the clinical utility of ML-based PI prediction systems.

2. Methods

The methodological framework applied in this paper is summarized in Figure 1. The process consists of four main stages. First, a data preprocessing stage was conducted, which included formatting the dataset and selecting relevant features based on their statistical properties and clinical plausibility. Second, a model-fitting stage was performed using a set of traditional ML algorithms. Third, a statistical comparison of the models was conducted. Finally, CP was applied to the selected models. This step allowed for the construction of prediction sets with formal coverage guarantees. Performance metrics derived from CP, such as validity and efficiency, were then calculated and compared against traditional point-prediction measures to assess the added value of uncertainty quantification. Model training and evaluation were conducted in Python 3.11.5, while statistical analysis and pairwise comparisons were performed in R 4.4.2.

Figure 1.

Summary of the methodological framework integrating supervised machine learning with conformal prediction.

2.1. Study population

The study adopted an observational, cross-sectional design and used data from a tertiary-level healthcare institution in an urban area of the Metropolitan Region of Santiago, Chile. Data collection was conducted through a probabilistic random sampling procedure implemented by trained healthcare professionals across multiple hospital wards to ensure the representativeness of the inpatient population. The sample size was calculated at a 95% confidence level, with a 5% margin of error and an anticipated attrition rate of 20%, yielding an initial recruitment target of approximately 300 patients. Following data cleaning procedures, specifically the exclusion of atypical observations, incomplete records, and duplicated entries, the final analytical sample comprised 245 patients.

Data were collected using standardized forms, and information was extracted from patients’ electronic health records. The instruments were specifically designed to capture comprehensive information on both patient-specific characteristics and variables associated with the risk and development of PIs. The data collection framework was organized into four main domains: (i) sociodemographic characteristics, (ii) clinical parameters, (iii) laboratory and diagnostic test requests documented by attending physicians, and (iv) PI-related variables encompassing occurrence, stage, and anatomical location.

PIs were classified in accordance with the criteria established by the National Pressure Injury Advisory Panel (NPIAP). The assessment followed a dichotomous classification approach. Patients exhibiting no evidence of skin or tissue damage consistent with the NPIAP definition, including those with intact skin displaying only transient, blanchable erythema, were categorized as negative for PI. Conversely, patients were classified as positive when at least one lesion met the diagnostic criteria for a PI, irrespective of its severity or stage (Stages 1-4 or suspected deep tissue injury). Accordingly, the presence of a single lesion at any anatomical site was sufficient for a positive PI classification.⁴⁸

2.2. Data cleansing and formatting

The data wrangling and preprocessing stage involved transforming unstructured nursing datasets into a format suitable for analysis. Relevant nursing documentation was extracted from the Hospital Information System (HIS) for each patient record. From these source data, key clinical concepts associated with the development of PIs were identified. This process was conducted in close collaboration with expert clinicians to ensure that text-based information was systematically converted into standardized and structured variables.

Since HIS records contain multiple entries per patient, the preprocessing stage applied a comprehensive aggregation and conflict-resolution strategy. For this purpose, time-window aggregation was used to group entries within clinically meaningful periods. For categorical data, conflicts in entries were resolved by selecting the most critical or informative values. For frequency-based features, values were aggregated to reflect the intensity or frequency of relevant events.

2.3. Feature selection

Feature selection was performed using the Chi-square test for categorical variables and the t-test for numerical features. The Chi-square test assesses whether the observed frequency distribution of two categorical variables deviates significantly from the expected distribution under the assumption of independence. The t-test identifies features with significant mean differences between outcome groups, supporting their inclusion in the modeling stage. This procedure ensured that only variables with meaningful statistical relationships to PI development were included in the training stage, thereby reducing dimensionality, improving model interpretability, and enhancing computational efficiency.⁴⁹
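As an illustration of the Chi-square screening step, the following minimal sketch computes the Pearson statistic and its p-value for a 2 × 2 contingency table (one binary feature against the binary PI outcome). The function name `chi_square_2x2` and the counts are hypothetical and are not taken from the study data; no continuity correction is applied, which is an assumption of this sketch.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    `table` is [[a, b], [c, d]] with observed counts. No continuity
    correction is applied. Returns (statistic, p_value) for 1 degree
    of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows = feature present/absent, columns = PI yes/no.
statistic, p_value = chi_square_2x2([[20, 10], [10, 20]])
keep_feature = p_value < 0.05
```

Under this sketch, a feature would be retained when its p-value falls below the chosen significance threshold.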

2.4. Feature encoding

Since the dataset comprised categorical variables, a feature encoding step was required to transform qualitative attributes into numerical representations suitable for ML algorithms. Two complementary methods were applied: One-Hot Encoding (OHE) and Target Encoding (TE).

OHE converts categorical variables into a numeric format without imposing any order or magnitude on the categories. For a variable with K categories, OHE creates K − 1 new binary features, each indicating whether the observation belongs to a specific category; one category is dropped as the reference level to avoid redundant (perfectly collinear) information among the encoded features. Each retained category therefore gets its own indicator, and each observation activates at most one of them (none for the reference category).
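The K − 1 encoding can be sketched in a few lines of plain Python. The helper `one_hot_drop_first` and the ward labels are hypothetical; dropping the first category in sorted order as the reference level is an arbitrary choice made for this example.

```python
def one_hot_drop_first(values):
    """Encode a categorical column as K - 1 binary indicator columns.

    The first category in sorted order is used as the dropped
    reference level (an arbitrary choice for this sketch).
    """
    levels = sorted(set(values))
    kept = levels[1:]  # drop the reference level
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Hypothetical ward labels, for illustration only.
wards = ["ICU", "Medicine", "Surgery", "ICU"]
columns, encoded = one_hot_drop_first(wards)
# columns lists the K - 1 retained categories; rows for the reference
# category ("ICU") contain only zeros.
```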

Another approach for handling categorical features is TE. Consider a feature vector x = (x₁, …, x_p) and a target variable y_i ∈ {0, 1} for binary classification. The TE method transforms a categorical variable x_j with categories {c₁, …, c_K} into numerical values based on the conditional expectation of the target y.⁵⁰ Its definition is shown in (1), where C_i denotes the category observed for the i-th sample.

\mathrm{TE}(c_k) = P(y_i = 1 \mid C_i = c_k) = \frac{\sum_{i=1}^{n} 1_{(C_i = c_k)} \, y_i}{\sum_{i=1}^{n} 1_{(C_i = c_k)}}

(1)

As a result, each category is replaced by the empirical mean of the target within that category. To prevent overfitting, a regularization method is applied, as shown in (2).

\mathrm{TE}_{\mathrm{regularized}}(c_k) = \frac{n_{c_k} \bar{y}^{c_k} + α \bar{y}}{n_{c_k} + α}

(2)

where n_{c_k} is the count of samples in category c_k, \bar{y}^{c_k} is the mean target within group c_k, \bar{y} is the global mean, and α > 0 is a smoothing parameter.
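Equations (1) and (2) can be sketched together in plain Python. The function `target_encode` and the toy data are hypothetical; setting `alpha=0` recovers the unregularized estimate in (1), while `alpha > 0` shrinks each category mean toward the global mean as in (2).

```python
from collections import defaultdict


def target_encode(categories, targets, alpha=0.0):
    """Return {category: encoded value} following equations (1) and (2)."""
    counts = defaultdict(int)     # n_{c_k}
    positives = defaultdict(int)  # n_{c_k} times the category mean
    for c, y in zip(categories, targets):
        counts[c] += 1
        positives[c] += y
    global_mean = sum(targets) / len(targets)
    return {
        c: (positives[c] + alpha * global_mean) / (counts[c] + alpha)
        for c in counts
    }


# Toy data, for illustration only.
cats = ["A", "A", "A", "B"]
y = [1, 0, 0, 1]
plain = target_encode(cats, y)               # equation (1)
smoothed = target_encode(cats, y, alpha=2)   # equation (2)
```

In this toy example, category "B" has a single observation, so smoothing pulls its encoded value from 1.0 toward the global mean of 0.5.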

The key distinction between OHE and TE lies in their impact on dimensionality. OHE expands the feature space by generating a binary column for nearly every category, which may aggravate the curse of dimensionality. Conversely, TE maps each categorical variable to a single continuous feature, making it particularly suitable for high-cardinality categorical variables while retaining predictive information.⁵⁰

For both methods, the encoding is conducted in an out-of-sample manner to avoid data leakage. Specifically, for each resampling iteration, encoding parameters were estimated using the training subset and subsequently applied to the corresponding testing subset, ensuring that no information from validation outcomes influenced feature construction. In the case of TE, category-specific statistics derived from the outcome variable were computed only from training observations, thereby preventing any observation from contributing to its own encoded value. In the case of OHE, it was fitted on the training data to define the set of categorical levels, and the resulting information was applied unchanged to the validation data. This fold-wise preprocessing strategy guarantees separation between training and validation information, yielding unbiased performance estimates and mitigating optimistic bias arising from data leakage.⁵¹
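The fold-wise logic described above can be sketched as a fit/apply pair: statistics are learned from the training subset only and then applied unchanged to the validation subset. The helper names and the fallback to the training global mean for categories unseen during training are assumptions of this sketch, not details reported for the study.

```python
def fit_target_encoder(train_categories, train_targets, alpha=1.0):
    """Learn smoothed category means from the training fold only."""
    global_mean = sum(train_targets) / len(train_targets)
    sums, counts = {}, {}
    for c, y in zip(train_categories, train_targets):
        sums[c] = sums.get(c, 0) + y
        counts[c] = counts.get(c, 0) + 1
    mapping = {
        c: (sums[c] + alpha * global_mean) / (counts[c] + alpha)
        for c in counts
    }
    return mapping, global_mean


def apply_target_encoder(categories, mapping, fallback):
    """Encode a fold with a fitted mapping; unseen categories get `fallback`."""
    return [mapping.get(c, fallback) for c in categories]


# Toy training and validation folds, for illustration only.
mapping, train_mean = fit_target_encoder(["A", "A", "B", "B"], [1, 0, 1, 1])
encoded_val = apply_target_encoder(["A", "C"], mapping, fallback=train_mean)
```

Because the validation outcomes never enter `fit_target_encoder`, no validation observation contributes to its own encoded value.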

2.5. Cross-validation

Cross-validation (CV) is a resampling technique used in ML and Statistics to evaluate a model’s performance, robustness, and generalization on unseen data. It helps prevent overfitting and ensures that the model’s evaluation is not dependent on randomness due to a specific setting.⁵² The basic idea is to repeatedly split the dataset into training and validation sets and compute performance metrics for each split. The results are averaged to obtain a more stable estimation of the performance metrics.

In this paper, a k-fold CV scheme was adopted. The dataset was divided into k approximately equal folds while preserving the distribution of the target variable (stratified CV). For each iteration, k − 1 folds were used for training, and the remaining fold served as the validation set. This process was repeated k times, ensuring that each observation was used once for validation and k − 1 times for training. The averaged results across folds provided reliable estimates of the models’ predictive performance for the PI classification task, along with the results of the CP framework embedded within this approach.
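A stratified k-fold split can be sketched without external libraries by dealing the indices of each class across folds in round-robin order. The function `stratified_k_fold` is a hypothetical, deterministic simplification; in practice the indices would typically be shuffled with a fixed seed first.

```python
def stratified_k_fold(labels, k):
    """Yield (train_idx, val_idx) pairs that roughly preserve class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Deal the indices of each class round-robin across the k folds.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(
            j for f, fold in enumerate(folds) if f != i for j in fold
        )
        yield train_idx, val_idx


# Toy labels with a 25% event rate, for illustration only.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
splits = list(stratified_k_fold(labels, k=3))
```

Each observation appears in exactly one validation fold and in k − 1 training folds, matching the scheme described above.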

2.6. Machine learning models

A total of seven ML models for binary classification were evaluated during the analysis. The models included tree-based algorithms: Decision Tree (DT), Random Forest (RF), and Extreme Gradient Boosting (XGB), the latter based on sequential (boosted) learning; a logit model: Logistic Regression (LR); and margin-based models: Support Vector Machines for classification (SVC) with different kernel types, namely SVC with a linear kernel, SVC with a polynomial kernel, and SVC with a Radial Basis Function (RBF) kernel. The selection of these models reflects different learning paradigms suitable for handling both linear and nonlinear patterns present in clinical data.

Although ensemble learning can combine heterogeneous models, in this study, we focused on a tree-based approach, specifically on RF. This selection is mainly due to the complexity of heterogeneous learning, which arises from the combination of different paradigms within an optimization framework. Furthermore, mixing heterogeneous models also increases the risk of overfitting when the sample size is limited, since there is insufficient data to train multiple models and perform CV reliably.⁵³

2.7. Hyperparameter optimization

To ensure optimal model performance and fair comparisons between the classifiers, a grid search hyperparameter optimization process was performed for each model. This process was embedded within the CV loop. The sets of hyperparameters were chosen according to standard best practices and systematically explored to minimize the risk of underfitting or overfitting. Table 1 shows the ranges of parameters explored for each classification algorithm. This optimization step is essential, as model performance is influenced not only by data quality but also by the appropriate specification of the model structure. This process enhances the robustness, generalization, and clinical utility of the predictive model.⁵⁴

Table 1.

Set of hyperparameters chosen for the optimization process conducted in the cross-validation setting.

Model	Hyperparameter	Set of values
DT	Splitting criterion	{Gini, entropy, log}
	Splitter	{Best, random}
	Maximum depth	[1,20]
	Size of samples in leaf	[2,20]
	Size of samples to split	[1,20]
	Number of features	{Sq. root, log}
	Class weights	{Standard, balanced}
LR	Penalty	{l₁, l₂}
	Penalization size	[1,100]
	Class weights	{Standard, balanced}
RF	Trees	[1,100]
	Splitting criterion	{Gini, entropy, log}
	Maximum depth	[1,20]
	Size of samples to split	[2,20]
	Size of samples per leaf	[1,20]
	Number of features	{Sq. root, log}
	Class weights	{Standard, balanced}
XGB	Number of trees	[10,100]
	Maximum depth	[1,20]
	Learning rate	[0.01, 1.0]
	Subsample	[0.2, 1.0]
	Sample by tree	[0.1, 1.0]
	Class weights	{Standard, balanced}
SVC	Kernel	{Linear, Poly, RBF}
	Control parameter	[0.1, 10]
	Degree	{2,3,4}
	Gamma	[0.1, 10]
	Class weights	{Standard, balanced}

Hyperparameter ranges shown in Table 1 were selected to cover plausible model complexity in a clinical context, aiming to reduce the risk of overfitting in a modest sample and to capture nonlinear interactions. Furthermore, the ranges keep the search computationally feasible within the nested cross-validation procedure.

The hyperparameter search is exhaustive, evaluating all combinations of parameters within the defined sets of candidate values. The choice of specific values was based on common practice in applied ML. For the DT model, given the sample size, we used a bounded, theory-consistent search space that spans from strongly regularized trees to moderately flexible trees. For LR, regularization is addressed with two penalties ({l₁, l₂}), using a penalization parameter range that covers both heavily and lightly regularized regimes. For RF and XGB, models of moderate complexity were tested, enforcing strong regularization controls. The number of estimators (trees) was capped at 100 because RF performance typically stabilizes beyond moderate ensemble sizes, especially in small datasets. For XGB, hyperparameters were selected to balance predictive flexibility with strong regularization, which is critical in small datasets: tree depth, learning rate, and the number of boosting rounds jointly control model complexity and overfitting. Finally, for SVC, overly flexible kernels can easily overfit, so the grid explores different functional forms while controlling complexity through the regularization parameters.
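To make the notion of an exhaustive search concrete, the sketch below enumerates every combination of a small grid. The helper `expand_grid`, the parameter names, and the candidate values are hypothetical and cover only an illustrative subset of a decision-tree grid, not the grid reported in Table 1.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the candidate values in `grid`."""
    names = list(grid)
    return [
        dict(zip(names, combo))
        for combo in product(*(grid[name] for name in names))
    ]


# Hypothetical subset of a decision-tree grid; values are illustrative only.
dt_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 5, 10],
    "class_weight": [None, "balanced"],
}
candidates = expand_grid(dt_grid)
n_candidates = len(candidates)  # 2 * 3 * 2 combinations
```

The number of candidates grows multiplicatively with each added parameter, which is one reason bounded ranges keep the search feasible inside a cross-validation loop.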

The class imbalance issue was addressed by assigning class weights inversely proportional to each category’s frequency. This approach prevents the models from biasing towards the majority class and improves their ability to classify the minority class.⁵⁵ Class weighting was preferred over resampling methods primarily because of the limited sample size. In small datasets, resampling (e.g., SMOTE or other oversampling and undersampling methods) may increase the risk of overfitting, for instance by introducing synthetic patterns that are not present in the actual data.⁵⁶
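A common way to obtain "balanced" class weights is to make them inversely proportional to class frequencies, using n_samples / (n_classes × class count), which is the heuristic behind scikit-learn's `class_weight="balanced"` option. The sketch below assumes this formula; the label vector is hypothetical and only mirrors a 27% event rate.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


# Hypothetical labels with 27 positives out of 100, for illustration only.
labels = [1] * 27 + [0] * 73
weights = balanced_class_weights(labels)
```

With these weights, each class contributes the same total weight to the loss, so the minority class is not dominated by the majority class during training.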

2.8. Performance metrics for binary classification

The performance metrics for the classification task are based on the confusion matrix generated from the actual and predicted values. This tool provides a tabular summary that compares predicted and true class labels. Its structure is shown in Table 2.

Table 2.

Confusion matrix for a binary classification task.

	Predicted positive	Predicted negative
Positive	True Positive (TP)	False Negative (FN)
Negative	False Positive (FP)	True Negative (TN)

From the confusion matrix, a set of metrics is computed to measure binary classification performance. Specifically, the Accuracy, Precision, and F1-Score are computed directly from the labels predicted by the classification models. In addition, the Area Under the Curve (AUC) is calculated from the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR).⁵⁷ The Brier score is also computed to evaluate the quality of the predicted probabilities, whereas accuracy and precision only consider the final classification label. For Accuracy, Precision, F1-Score, and AUC, higher values are better, whereas for the Brier score, lower values are better.⁵⁸ Finally, as each classification metric is computed across k folds, pairwise comparison methods are used to test for significant differences in model performance across the various metrics.
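The metrics above can be sketched directly from their definitions. The function names, the 0.5 decision threshold, and the toy labels and probabilities are assumptions for illustration; the AUC is computed through its pairwise-ranking interpretation (the share of positive/negative pairs ranked correctly, with ties counted as one half).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def roc_auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(y_true, y_prob, threshold=0.5):
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Brier score: mean squared gap between predicted probability and outcome.
    brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "f1": f1,
        "auc": roc_auc(y_true, y_prob),
        "brier": brier,
    }


# Toy labels and predicted probabilities, for illustration only.
metrics = classification_metrics([1, 0, 1, 0, 1], [0.9, 0.2, 0.4, 0.6, 0.8])
```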

2.9. Statistical comparison of classification performance

The classification metrics are calculated across folds for each model. Because these metrics are paired within folds, Friedman’s test is employed to assess whether performance differs significantly across models. When the Friedman test identifies significant differences, the Wilcoxon signed-rank test is used as a post hoc procedure to determine which pairs of models differ in classification performance across folds. This procedure is suitable as it handles the paired structure in the data while controlling for the family-wise error rate.⁵⁹
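For intuition, the Friedman statistic can be sketched from fold-wise scores: models are ranked within each fold, and the rank sums are compared. The functions below are a simplified illustration without tie correction, written in Python rather than the R environment used for the statistical analysis; the scores are hypothetical.

```python
def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def friedman_statistic(scores):
    """scores[f][m] is the metric of model m on fold f (no tie correction)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for m, r in enumerate(average_ranks(row)):
            rank_sums[m] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical AUC values: 4 folds (rows) by 3 models (columns).
fold_scores = [
    [0.70, 0.80, 0.90],
    [0.65, 0.75, 0.85],
    [0.72, 0.78, 0.88],
    [0.60, 0.70, 0.95],
]
q = friedman_statistic(fold_scores)
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom; here the third model ranks best in every fold, so the statistic reaches its maximum value n(k − 1).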

2.10. Conformal prediction framework

To understand the basis of CP, consider the input-output space Z = X × Y and a dataset z₁, z₂, …, z_n, where the examples z_i = (x_i, y_i) are assumed to be exchangeable. This latter assumption is a weaker form of the i.i.d. assumption used in statistical models. Given a new input x_{n+1}, the goal is to construct a prediction set Γ_n(x_{n+1}) ⊆ Y such that:

P(y_{n+1} \in Γ_n(x_{n+1})) \geq 1 - α

(3)

where α ∈ (0, 1) denotes a user-specified significance level. To build such a set, a nonconformity score function A : Z^n × Z → ℝ is defined. This function A measures how strange a new example z_i is relative to a dataset. Let \hat{p}(x) ∈ ℝ^K be the vector of predicted class probabilities, whose k-th entry is the estimated probability that a data point x belongs to class k ∈ {1, …, K}. The nonconformity score is given in (4): it is defined as one minus the predicted probability assigned by the classifier to the true class. Consequently, observations that receive high predicted probabilities for their observed class are considered highly conforming. This probability-based nonconformity measure is particularly suitable as it directly translates model confidence into clinically interpretable uncertainty estimates. Its use is based on the premise that model outputs are naturally interpreted as risk probabilities, which facilitates clearer decision support in the application context of this work.

α_i = 1 - \hat{p}(x_i)_{y_i}

(4)

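Then, for a new input x_{n+1} and each candidate label y, let α_{n+1}(y) = 1 − \hat{p}(x_{n+1})_y be the nonconformity score of that label and let m denote the number of calibration examples. The corresponding p-value is defined in (5),

\hat{p}_y = \frac{|\{ i : α_i \geq α_{n+1}(y) \}| + 1}{m + 1}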

(5)

The prediction set is then constructed as in (6).

Γ_n(x_{n+1}) = \{ y : \hat{p}_y \geq α \}

(6)

This setting can be applied to any ML algorithm. Furthermore, no additional parameterization is required to select the non-conformity measure. As stated above, a key advantage of CP is the validity of predictions provided by its application, considering the assumptions of exchangeability and randomness.^60,61
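The construction in (4) to (6) can be sketched in plain Python. The function names, the calibration scores, and the predicted probabilities below are hypothetical; the calibration scores stand for the values of (4) computed on held-out examples with known labels, and the set rule follows the comparison in (6).

```python
def conformal_p_value(calibration_scores, candidate_score):
    """Equation (5): share of calibration scores at least as nonconforming."""
    m = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= candidate_score)
    return (at_least + 1) / (m + 1)


def prediction_set(calibration_scores, class_probabilities, alpha):
    """Equation (6): keep every label whose p-value is at least alpha.

    `class_probabilities` maps each label to the model's predicted
    probability for the new input.
    """
    selected = set()
    for label, prob in class_probabilities.items():
        score = 1.0 - prob  # nonconformity of the candidate label, as in (4)
        if conformal_p_value(calibration_scores, score) >= alpha:
            selected.add(label)
    return selected


# Hypothetical calibration scores (1 - probability of the true class).
calibration = [0.03, 0.08, 0.13, 0.21, 0.27, 0.33, 0.41, 0.52, 0.71]

confident = prediction_set(calibration, {0: 0.9, 1: 0.1}, alpha=0.15)
ambiguous = prediction_set(calibration, {0: 0.6, 1: 0.4}, alpha=0.15)
```

In this toy example, the confident case yields a single-label set, while the ambiguous case yields a set with both labels, which is the kind of output that could be routed to manual reassessment.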

2.11. Conformal prediction metrics

To complement the ML metrics and provide an objective measure of the CP results, the coverage and efficiency metrics are employed. Let n be the number of test samples, let y_i be the true label for the i-th test point, and let Γ_i^α be the prediction set at the confidence level 1 − α for an input x_i. The coverage is calculated as shown in (7).

\mathrm{Coverage} = \frac{1}{n} \sum_{i = 1}^{n} \mathbb{1}(y_{i} \in Γ_{i}^{α})

(7)

where 1(⋅) is the indicator function, which equals one if the true label lies within the prediction set $Γ_{i}^{α}$ and zero otherwise. Coverage therefore measures the proportion of test points whose true label is contained in the prediction set. On its own, this metric can be misleading: a model whose prediction sets always contain all classes trivially achieves 100% coverage, since the true label is always within the prediction set. To address this issue, efficiency is employed. This metric is defined as the average size of the prediction sets, and its calculation is shown in (8).

\mathrm{Efficiency} = \frac{1}{n} \sum_{i = 1}^{n} | Γ_{i}^{α} |

(8)

where |⋅| denotes the cardinality of a set. The lower this metric, the better the model, denoting a more informative predictor.
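The two metrics in (7) and (8) can be computed directly from a list of true labels and a list of prediction sets. The snippet below is a minimal sketch with made-up labels and sets (1 for PI, 0 for no PI); the function names are illustrative and not taken from the source.

```python
def coverage(y_true, pred_sets):
    """Eq. (7): fraction of test points whose true label is in the set."""
    return sum(y in s for y, s in zip(y_true, pred_sets)) / len(y_true)


def efficiency(pred_sets):
    """Eq. (8): average number of labels per prediction set."""
    return sum(len(s) for s in pred_sets) / len(pred_sets)


# Illustrative values only.
y_true = [1, 0, 0, 1, 0]
pred_sets = [{1}, {0}, {0, 1}, {0}, {0}]

cov = coverage(y_true, pred_sets)
eff = efficiency(pred_sets)

# A predictor that always returns every label reaches full coverage
# but is uninformative, which is why efficiency is reported as well.
trivial_sets = [{0, 1}] * len(y_true)
trivial_cov = coverage(y_true, trivial_sets)
trivial_eff = efficiency(trivial_sets)
print(cov, eff, trivial_cov, trivial_eff)
```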

It is worth mentioning that the metrics shown in (7) and (8), along with the CP framework, benefit from high-performing models. Particularly, the selection of the models with the best performance yields the following properties⁴².

• Narrower prediction sets: A high-performing model assigns higher scores to the correct class and lower scores to others, allowing CP to generate more informative sets.

• Better adaptability: A weak model tends to give large sets to each instance, losing the nuance that CP can provide.

• Reduction of universal sets: If the model is extremely poor, the CP framework yields instances where the prediction set is empty or includes every possible label.

The methodological framework was implemented in two independent stages to avoid information leakage between model optimization and uncertainty quantification. In the first stage, multiple ML models were trained and tuned using k-fold cross-validation to identify the best-performing models and their optimal configurations. Model selection was based on predictive performance on out-of-fold validation data. In the second stage, CP was applied as a post-hoc uncertainty quantification framework. The hyperparameters selected in the first stage were fixed, and the selected models were retrained within a new k-fold cross-validation procedure. Out-of-fold predictions generated in this stage were used to compute CP sets and evaluate the coverage and efficiency metrics shown in (7) and (8).
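The second stage can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: a toy frequency model (`fit_rates`) stands in for the tuned RF or XGB model, the data are synthetic, and the way the non-test folds are split into a training half and a calibration half is an assumption, since the text does not specify how calibration scores were separated from training data.

```python
def fit_rates(rows):
    """Toy stand-in for a tuned classifier: smoothed P(y = 1 | category)."""
    counts = {}
    for x, y in rows:
        pos, tot = counts.get(x, (0, 0))
        counts[x] = (pos + y, tot + 1)
    return {x: (pos + 1) / (tot + 2) for x, (pos, tot) in counts.items()}


def predict_proba(model, x):
    p1 = model.get(x, 0.5)  # unseen category: uninformative probability
    return {0: 1.0 - p1, 1: p1}


def p_value(cal_scores, score):
    return (sum(a >= score for a in cal_scores) + 1) / (len(cal_scores) + 1)


def out_of_fold_sets(data, k, alpha):
    """Build one conformal prediction set per sample using k folds."""
    n = len(data)
    sets = [None] * n
    for fold in range(k):
        rest = [data[i] for i in range(n) if i % k != fold]
        train, calib = rest[0::2], rest[1::2]  # assumed train/calibration split
        model = fit_rates(train)
        cal_scores = [1.0 - predict_proba(model, x)[y] for x, y in calib]
        for i in range(fold, n, k):  # held-out samples of this fold
            probs = predict_proba(model, data[i][0])
            sets[i] = {y for y, p in probs.items()
                       if p_value(cal_scores, 1.0 - p) >= alpha}
    return sets


# Synthetic data: (risk category, outcome). Values are illustrative only.
data = [("high" if i % 3 == 0 else "low",
         1 if i % 6 == 0 or i % 10 == 1 else 0) for i in range(60)]

sets = out_of_fold_sets(data, k=5, alpha=0.10)
cov = sum(y in s for (_, y), s in zip(data, sets)) / len(data)
avg_size = sum(len(s) for s in sets) / len(sets)
print(cov, avg_size)
```

The key design point is that the model that scores a held-out sample never saw that sample, and the calibration scores come from data the model was not fitted on.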

3. Results

3.1. Occurrence of pressure injuries

The total incidence of PIs was approximately 27% among patients. About 20% of the injuries occurred in the Surgery ward, followed by the MQ IND and the critical patient wards, with 18% and 17%, respectively. About 62% of the patients were male. The average age of the patients was 59 years. Females were older, with an average age of 65 years, while males had an average age of 57 years. In terms of risk, about 39% of patients were categorized as high-risk, 22% as medium-risk, and 33% as low-risk, all during their admission to the health center units.

3.2. Univariate analysis of features for PI prediction

The results of the univariate feature analysis for PI prediction are shown in Table 3, along with the results of the Chi-square test and the t-test. The first and second columns display the feature names and the categories, respectively. The third and fourth columns describe the group with PI and the group without PI (NPI). The fifth and sixth columns report the statistic and p-value for each test. Age differs significantly between the PI and non-PI groups. Hospital ward, risk assessment, mattress use, containment, and the presence of dermatitis are also significantly associated with PI development.

Table 3.

Comparison of demographic, clinical, and care-related characteristics between patients with PI and without PI (NPI). Continuous variables are expressed as mean ± standard deviation, and categorical variables as counts. Statistics and p-values correspond to Chi-square tests for categorical variables and t-tests for numerical variables; asterisks mark p-values below 0.05.

Feature	Category	PI	NPI	Statistic	p-value
Age		64 ± 20	58 ± 19	3.03	0.003*
Gender				0.00	~1.00
	Female	25	68
	Male	41	111
Ward				19.68	0.006*
	CI Huap	6	12
	MQ IND	12	9
	Medicine	9	47
	Burned	6	11
	TMT	6	19
	UPC	11	19
	Emergency	3	26
	Surgery	13	36
Risk				27.81	~0.0*
	High	41	54
	Medium	16	39
	Low	7	75
	N.R.	2	11
Mattress				28.14	~0.0*
	Yes	47	60
	No	10	119
Position				6.19	0.102
	Yes	23	42
	No	42	133
DVA				0.28	0.595
	Yes	3	4
	No	63	175
Mobilization				0.71	0.398
	Yes	5	7
	No	61	172
Containment				8.63	0.003*
	Yes	16	16
	No	50	163
Dermatitis				5.35	0.021*
	Yes	8	6
	No	58	173
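To make the categorical tests concrete, the sketch below computes a Chi-square statistic for the Containment row of Table 3 (PI: 16 yes, 50 no; NPI: 16 yes, 163 no). It assumes a Yates continuity correction, which the text does not state; under that assumption the statistic comes out at roughly 8.6 with a p-value near 0.003, in line with the reported values. The function name is illustrative.

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Chi-square test with Yates correction for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for obs, r, col in ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1)):
        expected = row_totals[r] * col_totals[col] / n
        diff = max(abs(obs - expected) - 0.5, 0.0)  # continuity correction
        stat += diff * diff / expected
    # For one degree of freedom the chi-square tail probability
    # has the closed form erfc(sqrt(stat / 2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p


# Rows: containment yes / no. Columns: PI / NPI (counts from Table 3).
stat, p = chi2_yates_2x2(16, 16, 50, 163)
print(round(stat, 2), round(p, 4))
```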

3.3. Model performance

Figure 2 compares how the choice of encoding technique affects model performance across algorithms. The effect of encoding is model-dependent: some algorithms benefit from TE, while others perform better with OHE. The DT model shows slightly higher variability with OHE across all metrics, whereas TE yields more stable results, as highlighted by the AUC and F1-score metrics. The LR model shows similar results across encoding methods, although linear models often work better with OHE, as it preserves feature independence. The RF model performs better with OHE across all metrics, particularly in precision and accuracy. The performance of the SVC (linear and polynomial) fluctuates, but TE tends to yield more consistent scores, especially in AUC and F1-score. The SVC RBF displays high variability with both encodings but shows slightly better median metrics with TE. XGB consistently achieves the highest performance with TE, which provides compact numerical representations.

Figure 2.

Performance of the ML models in the classification task under one-hot encoding (OHE) and target encoding (TE).

Overall, the AUC and accuracy tend to be higher with TE for nonlinear models (SVCs, XGB). The RF with OHE shows outstanding precision. In terms of the F1-score, the models achieve a more balanced performance under TE, indicating a better tradeoff between precision and recall. In summary, TE offers advantages for complex nonlinear models, possibly by reducing dimensionality and introducing numerical features. Details of the improvements in the hyperparameter optimization process with cross-validation are shown in Table S11 in the Supplementary Material.
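The dimensionality argument can be illustrated with a small sketch that encodes a categorical ward feature both ways. The ward values, outcomes, and the smoothing formula (a weighted average of the category mean and the global mean) are assumptions chosen for illustration; the source mentions regularized target encoding but does not give its exact form.

```python
def one_hot(values):
    """One column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def target_encode(values, targets, smoothing=10.0):
    """One numeric column: smoothed mean outcome per category."""
    prior = sum(targets) / len(targets)
    stats = {}
    for v, t in zip(values, targets):
        total, count = stats.get(v, (0, 0))
        stats[v] = (total + t, count + 1)
    mapping = {
        v: (total + smoothing * prior) / (count + smoothing)
        for v, (total, count) in stats.items()
    }
    return [mapping[v] for v in values], mapping


# Illustrative values only (1 = PI, 0 = no PI).
wards = ["Surgery", "Surgery", "Medicine", "Medicine", "Medicine", "UPC"]
outcome = [1, 0, 0, 0, 1, 1]

ohe_rows, ohe_columns = one_hot(wards)
te_column, te_mapping = target_encode(wards, outcome)
print(len(ohe_columns), "one-hot columns versus 1 target-encoded column")
```

In a real pipeline the target-encoding mapping would need to be fitted on the training folds only, since it uses the outcome and would otherwise leak label information into validation data.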

The results of the pairwise comparisons depend on the feature encoding methods used during fitting. The Accuracy shows significant differences with both encoding methods (p-values ≤ 5%). The Precision yields significant differences only with OHE (p-values ≤ 1%). The AUC shows significant results for both methods (p-values ≤ 10%). The F1-score yields significant differences only for the TE method (p-value ≤ 1%). Finally, the Brier-Score yields significant results for both methods (p-values ≤ 5%).

Pairwise comparisons using the Wilcoxon signed-rank test reveal significant differences, primarily for the RF and XGB models. The RF shows significantly better accuracy than DT, LR, SVC Linear, and SVC RBF, using both OHE and TE methods (p-values ≤ 10%). The XGB performs significantly better than LR and SVC Linear in terms of accuracy using both encoding methods (p-values ≤ 5%). Regarding precision, RF and XGB appear to be the best-performing models, with significant differences compared to all other models, especially with TE encoding (p-values ≤ 10%). In terms of the Brier-Score, the XGB model outperforms all models except RF at the 10% significance level, while RF performs significantly better than DT and LR depending on the encoding method. The different metrics show significant differences in particular encodings. The RF and XGB models perform slightly better than SVC Linear, SVC Poly, and SVC RBF, using OHE and TE, in terms of AUC. The F1-score shows no significant differences. Details on the results of the pairwise comparisons are shown in Tables S1 to S10 in the Supplementary Material.

In summary, across all evaluation metrics, the boxplots in Figure 2 indicate that the best encoding method is model-dependent. Nevertheless, RF and XGB yield superior performance in the classification task, which is confirmed by the Friedman test (see Supplemental Material for tables with Friedman test results), showing significant performance variation across algorithms. The post-hoc Wilcoxon signed-rank test with correction further reveals that RF and XGB perform significantly better than the remaining models at the 10% significance level. Furthermore, the two models show no appreciable difference in performance, as shown in Table 4, which summarizes the binary classification metrics for both models and both encoding methods. Thus, both visual evidence and statistical tests consistently highlight the dominance of RF and XGB in this classification task. Moreover, TE shows strong predictive performance and better reliability, most visibly in Precision, which increases from 76.5% to 81.5% for RF and from 70.7% to 80.0% for XGB. Given this, TE is selected for further analysis using the CP framework.

Table 4.

Mean performance metrics for the RF and XGB models under each encoding method; values in parentheses denote the standard deviation across cross-validation folds.

Metric	Model	OHE (%)	TE (%)
Accuracy	RF	82.9 (5.12)	84.1 (4.65)
Accuracy	XGB	82.4 (2.23)	84.5 (4.31)
AUC	RF	86.8 (5.69)	88.5 (4.23)
AUC	XGB	85.4 (5.62)	85.5 (5.08)
Precision	RF	76.5 (12.3)	81.5 (15.9)
Precision	XGB	70.7 (12.8)	80.0 (14.0)
F1-score	RF	65.1 (5.10)	66.3 (5.53)
F1-score	XGB	60.6 (5.53)	64.3 (6.64)
Brier-Score	RF	13.3 (1.91)	13.0 (2.24)
Brier-Score	XGB	13.3 (1.52)	12.6 (1.86)

3.4. Reliability of proposed models

Different significance levels α are tested, as shown in Figure 3. The upper panel shows the ideal target coverage as a red dashed line. The closer the empirical curve is to this target, the better the calibration of the conformal predictor. The blue (XGB) and green (RF) curves show the empirical coverage achieved by the two models. The models slightly underperform for some values of α. The lower panel shows the average prediction set size for different values of α. As α increases, confidence decreases, and the prediction sets become smaller. XGB tends to yield slightly smaller prediction sets than RF for the same α values, that is, fewer predicted labels on average, without compromising the model’s coverage.

Figure 3.

Empirical coverage and average prediction set size of conformal predictors applied to XGB and RF across different significance levels (α).

The choice of α involves a trade-off between reliability and efficiency. As shown in Figure 4, the behavior of the CP metrics is model-dependent. PI prevention has asymmetric costs, since missed high-risk cases lead to increased hospitalizations and serious harm, so the CP framework must achieve high but practical coverage with actionable outputs. If α is too small, prediction sets become large and the model becomes uninformative; the choice of α is thus analogous to selecting clinical risk cutoffs. Figure 4 shows the behavior of coverage and set size for various α values. Regarding coverage, both models yield similar results; however, for α ≥ 0.10, the RF model shows greater variability than XGB. In terms of efficiency, the RF model shows larger set sizes for α values above 0.10. For this setting, α = 0.10 targets a nominal coverage of 90%, a pragmatic compromise that preserves reliability guarantees while keeping prediction sets sufficiently specific to support actionable clinical decisions. Furthermore, this choice aligns with the goal of uncertainty-aware decision support rather than fully automated classification.

Figure 4.

Detailed results of coverage and set size for various alpha values in the conformal prediction setting.

Empirical coverage slightly above the nominal level is expected in CP frameworks because of the discrete nature of nonconformity scores and finite-sample effects, which lead to conservative prediction sets. In practice, mild over-coverage is preferable to under-coverage in clinical decision-support contexts, as it ensures that the true outcome is included in the prediction set more frequently than the nominal guarantee requires.

Furthermore, at α = 0.10, XGB achieves an average set size of 1.34. In a binary classification setting, and assuming no empty prediction sets, this implies that the model produces a single, confident prediction in the majority of cases (around 66%) and assigns two-label prediction sets only when model uncertainty is high (around 34%).
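The arithmetic behind this reading can be made explicit. With two classes, the average set size equals f1 + 2·f2, where f1 and f2 are the fractions of single-label and two-label sets and f0 + f1 + f2 = 1, so f2 = average − 1 + f0. The sketch below applies this identity; the function name and the optional share of empty sets (`frac_empty`) are introduced here for illustration, and the 66% and 34% split holds only when no sets are empty.

```python
def set_size_breakdown(avg_size, frac_empty=0.0):
    """Split an average binary set size into single- and two-label shares."""
    frac_double = avg_size - 1.0 + frac_empty
    frac_single = 1.0 - frac_empty - frac_double
    return frac_single, frac_double


# Reported XGB average set size at alpha = 0.10, assuming no empty sets.
single, double = set_size_breakdown(1.34)
print(round(single, 2), round(double, 2))
```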

Figure 5 breaks down the CP results in terms of coverage and set size for α = 0.10. The CP framework maintained coverage close to the nominal level across all patient risk groups. Low-risk patients exhibit the highest efficiency, with prediction sets that almost always contain a single label, indicating high confidence in model predictions. In contrast, medium- and high-risk patients produced larger prediction sets, reflecting increased uncertainty in the classification task. Across models, RF tends to produce larger prediction sets, whereas XGB exhibits more conservative behavior.

Figure 5.

Coverage and set size for patients at low, medium, and high risk in the conformal prediction framework, with the significance level fixed at α = 10%.

4. Discussion

This study shows the potential of integrating CP into ML frameworks for predicting PIs. The findings highlight that conventional ML models, such as RF and XGB, can effectively identify patterns associated with the risk of developing PIs. The models tested in this study outperformed baseline methods, specifically unsupervised explorations and scales such as Braden. The literature on such scales reports results of about 70% in terms of AUC and predictive performance,^62,63 while the ML models used in this analysis reached an AUC near 90% and predictive performance higher than 80%. Furthermore, the addition of CP introduces a crucial layer of uncertainty quantification, enhancing the reliability, interpretability, and clinical utility of the predictions.

Our results suggest that CP-enabled models can provide valid prediction sets, enabling clinicians to make better-informed decisions. In contrast to traditional models that output point predictions, conformal predictors communicate the degree of confidence associated with each prediction, allowing clinical teams to identify when predictions are uncertain and require additional patient assessment. Uncertainty-aware approaches are increasingly recognized as an important component of reliable medical AI systems.⁶⁴ This property is particularly valuable in clinical contexts where missed high-risk cases can lead to severe complications, extended hospitalization, or mortality.⁶⁵

From an operational standpoint, incorporating CP could improve clinical workflow efficiency. High-confidence predictions can trigger preventive measures (e.g., repositioning schedules and advanced support surfaces), while uncertain predictions can prompt manual reassessments. This stratified approach can optimize resource allocation and potentially reduce the economic burden of hospital-acquired PIs, which remain substantial worldwide.^66,67

From a clinical workflow perspective, prediction sets produced by CP may support differentiated decision pathways. When the model produces a confident prediction indicating a high risk of PI development, the patient could be automatically flagged for preventive interventions such as repositioning protocols or pressure-relieving surfaces. In contrast, when the prediction set reflects higher uncertainty, additional clinical assessment by nursing staff may be required before implementing targeted interventions. In this way, uncertainty-aware predictions help allocate clinical attention where algorithmic confidence is lower, supporting safer human-in-the-loop decision-making. In practice, uncertainty-aware predictions could be communicated to clinicians through simple categorical outputs derived from the prediction sets. Singleton prediction sets indicate confident risk classification, whereas larger prediction sets reflect higher uncertainty and may prompt enhanced clinical evaluation.
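One possible way to turn prediction sets into the categorical outputs described above is sketched below. The label names ("PI", "no PI"), the action strings, and in particular the "standard care" pathway for a confident low-risk prediction are assumptions for illustration; the source does not prescribe specific actions.

```python
def triage(pred_set):
    """Map a conformal prediction set to a hypothetical clinical pathway."""
    if pred_set == {"PI"}:
        return "flag for preventive interventions"
    if pred_set == {"no PI"}:
        return "standard care"  # assumed pathway, not stated in the source
    # Two labels (or an empty set) signal that the model is uncertain.
    return "refer for additional nursing assessment"


examples = [{"PI"}, {"no PI"}, {"PI", "no PI"}, set()]
actions = [triage(s) for s in examples]
print(actions)
```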

Nevertheless, several limitations warrant attention. The quality of the calibration set critically influences CP performance. Non-representative calibration data may introduce errors in the prediction regions. Additionally, conformal methods introduce modest computational overhead, which may affect real-time deployment in large-scale monitoring systems.⁶⁸ Future research should examine adaptive and online CP variants that update the framework dynamically as new data becomes available. Finally, although the current findings are promising, prospective validation and multi-center studies are needed to confirm generalizability across diverse patient cohorts and care settings.

Regarding computational complexity, the proposed framework relies on ensemble learning models such as RF and XGB, whose computational cost mainly depends on the number of trees and dataset size. Given the relatively small number of predictors and observations in this paper, model training and parameter specification were computationally inexpensive. The CP stage adds only a calibration step based on nonconformity scores, incurring minimal additional computational cost.

5. Clinical interpretation

The integration of CP into ML frameworks for PI prediction represents a significant advancement in clinical risk assessment. Traditional predictive tools rely on subjective criteria and often fail to account for the complex, dynamic interactions among physiological, environmental, and procedural factors that contribute to PI development. The study shows that CP improved model calibration and provided statistically valid uncertainty measures without compromising predictive performance. These calibrated uncertainty estimates enhance clinicians’ ability to interpret model outputs, enabling more transparent and trustworthy diagnostic support for patients at varying risk levels.^39,69

From a clinical standpoint, the ability to distinguish high-certainty from uncertain predictions offers an operational advantage in healthcare settings. When models indicate high confidence in risk classification, preventive interventions such as repositioning schedules, advanced mattress use, or early mobilization can be implemented immediately. Conversely, uncertain predictions can trigger additional clinical evaluations or monitoring, thereby optimizing the allocation of limited resources and reducing the likelihood of both false alarms and the oversight of high-risk patients.⁷⁰ Such uncertainty-aware predictions contribute to personalized prevention strategies, potentially lowering the incidence of hospital-acquired PIs and associated morbidity, mortality, and economic burden.⁷¹

6. Limitations

Although the methodology and results were promising, several limitations should be considered when interpreting the findings. First, the analysis was conducted at a single tertiary-level healthcare institution with a cross-sectional design, which may limit the generalizability of the results to other clinical settings with different patient typologies, care protocols, or documentation practices. Consequently, the present work should be interpreted primarily as a methodological proof-of-concept rather than a directly deployable clinical prediction tool. External validation using multicenter datasets is necessary to assess the robustness and transportability of both the ML models and the CP framework.

Second, the sample size (n = 245) is relatively modest, which limits the applicability of high-cardinality approaches. Although cross-validation and class weighting were implemented to mitigate overfitting, the limited number of PIs may still compromise the stability of model estimates and reduce the precision of uncertainty quantification under the CP approach. This limitation reduces efficiency, leading to larger prediction sets or more frequent ambiguous outputs. Importantly, this behavior reflects increased uncertainty rather than model failure: CP maintains its theoretical guarantees even in small samples. Nevertheless, a larger sample would improve the generalizability of the results and enhance uncertainty quantification with respect to efficiency, stability, and clinical usefulness.

Third, the variability in nursing records, potential inconsistencies in clinical annotations, and missing or incomplete data could introduce bias during preprocessing. Although feature engineering methods such as aggregation, conflict-resolution rules, and regularized target encoding were applied, residual measurement error cannot be entirely excluded.

Fourth, CP theoretically requires exchangeability between calibration and future observations, a weaker condition than the i.i.d. assumption. In real clinical settings, this assumption may be challenged by temporal changes in clinical practice, heterogeneous patient populations, or differences in documentation protocols across institutions. In our study, the exchangeability assumption is plausible because the data originate from a single healthcare center, and the CP framework is applied within cross-validation splits derived from the same underlying data-generating process. Nevertheless, potential issues may arise when applying the framework in different institutions or evolving clinical settings.⁷² Future research should evaluate adaptive or online CP approaches capable of maintaining validity under distributional shifts and heterogeneous clinical environments.

Fifth, from a computational perspective, the proposed framework is lightweight and feasible for practical implementation. The models evaluated rely on ensemble tree methods, whose computational costs primarily depend on the number of observations and the number of estimators. Given the modest dataset size and the bounded hyperparameter space used in the optimization process, model training was computationally inexpensive and completed on a standard workstation. The CP stage introduces only a training-calibration stage based on nonconformity scores and quantile calculations, adding negligible computational overhead. In deployment scenarios, model training and calibration can be performed offline, while real-time prediction for new patients only requires evaluating the model and applying the conformal threshold, operations that can be executed in milliseconds.^39,40,73 However, while computational demands were manageable, large-scale deployment across hospital-wide monitoring systems may require optimized pipelines or approximate CP methods to maintain proper real-time performance.
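The low prediction-time cost can be illustrated with a sketch in which calibration scores are sorted once offline, so that each new p-value needs only a binary search. The class name, the toy scores, and the label names are assumptions introduced here; the acceptance rule follows the p-value construction in (5) and (6).

```python
import bisect


class ConformalCalibrator:
    """Stores sorted calibration scores for fast p-value lookup."""

    def __init__(self, cal_scores):
        self.sorted_scores = sorted(cal_scores)  # done once, offline

    def p_value(self, score):
        m = len(self.sorted_scores)
        # bisect_left gives the index of the first score >= `score`,
        # so m - index counts calibration scores at least as large.
        n_ge = m - bisect.bisect_left(self.sorted_scores, score)
        return (n_ge + 1) / (m + 1)

    def predict_set(self, probs, alpha):
        return {y for y, p in probs.items()
                if self.p_value(1.0 - p) >= alpha}


# Illustrative calibration scores only.
calibrator = ConformalCalibrator([0.05, 0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.15])
result = calibrator.predict_set({"PI": 0.75, "no PI": 0.25}, alpha=0.20)
print(result)
```

Each lookup costs O(log m) in the number of calibration scores, which supports the point that the conformal step adds little to the cost of evaluating the model itself.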

Future research may complement the quantitative evaluation presented in this study with qualitative assessments involving clinicians to explore how uncertainty-aware predictions influence decision-making and workflow integration in real clinical environments.

7. Conclusions

This study highlights the potential value of integrating ML techniques with CP to enhance the prediction and prevention of PIs in healthcare settings. Traditional ML models have shown promise in identifying patient risk, but their limited ability to express uncertainty impedes clinical adoption. By integrating CP, predictive models can generate statistically valid confidence estimates, allowing clinicians to interpret model outputs with greater transparency and trust.

The incorporation of CP adds a critical layer of uncertainty quantification, allowing predictions to be accompanied by statistically grounded confidence estimates. This capability is particularly valuable in the context of PI prevention, where misclassifications can have severe consequences. By providing calibrated prediction sets, CP enables clinicians to distinguish between high-certainty and uncertain predictions, facilitating more informed, risk-adaptive decision-making.

From a broader perspective, integrating CP contributes to the development of safe and reliable AI-driven clinical decision support systems. It enhances the transparency of model outputs, supports compliance with ethical and regulatory frameworks, and strengthens clinicians’ confidence in algorithmic recommendations. Moreover, CP offers adaptability across diverse clinical settings, addressing data heterogeneity and improving the generalization of predictive models.

However, realizing the full potential of CP in clinical workflows will require further validation on larger, multi-institutional datasets. Future research should evaluate the framework in prospective, multi-center settings, assessing its impact on clinical outcomes, such as the incidence of PI, and on the interaction between clinicians and uncertainty-aware predictions. Further analysis should also focus on adaptive conformal prediction methods that maintain validity under evolving data distributions. Additionally, future work should explore the integration of CP models with Electronic Health Record systems to enable real-time decision support and assess the economic impact and cost-effectiveness of CP-based interventions to reduce the prevalence and cost of PIs.

CP represents a promising methodological step toward more uncertainty-aware, data-driven wound care. By providing reliable, uncertainty-aware risk predictions, CP bridges the gap between algorithmic accuracy and clinical applicability, offering a structured methodological framework to support the future development of safer and more effective PI prevention strategies.

Supplemental material

Supplemental material for Enhancing clinical reliability in pressure injury prediction: A conformal prediction approach with machine learning models by Fredy Barriga-Gallegos, Gonzalo Ríos-Vásquez, Hanns de la Fuente-Mella, Karen Ulloa Catalán and Naldy Febré Vergara in Digital Health.

Footnotes

Ethical considerations

This study used anonymized secondary data originally collected as part of routine clinical care. No personally identifiable information was accessed, and all records were de-identified prior to analysis. Formal authorization to access and use the data for research purposes was granted by the corresponding tertiary-level healthcare institution. The study protocol was reviewed and approved by the Scientific Ethics Committee of Nursing (CECENF), Faculty of Nursing, Universidad Andrés Bello (Approval No. L1CECENF 12 2021; approved on April 30, 2021). In accordance with institutional and national ethical guidelines, the requirement for individual informed consent was waived because the study involved only secondary, non-identifiable data and did not involve direct patient interaction. All procedures were conducted in accordance with the principles of the Declaration of Helsinki.

Author contributions

Conceptualization: F.B.-G., G.R.-V., H.d.l.F.-M., N.F.; Methodology: F.B.-G., G.R.-V., H.d.l.F.-M.; Software: F.B.-G., G.R.-V.; Validation: F.B.-G., G.R.-V.; Formal Analysis: F.B.-G., G.R.-V.; Investigation: F.B.-G., G.R.-V., H.d.l.F.-M., N.F.; Resources: K.U.C.; Data Curation: F.B.-G., G.R.-V.; Writing-original draft: F.B.-G., G.R.-V., H.d.l.F.-M.; Writing-review and editing: F.B.-G., G.R.-V., H.d.l.F.-M., N.F.; Visualization: F.B.-G., G.R.-V.; Supervision: N.F. All authors have read and agreed to the published version of the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research work of Gonzalo Ríos-Vásquez is partially supported by the National Agency for Research and Development (ANID), Scholarship Program, Subdirectorate of Human Capital, National Doctorate 2024, code 21240875.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

Hajhosseini

Longaker

Gurtner

. Pressure Injury. Annals of Surgery 2020; 271(4):4th ed. 671–679. https://doi.org/10.1097/SLA.0000000000003567

Jiang

Hou

Dong

, et al. Skin temperature and vascular attributes as early warning signs of pressure injury. Journal of tissue viability 2020; 258–263. https://doi.org/10.1016/j.jtv.2020.08.001

Liu

Rawson

Islam

, et al. Impact of pressure injuries on health-related quality of life: A systematic review. Wound Repair and Regeneration 2024; 33(1):1st ed. https://doi.org/10.1111/wrr.13236

Sen

Gordillo

Roy

, et al. Human skin wounds: A major and snowballing threat to public health and the economy. Wound Repair and Regeneration 2009; 17(6):6th ed. 763–771. https://doi.org/10.1111/j.1524-475X.2009.00543.x

Dube

Sidambe

Verdon

, et al. Risk factors associated with heel pressure ulcer development in adult population: A systematic literature review. Journal of tissue viability 2022; 1st ed. 84–103. https://doi.org/10.1016/j.jtv.2021.10.007

Alderden

Cummins

Pepper

, et al. (Midrange Braden subscale scores are associated with increased risk for pressure injury development among critical care patients. Journal of Wound Ostomy & Continence Nursing 2017; 44nd ed. 420–428. https://doi.org/10.1097/WON.0000000000000349

Serafin

Graziadio

Velickovic

, et al. A systematic review of clinical practice guidelines and other best practice recommendations for pressure injury risk assessment in the United States. Wound Repair and Regeneration 2025; 33(2): 2nd ed. https://doi.org/10.1111/wrr.70016

Velozo

Hong

Venâncio

, et al. Pressure injury: update on general concepts, clinical aspects, and laboratory findings – Part I. Anais Brasileiros de Dermatologia 2025; 100(5): 5th ed. https://doi.org/10.1016/j.abd.2025.501187

Alshahrani

Middleton

Rolls

, et al. Pressure injury prevalence in critical care settings: An observational pre-post intervention study. Nursing Open 2024; 11(2): 11th ed. https://doi.org/10.1002/nop2.2110

10.

Aljezawi

Al Qadire

Al Omari

, et al. Hospital acquired pressure injuries prevalence and preventive measures in Omani critical care units: A multicenter cross-sectional study. Journal of Tissue Viability 2024; 33(4): 33rd ed. 808–813.https://doi.org/10.1016/j.jtv.2024.11.001

11.

Sugathapala

RDUP

Latimer

Balasuriya

, et al. Prevalence and incidence of pressure injuries among older people living in nursing homes: A systematic review and meta-analysis . International Journal of Nursing Studies 2023; 148: https://doi.org/10.1016/j.ijnurstu.2023.104605

12.

Reddy

Gill

Kalkar

, et al. Treatment of pressure ulcers: a systematic review. Jama JAMA, 2008; 300 2647–2662. https://doi.org/10.1001/jama.2008.778

13.

Rondinelli

Zuniga

Kipnis

, et al. Hospital-Acquired Pressure Injury: Risk-Adjusted Comparisons in an Integrated Healthcare Delivery System. Nursing Research 2018; 67(1): 67th ed. 16–25. https://doi.org/10.1097/NNR.0000000000000258

14.

Padula

Delarmente

. The national cost of hospital‐acquired pressure injuries in the United States. International wound journal 2019; 16 634–640. https://doi.org/10.1111/iwj.13071

15. Ramos-Sánchez, Martínez-Beltrán, Egea-Zerolo, et al. Cost of Illness of Pressure Injuries in the Inpatient Area of a Socio-Health Center in Spain. Advances in Skin & Wound Care 2025; 38(2): E6–E11. https://doi.org/10.1097/ASW.0000000000000272

16. Triantafyllou, Chorianopoulou, Kourkouni, et al. Prevalence, incidence, length of stay and cost of healthcare-acquired pressure ulcers in pediatric populations: A systematic review and meta-analysis. International Journal of Nursing Studies 2021; 115. https://doi.org/10.1016/j.ijnurstu.2020.103843

17. Braden, Bergstrom. A Conceptual Schema for the Study of the Etiology of Pressure Sores. Rehabilitation Nursing 1987; 12(1): 8–12. https://doi.org/10.1002/j.2048-7940.1987.tb00541.x

18. Shi, Dumville, Cullum, et al. Beds, overlays and mattresses for preventing and treating pressure ulcers: an overview of Cochrane Reviews and network meta-analysis. Cochrane Database of Systematic Reviews 2021; 8. https://doi.org/10.1002/14651858.CD013761.pub2

19. Gillespie, Walker, Latimer, et al. Repositioning for pressure injury prevention in adults: An abridged Cochrane systematic review and meta-analysis. International Journal of Nursing Studies 2021; 120. https://doi.org/10.1016/j.ijnurstu.2021.103976

20. Barriga-Gallegos, Ríos-Vásquez, Morgado, et al. Early prediction of pressure injury risk in hospitalized patients using supervised machine learning models based on nursing records. Scientific Reports 2026; 16(1). https://doi.org/10.1038/s41598-026-35709-w

21. Kauttonen, Rousi, Alamäki. Trust and Acceptance Challenges in the Adoption of AI Applications in Health Care: Quantitative Survey Analysis. Journal of Medical Internet Research 2025; 27: e65567. https://doi.org/10.2196/65567

22. Dhanka, Kumar, Maini, et al. Padding interpolation, median imputation, RobustScalar, and particle swarm optimization with heterogeneous classifiers: a robust combination for effective heart disease diagnosis. Frontiers in Medicine 2026; 12. https://doi.org/10.3389/fmed.2025.1721740

23. Dhanka, Bhardwaj, Maini. Comprehensive analysis of supervised algorithms for coronary artery heart disease detection. Expert Systems 2023; 40(7): e13300. https://doi.org/10.1111/exsy.13300

24. Arvanitis, White, Harrison, et al. A method for machine learning generation of realistic synthetic datasets for validating healthcare applications. Health Informatics Journal 2022; 28(2). https://doi.org/10.1177/14604582221077000

25. Iyortsuun, Kim S-H, Jhon, et al. A Review of Machine Learning and Deep Learning Approaches on Mental Health Diagnosis. Healthcare 2023; 11: 285. https://doi.org/10.3390/healthcare11030285

26. Kline, Wang, et al. Multimodal machine learning in precision health: A scoping review. npj Digital Medicine 2022; 5(1): 171. https://doi.org/10.1038/s41746-022-00712-8

27. Kourou, Exarchos, Papaloukas, et al. Applied machine learning in cancer research: A systematic review for patient diagnosis, classification and prognosis. Computational and Structural Biotechnology Journal 2021; 19: 5546–5555. https://doi.org/10.1016/j.csbj.2021.10.006

28. Preti, Ardito, Compagni, et al. Implementation of Machine Learning Applications in Health Care Organizations: Systematic Review of Empirical Studies. Journal of Medical Internet Research 2024; 26. https://doi.org/10.2196/55897

29. Zhou, Yang, et al. A systematic review of predictive models for hospital-acquired pressure injury using machine learning. Nursing Open 2023; 10(3): 1234–1246. https://doi.org/10.1002/nop2.1429

30. Toffaha, Simsekler MCE, Atif. Leveraging artificial intelligence and decision support systems in hospital-acquired pressure injuries prediction: A comprehensive review. Artificial Intelligence in Medicine 2023; 141. https://doi.org/10.1016/j.artmed.2023.102560

31. Pei, Guo, Tao, et al. Machine learning-based prediction models for pressure injury: A systematic review and meta-analysis. International Wound Journal 2023; 20(10): 4328–4339. https://doi.org/10.1111/iwj.14280

32. Padula, Armstrong, Pronovost, et al. Predicting pressure injury risk in hospitalised patients using machine learning with electronic health records: a US multilevel cohort study. BMJ Open 2024; 14(4). https://doi.org/10.1136/bmjopen-2023-082540

33. Nguyen K-A-N, Patel, Edalati, et al. Electronic-Medical-Record-Driven Machine Learning Predictive Model for Hospital-Acquired Pressure Injuries: Development and External Validation. Journal of Clinical Medicine 2025; 14(4): 1175. https://doi.org/10.3390/jcm14041175

34. Chang S-C, Lai S-M, M-W, et al. Improving machine learning algorithm for risk of early pressure injury prediction in admission patients using probability feature aggregation. DIGITAL HEALTH 2025; 11. https://doi.org/10.1177/20552076251323300

35. Sotoodeh, Zhang, et al. An AdaBoost-based algorithm to detect hospital-acquired pressure injury in the presence of conflicting annotations. Computers in Biology and Medicine 2024; 168. https://doi.org/10.1016/j.compbiomed.2023.107754

36. Charon, Wuillemin CP-H, Havrengg-Théry, et al. One Month Prediction of Pressure Ulcers in Nursing Home Residents with Bayesian Networks. Journal of the American Medical Directors Association 2024; 25(6). https://doi.org/10.1016/j.jamda.2024.01.014

37. Walther, Heinrich, Schmitt, et al. Prediction of inpatient pressure ulcers based on routine healthcare data using machine learning methodology. Scientific Reports 2022; 12(1). https://doi.org/10.1038/s41598-022-09050-x

38. Shinkawa, Mugita, Takahashi, et al. A novel skin temperature estimation system for predicting pressure injury occurrence based on continuous body sensor data: A pilot study. Clinical Biomechanics 2025; 122. https://doi.org/10.1016/j.clinbiomech.2024.106413

39. Vazquez, Facelli. Conformal prediction in clinical medical sciences. Journal of Healthcare Informatics Research 2022; 6. https://doi.org/10.1007/s41666-021-00113-8

40. Vovk, Gammerman, Shafer. Algorithmic learning in a random world. 1st ed. MA: Springer US, 2022. https://doi.org/10.1007/978-3-031-06649-8

41. Sreenivasan, Vaivade, Noui, et al. Conformal prediction enables disease course prediction and allows individualized diagnostic uncertainty in multiple sclerosis. npj Digital Medicine 2025; 8(1): 224. https://doi.org/10.1038/s41746-025-01616-z

42. Papangelou, Kyriakidis, Natsiavas, et al. Reliable machine learning models in genomic medicine using conformal prediction. Frontiers in Bioinformatics 2025; 5. https://doi.org/10.3389/fbinf.2025.1507448

43. Zhou. Conformal Depression Prediction. IEEE Transactions on Affective Computing 2025. https://doi.org/10.1109/TAFFC.2025.3542023

44. Zhan, Wang, Yang, et al. An electronic nose-based assistive diagnostic prototype for lung cancer detection with conformal prediction. Measurement 2020; 158. https://doi.org/10.1016/j.measurement.2020.107588

45. Afzali, Maleki, Tavakkoli-Moghaddam, et al. Coronary artery disease diagnosis by integrating conformal prediction with a multi-objective evolutionary algorithm. Health and Technology 2025; 15(6): 1119–1133. https://doi.org/10.1007/s12553-025-01007-0

46. Zhan, Zheng, et al. Reliability-enhanced data cleaning in biomedical machine learning using inductive conformal prediction. PLOS Computational Biology 2025; 21(2). https://doi.org/10.1371/journal.pcbi.1012803

47. Olsson, Kartasalo, Mulliqi, et al. Estimating diagnostic uncertainty in artificial intelligence assisted pathology using conformal prediction. Nature Communications 2022; 13(1): 7761. https://doi.org/10.1038/s41467-022-34945-8

48. VanGilder, Cox, Edsberg, et al. Pressure Injury Prevalence in Acute Care Hospitals With Unit-Specific Analysis: Results From the International Pressure Ulcer Prevalence (IPUP) Survey Database. Journal of Wound Ostomy & Continence Nursing 2011; 48: 492–503. https://doi.org/10.1097/WON.0000000000000817

49. Zhai, Song, Liu, et al. A Chi-Square Statistics Based Feature Selection Method in Text Classification. In: IEEE 9th International Conference on Software Engineering and Service Science (ICSESS) 2018; 160–163. https://doi.org/10.1109/ICSESS.2018.8663882

50. Pargent, Pfisterer, Thomas, et al. Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. Computational Statistics 2022; 37: 2671–2692. https://doi.org/10.1007/s00180-022-01207-6

51. Apicella, Isgrò, Prevete. Don’t push the button! Exploring data leakage risks in machine learning and transfer learning. Artificial Intelligence Review 58(11): 339. https://doi.org/10.1007/s10462-025-11326-3

52. Ríos-Vásquez, de la Fuente-Mella, Ceroni-Díaz, et al. Group-specific SVM with bilevel programming methods for parameter optimization and explainable AI in urban quality of life prediction. IEEE Access 2025; 13: 130475–130491. https://doi.org/10.1109/ACCESS.2025.3592156

53. Zantvoort, Nacke, Görlich, et al. Estimation of minimal data sets sizes for machine learning predictions in digital mental health interventions. npj Digital Medicine 2024; 7(1): 361. https://doi.org/10.1038/s41746-024-01360-w

54. Dhilsath, Samuel. Hyperparameter tuning of ensemble classifiers using grid search and random search for prediction of heart disease. Computational intelligence and healthcare informatics 2021; 139–158. https://doi.org/10.1002/9781119818717.ch8

55. Shyalika, Wickramarachchi, Sheth. A Comprehensive Survey on Rare Event Prediction. ACM Computing Surveys 2024. https://dl.acm.org/doi/10.1145/3699955

56. Saekhu, Berlilana, Saputra DIS. Comparative Analysis of Data Balancing Techniques for Machine Learning Classification on Imbalanced Student Perception Datasets. Journal Teknik Informatika (Jutif) 2025; 6(2): 627–640. https://doi.org/10.52436/1.jutif.2025.6.2.4286

57. de Giorgio, Cola, Wang. Systematic review of class imbalance problems in manufacturing. Journal of Manufacturing Systems 2023; 71: 620–644. https://doi.org/10.1016/j.jmsy.2023.10.014

58. Fergus, Chalmers. Performance evaluation metrics. Applied Deep Learning: Tools, Techniques, and Implementation 2022; 115–138. https://doi.org/10.1007/978-3-031-04420-5_5

59. Cleophas, Zwinderman. Non-parametric Tests for Three or More Samples (Friedman and Kruskal-Wallis). Clinical data analysis on a pocket calculator: understanding the scientific methods of statistical reasoning and hypothesis testing. Springer International Publishing, 2016; 193–197. https://doi.org/10.1007/978-3-319-27104-0_34

60. Vovk. Conditional validity of inductive conformal predictors. Asian Conference on Machine Learning 2013; 475–490. https://proceedings.mlr.press/v25/vovk12.html

61. Toccaceli. Introduction to conformal predictors. Pattern Recognition 2022; 124: 108507. https://doi.org/10.1016/j.patcog.2021.108507

62. Tao, Zhang, et al. Comparison of the predictive validity of the Braden and Waterlow scales in intensive care unit patients: A multicentre study. Journal of Clinical Nursing 2023; 33(5): 1809–1819. https://doi.org/10.1111/jocn.16946

63. Aslan Basli, Yavuz Van Giersbergen, Özdemir. Comparison of the predictive validity of the Braden, Munro and 3S scales in surgical patients. Journal of Tissue Viability 2024; 33(4): 657–665. https://doi.org/10.1016/j.jtv.2024.06.011

64. Zeng, Ahmed, Tunio. Exploring Uncertainty in Medical Federated Learning: A Survey. Electronics 2025; 14(20): 4072. https://doi.org/10.3390/electronics14204072

65. Huang. The Visualization of the Importance of Covariance Importance in a Machine Learning Model for Advanced Liver Fibrosis in a Nationally Representative Sample. JGH Open 2025; 9(7). https://doi.org/10.1002/jgh3.70200

66. Abdulai ASB, Storm, Jean. “I don’t know”: An uncertainty-aware machine learning model for predicting patient disposition at emergency department triage. International Journal of Medical Informatics 2025; 201. https://doi.org/10.1016/j.ijmedinf.2025.105957

67. Amiot, Potier. Artificial Intelligence (AI) and Emergency Medicine: Balancing Opportunities and Challenges. JMIR Medical Informatics 2025; 13. https://doi.org/10.2196/70903

68. Hong, Eclov NCW, Stephens, et al. Implementation of machine learning in the clinic: challenges and lessons in prospective deployment from the System for High Intensity EvaLuation During Radiation Therapy (SHIELD-RT) randomized controlled study. BMC Bioinformatics 2022; 23(12): 408. https://doi.org/10.1186/s12859-022-04940-3

69. Majlatow, Shakil, Emrich, et al. Uncertainty-Aware Predictive Process Monitoring in Healthcare: Explainable Insights into Probability Calibration for Conformal Prediction. Applied Sciences 2025; 15(14). https://doi.org/10.3390/app15147925

70. Yang, Chen, et al. Development and Validation of an Interpretable Conformal Predictor to Predict Sepsis Mortality Risk: Retrospective Cohort Study. Journal of Medical Internet Research 2024; 26. https://doi.org/10.2196/50369

71. Sun, Shao, et al. Practical AI application in psychiatry: historical review and future directions. Molecular Psychiatry 2025; 30: 4399. https://doi.org/10.1038/s41380-025-03072-3

72. Zhou, Chen, Gui, et al. Conformal Prediction: A Data Perspective. ACM Computing Surveys 2025; 58(2): 1–37. https://doi.org/10.1145/3736575

73. Liaw, Sheridan, et al. Development and Evaluation of Conformal Prediction Methods for Quantitative Structure–Activity Relationship. ACS Omega 2024; 9(27): 29478–29490. https://doi.org/10.1021/acsomega.4c02017
