
Abstract

Objective

The objective of this study was to assess the predictability of admissions to a MH inpatient ward using ML models, based on routine data collected during triage in EDs. This research sought to identify the most effective ML model for this purpose while considering the practical implications of model interpretability for clinical use.

Methods

The study utilised existing data from January 2016 to December 2021. After data pre-processing, an exploratory analysis revealed the non-linear nature of the dataset. Six different ML models were tested: Random Forest, XGBoost, CatBoost, k-Nearest Neighbours (kNN), Explainable Boosting Machine (EBM) using InterpretML, and Support Vector Machine using Support Vector Classification (SVC). The performance of these models was evaluated using various metrics including the Matthews Correlation Coefficient (MCC).

Results

Among the models evaluated, the CatBoost model achieved the highest MCC score of 0.1952, demonstrating superior balanced accuracy and predictive power, particularly in correctly identifying positive cases. The InterpretML model also performed well, with an MCC score of 0.1914. While CatBoost showed strong predictive capabilities, its complexity poses challenges for clinical interpretation. Conversely, the InterpretML model, though slightly less powerful, offers better transparency and is more practical for clinical use.

Conclusion

The findings suggest that the CatBoost model is a compelling choice for scenarios prioritising the detection of positive cases. However, the InterpretML model's ease of interpretation makes it more suitable for clinical application. Integrating explanation methods like SHAP with non-linear models could enhance model transparency and foster clinician trust. Further research is recommended to refine non-linear models within decision support systems, explore multi-source data integration, understand clinician attitudes towards ML, and develop real-time data collection systems. This study highlights the potential of ML in predicting MH admissions from ED data while stressing the importance of interpretability, ethical considerations, and ongoing validation for successful clinical implementation.

Keywords

Mental health, emergency department, First Nations, Aboriginal and Torres Strait Islander

Introduction

Suicide and self-harm are global public health concerns with devastating consequences for individuals, families, and communities.¹ Approximately 5% of the adult world population have depression,¹ and approximately 700,000 people die by suicide each year.¹ In Australia, there was an average of nine deaths by suicide each day in 2021.² Many people seek initial, acute or emergency mental health (MH) assistance at their local hospital emergency department (ED), with recent research identifying three main themes that characterise help-seeking dynamics for MH conditions in ED: First Nations MH, suicidal ideation (SI), and access and egress pathways.^3,4 EDs are busy environments populated by clinical, administrative, security and visiting staff (ambulance, police and other agencies), with patients typically arriving by their own means or via ambulance, police, carers, or family. They are characterised by loud noise, bright lights and busy activity, and are usually open 24 h a day, every day.⁵ ED staff frequently express feeling unequipped to provide care for MH patients,⁵ and longer waiting periods are common following arrival at the ED for MH presentations,⁶ leading to dissatisfaction among many MH patients in ED.⁷ For those unwell enough to require admission or transfer to an inpatient facility, the stay in the ED is generally longer than for those discharged from ED.⁸ One potential approach to detect MH presentations earlier in the ED experience is to enhance the analysis and interpretation of regularly collected data pertaining to patient presentations in EDs. This approach involves the application of Machine Learning (ML), which has emerged as a powerful tool in healthcare, offering promising avenues for predicting significant risks such as suicide and self-harm.^9–13 Therefore, the aim of this research is to assess the predictability of admissions to a MH inpatient ward using ML modelling from routine data collected at the point of triage in the ED.

However, the usage of ML is not without its challenges and ethical considerations. One of the crucial issues is the potential for bias in ML models when dealing with vulnerable populations and the possibility of producing further discrimination.¹⁴ Ensuring transparency and interpretability of ML models is equally important, as it is essential for gaining support among clinicians and patients.¹⁵ Previous research into the use of ML in MH for the prediction of SI and self-harm behaviours has yielded positive results. For example, Reale, Novak¹³ researched the development of an intervention that used predictive analytics to inform care teams about their patients' risk of suicide attempts. Iorfino, Ho¹¹ developed self-harm monitoring ML models, which helped identify a large subpopulation that would benefit from targeted interventions and helped identify factors that contribute to self-harming behaviours. Similarly, research has been undertaken into the use of ML for the prediction of admission by Hong, Haimovich,¹⁰ who tested three different ML models in the ED setting that used a combination of triage and patient history data. Raita, Goto¹² tested four ML models that all outperformed the benchmark set in the study. The models were built to predict outcomes related to critical care and hospitalisation, and improved clinicians’ decision-making process.¹² In a recent systematic review of clinical decision support systems (CDSS) in ED, Fernandes, Vieira⁹ reported that the use of CDSS improved decision-making by health professionals and led to better clinical management and patient outcomes. However, Fernandes, Vieira⁹ go on to state that more than half of the studies lacked clear implementation and performance measures. In contrast to the previous studies, this research will use a minimal dataset collected at the point of entry to the ED, where only triage diagnostics have occurred.
This study seeks to determine if basic presentation information can be used to predict the need for admission to an acute inpatient unit. However, recent research has demonstrated the reluctance of clinicians to trust ML due to the inability to understand the underlying mechanisms of the prediction.^15,16 This research will test six different ML models, including an interpretable model (the Explainable Boosting Machine (EBM) from InterpretML, a tree-based, cyclic gradient boosting Generalized Additive Model) and traditional models such as a Support Vector Machine using Support Vector Classification (SVC) with a Radial Basis Function (RBF) kernel.

Importance of trust and transparency

Trustworthiness in ML innovations is a priority for governments, researchers and clinicians; however, trust and confidence have been highlighted as significant barriers to the acceptance of ML within a clinical setting.¹⁵ The exclusive emphasis on demonstrating trust without ensuring reliability and validity during practice may lead to negative experiences for clinical users and hinder acceptance.¹⁶ It is essential that ML interventions be described in terms of competence, reliability and validity as would be expected of other clinical tools where the quality of care and the safety of a person are the highest priority.¹⁶ Clinicians should be presented with treatment recommendations that describe the validity and confidence of prediction, with the final decision for care always made by a clinician.¹⁶

Method

This study utilises a retrospective cohort analysis¹⁷ combined with predictive modelling using ML techniques¹⁸ and aims to assess the predictability of MH inpatient admissions using various ML models. The study was conducted in two public EDs within the Central Coast Local Health District of New South Wales, Australia. The population and sample included all ED presentations where a MH issue was recorded as the primary presenting problem. The primary data collection for this study involved reviewing existing records from January 1, 2016, to December 30, 2021 resulting in an initial dataset of 26,681 records. Missing records in this dataset were determined to be missing completely at random (MCAR), and the standard listwise deletion approach was applied.¹⁹ This method led to the removal of 515 rows and reduced the final dataset to (n = 26,166) records. A second round of data collection was conducted to determine if the patient's presentation was linked to current or past treatment with the service, which is referred to as “MH history” in the research. The study received ethical approval from the Hunter New England Human Research Ethics Committee, under the reference number 2022/ETH01597, on November 18, 2022. Additionally, site-specific approval was obtained from the Central Coast Local Health District on December 14, 2022. The requirement for informed consent to participate has been waived by the Hunter New England Human Research Ethics Committee, where it has been deemed that consent would be impossible or impracticable to obtain. 
The initial data investigation used SQL, Python, Excel, and Seaborn.⁴ Data cleaning, the handling of missing data, the data collection and extraction methods, and the exploratory analysis of the data are described in detail elsewhere.⁴ Pearson's correlation coefficient²⁰ and the R-squared value of a linear regression were used to determine the linear or non-linear nature of the data, with a linear/non-linear tolerance set at 0.01. The tolerance value represents the minimum threshold for the R-squared value of the linear regression model to be considered as indicating a linear relationship.²⁰ The R-squared value measures the proportion of the variance in the dependent variable explained by the independent variable(s). If the R-squared value is above the specified tolerance threshold, the code concludes that the relationship is linear; otherwise, it is considered non-linear. The test concluded that the dataset is predominantly non-linear in nature, as only a small number (11 of 120) of the correlations tested were classified as linear.
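The linearity check described above can be sketched as follows. This is a minimal illustration rather than the study's code: the function names, the toy columns, and the reading of the 0.01 tolerance as a cut-off on R-squared for each pair of variables are assumptions. For a linear regression with a single predictor, R-squared equals the square of Pearson's r, so the sketch squares the correlation instead of fitting a model.

```python
from itertools import combinations
from math import sqrt

TOLERANCE = 0.01  # R-squared cut-off quoted in the text


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def classify_pairs(columns, tolerance=TOLERANCE):
    """Label every pair of numeric columns as linear or non-linear."""
    labels = {}
    for a, b in combinations(columns, 2):
        r_squared = pearson_r(columns[a], columns[b]) ** 2
        labels[(a, b)] = "linear" if r_squared > tolerance else "non-linear"
    return labels


# Toy columns (hypothetical, not taken from the study data)
toy = {
    "age": [10, 20, 30, 40, 50, 60],
    "triage": [5, 4, 4, 3, 2, 1],  # falls steadily as age rises
    "month": [3, 9, 1, 1, 9, 3],   # no linear relationship with age
}
labels = classify_pairs(toy)
n_linear = sum(1 for v in labels.values() if v == "linear")
```

Counting the pairs labelled linear against the total number of pairs gives a ratio of the same form as the 11/120 reported in the text.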

This research will test six different types of ML models: Random Forest, XGBoost, CatBoost, k-Nearest Neighbours (kNN), Explainable Boosting Machine (EBM) via InterpretML, and Support Vector Machine using Support Vector Classification (SVC) with a Radial Basis Function (RBF) kernel. Testing several model types provides a comprehensive evaluation because of the inherent differences in their underlying methodologies, strengths, and limitations. This approach ensures that various aspects of the data and the problem are addressed, leading to a more robust and well-rounded understanding of overall model performance in the context of the research.

Random forest

Random Forest is an ensemble learning method primarily used for classification and regression tasks. Developed by Breiman,²¹ it constructs a multitude of decision trees during training and outputs the mode of the classes for classification or the mean prediction for regression. Each tree in the forest is built using a subset of the data and a subset of features, which helps to ensure that the trees are not highly correlated. The randomness introduced in the selection of data samples and features leads to model diversity, thereby improving the overall prediction performance and robustness against overfitting compared to a single decision tree.²² The algorithm works by splitting nodes in each decision tree using the best among a subset of predictors randomly chosen at that node. This random selection of features leads to a decrease in variance without increasing bias, resulting in a model that performs well on unseen data. The method's efficiency and accuracy in handling large datasets and its ability to model complex interactions make it a popular choice in various applications, including healthcare and finance.²³
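As a hedged illustration of the description above, the following sketch fits a random forest with scikit-learn on synthetic data. The data, the roughly 78/22 class weights, and all parameter values are assumptions chosen for demonstration and are not the study's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (assumed; not the study dataset)
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.78, 0.22], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees whose votes are combined
    max_features="sqrt",   # random subset of features tried at each split
    bootstrap=True,        # each tree is trained on a bootstrap sample of rows
    random_state=0,
)
forest.fit(X_train, y_train)
test_accuracy = forest.score(X_test, y_test)
```

The `bootstrap` and `max_features` arguments correspond to the two sources of randomness discussed above: sampling of rows and sampling of features.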

XGBoost

XGBoost, or eXtreme Gradient Boosting, is an efficient and scalable implementation of gradient boosting frameworks.²⁴ It is designed to be highly efficient, flexible, and portable. The model builds trees sequentially, where each new tree corrects the errors made by the previous ones. This iterative process aims to minimise a loss function by using a gradient descent algorithm. XGBoost incorporates a regularisation term in its objective function, which helps prevent overfitting, making it a robust choice for both classification and regression problems. It also supports parallel and distributed computing, which significantly speeds up the learning process, making it suitable for large datasets. Additionally, XGBoost provides functionalities for cross-validation and automatic handling of missing data, further enhancing its versatility and application in a wide range of ML tasks.²⁵
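The sequential, error-correcting idea can be sketched with plain gradient boosting for log-loss built from scikit-learn regression trees. This is not XGBoost itself: the sketch omits the regularisation term, second-order information, missing-value handling, and parallelism, and the function names, parameter values, and toy data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Fit shallow trees one after another, each to the current errors."""
    raw = np.zeros(len(y))  # raw scores (log-odds), starting at 0
    trees, losses = [], []
    for _ in range(n_rounds):
        # Negative gradient of the log-loss with respect to the raw score
        residual = y - sigmoid(raw)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        raw += learning_rate * tree.predict(X)  # small corrective step
        trees.append(tree)
        losses.append(log_loss(y, sigmoid(raw)))
    return trees, losses


# Toy non-linear target (hypothetical): positive when two features share a sign
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = ((X[:, 0] * X[:, 1]) > 0).astype(float)
trees, losses = fit_boosted_trees(X, y)
```

Each round fits a tree to the gap between the labels and the current predicted probabilities, so the training loss recorded in `losses` falls as trees are added.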

CatBoost

CatBoost is another gradient boosting framework that stands out for its ability to handle categorical features natively.²⁶ Traditional ML algorithms often require categorical variables to be converted into numerical values, which can lead to information loss. CatBoost, developed by Yandex, addresses this by using an ordered boosting process and incorporating category-specific transformations to capture the information contained in categorical features effectively.²⁶ The model uses an innovative method of calculating leaf values during the boosting process, which helps in reducing the prediction shift caused by biased leaf value estimation. CatBoost's ability to handle categorical data without preprocessing steps, along with its robust regularisation techniques, makes it a powerful tool for a variety of applications, including those with structured and unstructured data.²⁷
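One category-specific transformation described in the CatBoost literature is the ordered target statistic, in which each row's category is encoded using only the labels of rows that precede it. The sketch below is a simplified, assumption-laden illustration: it uses a single fixed row order rather than random permutations, and the function name, the `prior_weight` parameter, and the toy data are hypothetical.

```python
from collections import defaultdict


def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode each category using only the targets of earlier rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for category, target in zip(categories, targets):
        # The row's own label is excluded, which limits target leakage.
        value = (sums[category] + prior_weight * prior) / (
            counts[category] + prior_weight
        )
        encoded.append(value)
        sums[category] += target
        counts[category] += 1
    return encoded


# Toy data (hypothetical): mode of arrival and admission outcome
arrival = ["ambulance", "private", "ambulance", "police", "ambulance"]
admitted = [1, 0, 0, 1, 1]
prior = sum(admitted) / len(admitted)  # overall admission rate, 0.6
encoded = ordered_target_encode(arrival, admitted, prior)
```

The first occurrence of any category falls back to the prior, and later occurrences move towards the admission rate observed so far for that category.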

k-Nearest neighbours (kNN)

The k-nearest neighbours (kNN) algorithm is a simple, yet effective, instance-based learning method used for classification and regression. Proposed by Cover and Hart in 1967,²⁸ it is a non-parametric method that predicts the output for a query instance based on the majority class (for classification) or average outcome (for regression) of its k nearest neighbours in the feature space. One of the primary advantages of kNN is its simplicity and ease of implementation. However, it can be computationally expensive for large datasets because it requires calculating distances between the query instance and all other points in the dataset. Moreover, kNN is sensitive to the local structure of the data and the choice of k, where a small k can lead to overfitting, while a large k may result in underfitting. Despite these challenges, kNN is widely used in various fields, including pattern recognition, data mining, and image processing, due to its effectiveness in scenarios where the data distribution is unknown.²⁹
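The majority-vote rule described above can be sketched in a few lines of standard-library Python. The function name, the toy points, and the labels are assumptions for illustration only.

```python
from collections import Counter
from math import dist  # Euclidean distance


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Distances to every training point are computed, which is why kNN
    # becomes expensive on large datasets.
    order = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature data (hypothetical)
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["discharged"] * 3 + ["admitted"] * 3
prediction = knn_predict(points, labels, query=(3.9, 4.1), k=3)
```

Varying `k` in this sketch shows the trade-off noted above: `k=1` follows individual points closely, whereas a large `k` averages over most of the dataset.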

InterpretML

InterpretML³⁰ is a framework for creating interpretable models that can assist clinicians with making informed decisions about patient care using algorithms such as Explainable Boosting Machine (EBM), a tree-based, cyclic gradient boosting Generalized Additive Model (GAM). Traditional ML models, such as deep neural networks, can be very accurate but are often considered “black boxes” because it can be challenging to understand how they arrive at their predictions.³¹ This lack of transparency can be an obstacle in healthcare, where patients have the right to understand decision-making.³² InterpretML allows visualising models and their predictions via feature importance, partial dependence, and individual conditional expectation plots.³³ These plots can help identify the impact of each feature on the outcome and areas for improvement.³² For additional information pertaining to the use of InterpretML on this dataset, please refer to.³⁴

Support vector machines for classification

Support Vector Machines (SVM) are ML algorithms that can be used for classification, regression and clustering.^35–37 In the case of two-class classification, SVMs’ fundamental goal is to establish a hyperplane that maximises the margin of separation between the two classes, taking noise and outliers into account. By using kernel functions, SVMs can produce non-linear decision boundaries and, with a suitable kernel, can also handle non-vectorial data. Because their training is based on constrained optimisation, they can also be applied in cases where little data is available. This makes them a versatile tool for analysing the often relatively small but complex datasets that can be encountered in healthcare research. This research will use an SVM for classification using the SVC with a Radial Basis Function (RBF) kernel.
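A minimal sketch of an SVC with an RBF kernel, using scikit-learn on synthetic two-class data, is given below. The dataset, the scaling step, and the values of `C` and `gamma` are assumptions for illustration and do not reflect the parameters used in the study. A linear-kernel model is fitted alongside it for comparison.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaved half-circles: not separable by a straight line
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The RBF kernel allows a curved decision boundary
rbf_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
linear_model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

rbf_model.fit(X_train, y_train)
linear_model.fit(X_train, y_train)

rbf_accuracy = rbf_model.score(X_test, y_test)
linear_accuracy = linear_model.score(X_test, y_test)
```

Comparing `rbf_accuracy` with `linear_accuracy` on data of this shape gives a sense of what the kernel contributes when the class boundary is curved.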

Data description

The data comprised ED presentations related to MH and suicidal behaviour between January 1, 2016, and December 30, 2021. The ED data were selected based on specific terms in the presenting problem, such as ‘%suici%’, ‘%MH%’, ‘%self harm%’, ‘%psych%’, or ‘%Mental%’. Additionally, data from electronic medical records were extracted to link the ED presentations with MH service data, indicating whether the individual received current or previous care. The dataset contained the following features: age, active/previous service user, presenting problem, facility identifier, First Nations status, source of referral, marital status, mode of arrival, gender, triage category, time of the day, day of the week, month, referred to on departure (including admission) and mode of separation.

The extracted dataset contained 26,681 records, with a relatively small proportion of missing data for various variables, such as ED source of referral, mode of separation, marital status, mode of arrival, and referred to on departure. Although small amounts of missing data can impact study results, the missing data in this case were considered missing completely at random (MCAR). The most straightforward approach, listwise or case deletion, was employed, resulting in the removal of entire missing records from the dataset. This approach reduced the final record count to 26,166, with a loss of 515 records (1.93%).³⁴ The final data set contained 26,166 records of people who presented to ED with MH as their primary presenting problem; however, there was a significant class imbalance in the target of admission (Yes, n = 5774, No, n = 20392). The baseline characteristics for these presentations are described in Table 1.

Table 1.

Description of data set based on admission.

Admitted	NO	%	YES	%	Total	%
n	20392	77.93%	5774	22.07%	26166
Age
0–11	391	1.92%	41	0.71%	432	1.65%
12–17	3646	17.88%	397	6.88%	4043	15.45%
18–24	4243	20.81%	851	14.74%	5094	19.47%
25–34	4158	20.39%	1153	19.97%	5311	20.30%
35–44	3456	16.95%	1177	20.38%	4633	17.71%
45–54	2401	11.77%	989	17.13%	3390	12.96%
55–64	1278	6.27%	490	8.49%	1768	6.76%
65–74	495	2.43%	335	5.80%	830	3.17%
75–84	251	1.23%	203	3.52%	454	1.74%
over 85	73	0.36%	138	2.39%	211	0.81%
Presenting Problem
Altered Mental Status	1013	4.97%	497	8.61%	1510	5.77%
Anxious	1607	7.88%	157	2.72%	1764	6.74%
Behavioural Disturbance	2125	10.42%	1041	18.03%	3166	12.10%
Depression	1509	7.40%	262	4.54%	1771	6.77%
Eating Disorder	18	0.09%	10	0.17%	28	0.11%
Hallucinations	579	2.84%	333	5.77%	912	3.49%
Memory Impairment	18	0.09%	10	0.17%	28	0.11%
Mental Health Problem	3076	15.08%	845	14.63%	3921	14.99%
Self Harm	2563	12.57%	509	8.82%	3072	11.74%
Suicidal Ideation	7884	38.66%	2110	36.54%	9994	38.19%
Mode of Arrival
Air Ambulance Service	2	0.01%	2	0.03%	4	0.02%
Community/Public Transport	31	0.15%	6	0.10%	37	0.14%
Internal Ambulance / Transport	4	0.02%	6	0.10%	10	0.04%
Internal Bed / Wheelchair	4	0.02%	4	0.07%	8	0.03%
No transport (walked in)	168	0.82%	22	0.38%	190	0.73%
Other eg Undertakers/Contractors	39	0.19%	21	0.36%	60	0.23%
Police/Correctional Services Vehicle	1425	6.99%	482	8.35%	1907	7.29%
Private vehicle	9689	47.51%	2221	38.47%	11910	45.52%
Retrieval (including NETS)	22	0.11%	6	0.10%	28	0.11%
State Ambulance Vehicle	9008	44.17%	3004	52.03%	12012	45.91%
Gender
Female	10391	50.96%	2857	49.48%	13248	50.63%
Indeterminate	1	0.00%	1	0.02%	2	0.01%
Male	9996	49.02%	2916	50.50%	12912	49.35%
Unknown	4	0.02%	0	0.00%	4	0.02%
Triage Category
Resuscitation (1)	214	1.05%	118	2.04%	332	1.27%
Emergency (2)	3952	19.38%	1767	30.60%	5719	21.86%
Urgent (3)	8385	41.12%	2543	44.04%	10928	41.76%
Semi Urgent (4)	7396	36.27%	1311	22.71%	8707	33.28%
Non Urgent (5)	445	2.18%	35	0.61%	480	1.83%
Mode of Separation
Admitted Left at own risk	25	0.12%	0	0.00%	25	0.10%
Admitted Transferred to another hospital	16	0.08%	0	0.00%	16	0.06%
Admitted and discharged as inpatient within ED	0	0.00%	5774	100.00%	5774	22.07%
Admitted to ward / inpatient unit	198	0.97%	0	0.00%	198	0.76%
Admitted: To Critical Care Ward (including ICU1/ICU2/CCU/COU/NICU)	98	0.48%	0	0.00%	98	0.37%
Admitted: Via Operating Suite	8	0.04%	0	0.00%	8	0.03%
Departed: Did not wait	1754	8.60%	0	0.00%	1754	6.70%
Departed: Left at own risk	139	0.68%	0	0.00%	139	0.53%
Departed: Transferred to another hospital w/out 1st being admitted to hospital transferred from	915	4.49%	0	0.00%	915	3.50%
Departed: Treatment completed	417	2.04%	0	0.00%	417	1.59%
Departed: for other Clinical Service Location	16822	82.49%	0	0.00%	16822	64.29%

Data transformations for modelling

The following steps were taken to prepare the data set for ML modelling. Logarithmic transformation was applied to the data to calculate the natural logarithm of each element, allowing for a compressed range of values.³⁸ The logarithmic transformation process is used to address issues related to skewed data distributions and variations in magnitude and is required when handling variables with varying dynamic ranges.³⁸ The pandas function pd.get_dummies() was applied to the pandas DataFrame³⁹ to perform one-hot encoding on categorical variables. This transformation creates binary columns indicating the presence or absence of specific categories within the original dataset. Notably, the parameter drop_first = True was specified, which excludes one dummy column for each categorical variable to mitigate multicollinearity concerns. Scaling was performed using StandardScaler from scikit-learn to standardise the columns listed in to_scale within the pandas DataFrame. The fit_transform method was applied, wherein the scaler computes the mean and standard deviation necessary for standardisation and then transforms the selected columns. This process ensures that the numerical features in the specified columns have a standardised distribution with zero mean and unit variance.³⁹ A test/train split was executed using the train_test_split function from scikit-learn, with a specified test size of 20%. This division allocates 80% of the data for training the model and reserves the remaining 20% for evaluating the model's performance on unseen data.
The choice of a 20% test size is a common practice in model evaluation.³⁹ This split allows for robust training and assessment of the model's generalisation capabilities, providing insight into its performance on new, previously unseen observations.⁴⁰ To address the class imbalance in the dataset, the RandomOverSampler from the imbalanced-learn library was employed with the parameter sampling_strategy='minority'. Oversampling was applied after the test/train split so that no oversampled data are present in the test dataset, which would not be appropriate for testing the model; the final testing data are therefore left unchanged. This oversampling technique focuses on the minority class, randomly duplicating existing minority-class samples to balance the class distribution.⁴¹ The fit_resample method is then applied to the feature matrix and target variable of the training data to carry out the oversampling. Implementing random oversampling on the minority class mitigates the impact of class imbalance, reducing model bias towards the majority class with the aim of enhancing overall predictive performance.⁴¹
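The preparation steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the study's pipeline: the column names and synthetic values are hypothetical, the scaler is fitted on the training portion only (which may differ from the order used in the study), and scikit-learn's `resample` stands in for imbalanced-learn's `RandomOverSampler` so that the sketch needs only pandas, NumPy, and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in for the triage data (hypothetical columns and values)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(12, 90, size=n),
    "triage_category": rng.choice(["2", "3", "4"], size=n),
    "mode_of_arrival": rng.choice(["ambulance", "private", "police"], size=n),
    "admitted": rng.choice([0, 1], size=n, p=[0.78, 0.22]),
})

df["age"] = np.log(df["age"])  # natural-log transform of a numeric column
X = pd.get_dummies(df.drop(columns="admitted"), drop_first=True)  # one-hot
y = df["admitted"]

# 80/20 split made before any oversampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_test = X_train.copy(), X_test.copy()

to_scale = ["age"]
scaler = StandardScaler()
X_train[to_scale] = scaler.fit_transform(X_train[to_scale])
X_test[to_scale] = scaler.transform(X_test[to_scale])

# Random oversampling of the minority class, training data only
train = X_train.assign(admitted=y_train)
majority = train[train["admitted"] == 0]
minority = train[train["admitted"] == 1]
extra = resample(
    minority, replace=True,
    n_samples=len(majority) - len(minority), random_state=0,
)
balanced = pd.concat([majority, minority, extra])
X_balanced = balanced.drop(columns="admitted")
y_balanced = balanced["admitted"]
```

Because the duplicated rows are drawn only from `X_train`, the test set keeps its original class balance and contains no copies of training rows.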

Machine learning models

Given the earlier indication that the data are largely non-linear, it was decided to train both linear and non-linear models on the data and compare their performance. Models were built using InterpretML and SVC, and GridSearchCV was used to establish parameters. The resulting models were evaluated using F1, recall (sensitivity), precision, a confusion matrix, receiver operating characteristic (ROC), negative predictive value (NPV), false positive rate, false discovery rate, and false negative rate. A confusion matrix is used in classification problems to evaluate the performance of a machine-learning model by summarising the outcomes of predictions³⁹ and represents the number of correct and incorrect predictions made by the ML model being tested. The matrix has four main components: True Positive (TP), instances where the model correctly predicts the positive class; True Negative (TN), instances where the model correctly predicts the negative class; False Positive (FP), instances where the model incorrectly predicts the positive class (Type I error); and False Negative (FN), instances where the model incorrectly predicts the negative class (Type II error).⁴² The confusion matrix is useful for understanding the model's performance across different classes by providing insights into the model's strengths and weaknesses. Performance metrics such as accuracy, precision, recall (sensitivity), specificity, and the F1 score can be derived from these values.³⁹ The ROC curve is a graph that displays the relationship between the recall (sensitivity) and the false positive rate of a model for different probability thresholds.⁴² Recall, also referred to as sensitivity or true positive rate (TPR), measures the proportion of actual positive instances correctly identified by the ML model. In contrast, the false positive rate (FPR) represents the proportion of negative instances incorrectly classified as positive.
The ROC curve is plotted by comparing the TPR against the FPR at different threshold values.⁴² A perfect model would have a ROC curve that reaches the top-left corner of the graph (TPR = 1, FPR = 0), while an ineffective model would produce a curve close to the diagonal line (45-degree angle).³⁹ AUC-ROC (Area Under the ROC Curve) is a numerical metric that summarises the model's overall performance in distinguishing between positive and negative classes. A higher AUC-ROC indicates better model performance.³⁹ ROC curves and AUC-ROC are commonly used to assess model performance, especially in situations such as health care where sensitivity and specificity are both important.
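The construction of the ROC curve and its AUC can be sketched with standard-library Python. The function names and the four-row toy example are assumptions for illustration; in practice a library routine would normally be used.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, from the strictest to the loosest threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy labels and predicted probabilities (hypothetical)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(y_true, scores)
area = auc(curve)  # 0.75 for this toy example
```

A curve that hugs the diagonal gives an area near 0.5, and a curve that reaches the top-left corner gives an area of 1.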

Working with healthcare data presents several challenges, such as siloing, segmentation, privacy ethics, and health policy.⁴³ Compared to other sectors, such as finance, there are limitations on the size of the data available for use in healthcare. The data used in this study present similar issues: they represent only a small portion of presentations to the ED and lack the depth of clinical narrative that is present in the wider system. This dataset's limited number of records poses a significant challenge for ML techniques, which typically perform better with larger datasets. However, this limitation is outweighed by the critical importance of accurate prediction in the MH ED context, where misclassifications can have severe consequences for patient well-being and safety. False negatives, where patients in need of admission are missed, can result in delayed treatment, worsening of conditions, or even life-threatening situations. False positives, where patients are incorrectly identified as requiring inpatient admission, can lead to unnecessary hospitalisations, increased healthcare costs, and potential emotional distress for patients and their families. Clinicians and developers must carefully balance these risks and consider the potential consequences of each type of misclassification. In practice, clinicians are trained to prioritise caution over taking risks, as the consequences of missing a patient in need of intensive treatment can be far more severe than the inconvenience and costs associated with an unnecessary admission. In this context, the use of specificity as an evaluation metric becomes more relevant than precision.
While precision measures the proportion of correct positive predictions, specificity measures the proportion of negative instances that are correctly identified.⁴² In the MH ED setting, where false negatives are particularly concerning, specificity provides a more direct assessment of the model's ability to identify patients who do not require inpatient admission accurately. Specificity measures the proportion of true negatives that are correctly identified by the model, demonstrating the ability of the model to accurately identify patients who do not require inpatient admission. A high specificity value is indicative that the model is effective at correctly identifying and filtering out negative cases (those who can be safely discharged or treated in the community). The ROC curve offers a powerful tool for clinicians to evaluate and select appropriate classification models for the MH ED context. By plotting the true positive rate (sensitivity) against the false positive rate (1 - specificity) for classification thresholds, the ROC curve allows clinicians and developers to visualise the model's ability to correctly identify positive cases (patients requiring admission) and its tendency to generate false positives. By carefully considering the ROC curve and the associated trade-offs, clinicians and developers can select a classifier that balances the risks of false positives and false negatives in a manner that aligns with their clinical objectives and the specific context of the MH ED.

Results

The performance of ML models in healthcare is critical, as these models assist in clinical decision-making by predicting patient outcomes. In evaluating the six ML models (EBM InterpretML, Random Forest, XGBoost, CatBoost, k-Nearest Neighbours (kNN), and Support Vector Machine using Support Vector Classification (SVC) with a Radial Basis Function (RBF) kernel), a variety of metrics were utilised, including accuracy, AUC-ROC, precision, recall, F1 score, specificity, NPV, false positive rate, false discovery rate, false negative rate and Matthews Correlation Coefficient (MCC). The overall results can be found in Table 2, with the overall classifier accuracy shown in Figure 1 and the classifier ROC comparison in Figure 2. The following section breaks down the performance of each classifier.

Figure 1.

Classifier accuracy.

Figure 2.

Classifier ROC.

Table 2.

Model performance.

Metric	EBM InterpretML	Random Forest	XGBoost	CatBoost	kNN	SVC
Accuracy	0.7883	0.7776	0.7814	0.7900	0.7553	0.7780
ROC/AUC	0.7167	0.6820	0.6986	0.7129	0.5950	0.5960
TN	4001	3937	3905	3966	3773	4068
FP	78	142	174	113	306	11
FN	1030	1022	970	1004	975	1151
TP	125	133	185	151	180	4
Recall (Sensitivity)	0.1082	0.1152	0.1602	0.1307	0.1558	0.0035
Precision	0.6158	0.4836	0.5153	0.5720	0.3704	0.2667
F1	0.1841	0.1860	0.2444	0.2128	0.2194	0.0068
Specificity	0.9809	0.9652	0.9573	0.9723	0.9250	0.9973
NPV	0.7953	0.7939	0.8010	0.7980	0.7947	0.7795
False Positive Rate	0.0191	0.0348	0.0427	0.0277	0.0750	0.0027
False Discovery Rate	0.3842	0.5164	0.4847	0.4280	0.6296	0.7333
False Negative Rate	0.8918	0.8848	0.8398	0.8693	0.8442	0.9965
MCC	0.1914	0.1453	0.1928	0.1952	0.1154	0.0059
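The derived rows of Table 2 follow from the four confusion-matrix counts. As a hedged check, the sketch below recomputes several of them for the EBM InterpretML column using the standard formulas; the function name is an assumption, and the counts are those reported in the table.

```python
from math import sqrt


def summarise(tn, fp, fn, tp):
    """Derive common classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
        "false_negative_rate": fn / (fn + tp),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Counts from the EBM InterpretML column of Table 2
ebm = summarise(tn=4001, fp=78, fn=1030, tp=125)
```

Rounded to four decimal places, these values agree with the corresponding EBM InterpretML entries in Table 2.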

EBM InterpretML achieves an accuracy of 0.7883, MCC of 0.1914 and a ROC/AUC of 0.7167 (Table 2, Figures 3 and 4), indicating moderate discriminative power. The model shows a recall of 0.1082 and a precision of 0.6158, reflecting a relatively low sensitivity but moderate precision. The F1 score of 0.1841 indicates a need for improvement in balancing recall and precision. The model's specificity of 0.9809 suggests excellent performance in identifying true negatives, while the false positive rate is low at 0.0191. However, the false discovery rate is 0.3842, and the false negative rate is high at 0.8918, pointing to challenges in correctly identifying positive cases.

Figure 3. ROC of EBM InterpretML.

Figure 4. Confusion matrix of EBM InterpretML.

Random Forest presents a slightly lower accuracy of 0.7776, MCC of 0.1453 and a ROC/AUC of 0.6820 (Table 2, Figures 5 and 6). It achieves a recall of 0.1152 and a precision of 0.4836, resulting in an F1 score of 0.1860. The specificity is 0.9652, demonstrating good capability in correctly identifying true negatives, although the false positive rate is slightly higher at 0.0348. The false discovery rate of 0.5164 and the false negative rate of 0.8848 indicate a significant number of misclassified positive cases.

Figure 5. ROC of Random Forest.

Figure 6. Confusion matrix of Random Forest.

XGBoost shows an accuracy of 0.7814, MCC of 0.1928 and a ROC/AUC of 0.6986 (Table 2, Figures 7 and 8). The model's recall is 0.1602, and its precision is 0.5153, leading to an F1 score of 0.2444. Its specificity is 0.9573, with a false positive rate of 0.0427. The false discovery rate is 0.4847, while the false negative rate is 0.8398, reflecting moderate challenges in detecting true positive cases.

Figure 7. ROC of XGBoost.

Figure 8. Confusion matrix of XGBoost.

CatBoost stands out with the highest accuracy (0.7900) and the highest MCC (0.1952), together with a ROC/AUC of 0.7129 (Table 2, Figures 9 and 10), second only to EBM InterpretML, indicating strong overall performance. Its recall is 0.1307, and precision is 0.5720, resulting in an F1 score of 0.2128. The specificity of 0.9723 and the false positive rate of 0.0277 are commendable, showing a low rate of false positives. However, the false discovery rate is 0.4280, and the false negative rate is 0.8693, indicating room for improvement in positive case detection.

Figure 9. ROC of CatBoost.

Figure 10. Confusion matrix of CatBoost.

k-Nearest Neighbours (kNN) has an accuracy of 0.7553, MCC of 0.1154 and a ROC/AUC of 0.5950 (Table 2, Figures 11 and 12); its accuracy and ROC/AUC are the lowest among the models evaluated. The recall is 0.1558, and precision is 0.3704, resulting in an F1 score of 0.2194. The specificity is 0.9250, with a higher false positive rate of 0.0750. The model also exhibits a high false discovery rate of 0.6296 and a false negative rate of 0.8442, indicating substantial issues in identifying positive cases.

Figure 11. ROC of kNN.

Figure 12. Confusion matrix of kNN.

Finally, the SVC model, with an accuracy of 0.7780, MCC of 0.0059 and a ROC/AUC of 0.5960 (Table 2, Figures 13 and 14), presents unique characteristics. It achieves a recall of only 0.0035, indicating very low sensitivity, and a precision of 0.2667, resulting in a minimal F1 score of 0.0068. However, the specificity is extremely high at 0.9973, and the false positive rate is very low at 0.0027, suggesting the model is highly conservative in identifying positives. The false discovery rate is 0.7333, and the false negative rate is 0.9965, reflecting severe challenges in positive case detection.

Figure 13. ROC of SVC.

Figure 14. Confusion matrix of SVC.

Discussion

CatBoost emerges as the model with the highest accuracy and MCC, although improvements are needed in positive case detection. Models like EBM InterpretML and Random Forest demonstrate good specificity but struggle with recall and false negatives, which are critical in healthcare applications. Unlike Random Forest, however, EBM InterpretML can provide further insight into its recommendations, making it suitable for healthcare applications where transparency is as important as the prediction itself.

The comparison between the EBM InterpretML, CatBoost, and Random Forest models illustrates the trade-offs between interpretability and predictive performance in healthcare applications. EBM InterpretML exhibits a moderate level of accuracy at 0.7883, with 125 true positives against 78 false positives (Table 2), indicating that its positive predictions are more often right than wrong. Its primary advantage lies in its interpretability, making it a suitable choice for applications where understanding the decision-making process is crucial. However, EBM InterpretML's recall (sensitivity) of 0.1082 is lower than that of the CatBoost model, suggesting a tendency to miss more positive cases. This is an important consideration in the context of care delivery, where failing to identify patients in need of intervention could have significant consequences. The model's precision is 0.6158, and it achieves a specificity of 0.9809, indicating excellent performance in identifying true negatives.

The CatBoost model demonstrates a higher ability than EBM InterpretML to correctly identify positive cases, such as patients requiring admission to a mental health inpatient ward. It achieves the highest accuracy among the evaluated models at 0.7900, indicating strong overall performance. CatBoost's recall of 0.1307, compared with 0.1082 for EBM InterpretML, suggests it is more effective at capturing true positive cases, reducing the likelihood of missing patients who require critical care. This makes CatBoost a compelling choice for scenarios where the priority is to maximise the detection of positive cases, even if it means sacrificing some interpretability. CatBoost's precision is 0.5720, with a specificity of 0.9723, reflecting its capability to avoid false positives while still identifying true positives.

In contrast, the Random Forest model presents a balanced approach with its accuracy of 0.7776 and recall of 0.1152. While it does not achieve the highest sensitivity, it performs reasonably well across various metrics, offering a compromise between interpretability and predictive power. Random Forest's recall is slightly higher than that of EBM InterpretML, suggesting a better ability to capture positive cases, although it still lags behind CatBoost in this regard. The model's precision is 0.4836, and its specificity is 0.9652, indicating good performance in identifying true negatives, which is crucial for minimising unnecessary interventions.

The choice between EBM InterpretML, CatBoost, and Random Forest should be informed by specific clinical priorities. If interpretability is crucial, EBM InterpretML is advantageous; if the goal is to maximise the identification of patients requiring admission, CatBoost offers superior performance in capturing true positive cases. With its balanced approach, the Random Forest model provides a viable middle ground for situations where both interpretability and predictive accuracy are important considerations. The choice of model should be guided by the specific clinical context and by the relative importance of minimising false negatives or false positives, depending on the consequences of each error type. The results of this study demonstrate the potential of ML models to predict admissions to a MH inpatient ward based on routine data collected at the triage stage in the ED. This finding aligns with previous research¹⁰,¹² that demonstrates the potential of ML models to enhance decision-making processes and improve patient outcomes in the ED setting.
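One way to make the dependence on error consequences concrete is to weight the false positive and false negative counts from Table 2 by a relative cost and compare the totals. The sketch below is only an illustration: the function names and the cost weights are hypothetical assumptions introduced here, not values proposed by the study, and real weights would need to come from clinical judgement.

```python
# (false positives, false negatives) per model, copied from Table 2.
ERRORS = {
    "EBM InterpretML": (78, 1030),
    "Random Forest": (142, 1022),
    "XGBoost": (174, 970),
    "CatBoost": (113, 1004),
    "kNN": (306, 975),
    "SVC": (11, 1151),
}


def weighted_error_cost(fp: int, fn: int, fp_cost: float, fn_cost: float) -> float:
    """Total cost when each error type carries its own (hypothetical) weight."""
    return fp * fp_cost + fn * fn_cost


def rank_models(fp_cost: float, fn_cost: float) -> list[tuple[str, float]]:
    """Return (model, cost) pairs sorted from lowest to highest weighted cost."""
    costs = {
        name: weighted_error_cost(fp, fn, fp_cost, fn_cost)
        for name, (fp, fn) in ERRORS.items()
    }
    return sorted(costs.items(), key=lambda item: item[1])


# Hypothetical weightings: errors treated equally, then a missed admission
# treated as five times as costly as an unnecessary flag.
equal_weights = rank_models(fp_cost=1, fn_cost=1)
fn_heavy_weights = rank_models(fp_cost=1, fn_cost=5)

for label, ranking in (("equal", equal_weights), ("FN x5", fn_heavy_weights)):
    print(label, ranking)
```

The ordering changes as the weights change, which is why the cost attached to each error type should be agreed with clinicians before a model is selected.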

While some of the models exhibit positive predictive performance, ensuring transparency and interpretability is crucial for gaining trust and acceptance among clinicians and patients.¹⁵ This is where the EBM InterpretML model offers a strong overall balance, making it suitable for applications such as healthcare, offering clinicians deeper insight into the data and how recommendations are made. InterpretML facilitates the interpretation of ML models by identifying complex interactions through methods such as the overall importance graph.³³ This graph uses the mean absolute score (MAS) to assess each feature's impact on model predictions.⁴⁴ MAS is calculated by permuting the values of each feature in the dataset and comparing the model's predictions before and after permutation. Features with higher MAS are deemed more important to the model's predictions. The overall importance graph ranks features by their MAS, allowing for quick identification of those with the most significant impact. Additionally, InterpretML uses log odds to show visually how each feature relates to the outcome. Log odds are the natural logarithm of the odds of an event, that is, of the ratio between the probability that the event occurs and the probability that it does not.⁴⁵ The log odds ratio (log OR) compares the odds of an event occurring between two groups, such as treatment and control. By integrating log OR, the model can assess the impact of each predictor, compare outcomes across different groups, and provide a unified probability scale.
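For reference, the standard definitions behind these terms can be written as follows. The symbols are assumptions introduced here for illustration: p is the probability of the event (for example, admission), and p_1 and p_0 are the probabilities of the event in the two groups being compared. The `\operatorname` command assumes the amsmath package.

```latex
\[
\operatorname{logit}(p) = \ln\frac{p}{1-p},
\qquad
p = \frac{1}{1 + e^{-\operatorname{logit}(p)}}
\]
\[
\ln(\mathrm{OR}) = \ln\frac{p_1/(1-p_1)}{p_0/(1-p_0)}
= \operatorname{logit}(p_1) - \operatorname{logit}(p_0)
\]
```

A log odds of zero corresponds to a probability of 0.5, positive values push the prediction towards the event, and negative values push it away.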

Figures 15 and 16 demonstrate the capabilities of the InterpretML package. Figure 15 shows the global importance of the model's features, that is, their relative predictive power and influence on the model. Figure 16 shows one prediction from the unseen test dataset. In this case, the most heavily weighted features that the model used to make a recommendation were the individual's need for admission based on their triage category, the fact that they are an active client of the service, their age, the facility they presented to, their marital status (divorced), and their presenting problem of suicidal ideation. This output (Figure 16) enables the clinician to examine the recommendation (in this case, admit) to see if there are confounding factors present that would require further assessment and time with the individual to assess their needs. Based on this information, it may be possible to enhance clinical practice by offering quicker assessments or by providing less experienced clinicians with the necessary tools to guide their practice in identifying important presentation factors that require further exploration. Visualisations such as Figure 16 can assist clinicians in understanding the underlying factors contributing to the model's predictions, fostering trust and enabling more informed decision-making processes. This transparency is crucial for any ML innovation to gain acceptance among healthcare professionals.¹⁵,¹⁶ Further detail regarding InterpretML and its use in this data asset can be found in Higgins, Chalup and Wilson.³⁴

Figure 15. EBM overall importance.

Figure 16. EBM individual prediction.
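To show how an individual explanation of the kind described for Figure 16 can be read, the sketch below assumes an additive model in which each feature contributes a score on the log-odds scale and the scores are summed with an intercept before being converted to a probability. The feature names follow the text, but every numeric value is hypothetical and chosen purely for illustration; none is taken from the study's model or from Figure 16, and the function name is an assumption.

```python
from math import exp


def probability_from_contributions(intercept: float, contributions: dict[str, float]) -> float:
    """Sum additive log-odds contributions and map the total to a probability."""
    log_odds = intercept + sum(contributions.values())
    return 1.0 / (1.0 + exp(-log_odds))  # logistic (inverse logit) function


# Hypothetical intercept and per-feature log-odds contributions (illustration only).
intercept = -1.3
contributions = {
    "triage category": 0.9,
    "active client of the service": 0.6,
    "presenting problem (suicidal ideation)": 0.35,
    "age": 0.2,
    "facility": 0.15,
    "marital status (divorced)": 0.1,
}

probability = probability_from_contributions(intercept, contributions)

# Rank features by the size of their contribution, as a local explanation plot would.
ranked = sorted(contributions, key=lambda name: abs(contributions[name]), reverse=True)

print(f"Predicted probability of admission: {probability:.3f}")
for name in ranked:
    print(f"{name}: {contributions[name]:+.2f}")
```

Because the contributions simply add up, a clinician can see which factors moved the recommendation and by how much, which is the property the text relies on when it describes checking a recommendation for confounding factors.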

Limitations

One limitation of this study is its sole reliance on the initial data collected during the patient's presentation to triage in the ED, which may not be representative of all MH presentations due to issues such as triage coding or alternative admission pathways. Furthermore, while this data provides valuable insights into the patient's condition at the time of triage, it may not capture the complete picture of the patient's overall health status, medical history, and various psychosocial factors that could influence the need for admission to a MH inpatient ward. By limiting the data sources to the initial triage information, this research may overlook crucial details that could enhance the predictive capabilities of the ML models. Integrating additional data sources, such as electronic medical records (EMRs), patient-reported outcomes, and social determinants of health, could potentially provide a more comprehensive and holistic assessment of the patient's needs. EMRs contain a wealth of information, including past medical histories, diagnoses, treatments, and laboratory results, which could offer valuable context for the ML models to consider.¹⁰ Furthermore, incorporating patient-reported outcomes, such as questionnaires or self-assessments, could provide insights into the patient's subjective experiences, symptoms, and overall well-being, which may not be fully captured in the initial triage data. It is important to note that the data used in the development of these models covered January 1, 2016, to December 30, 2021. This period included the COVID-19 pandemic, with two major lockdowns⁴⁶ for the region in question, and is noted as a period of increased MH stress for Australian people.⁴⁷ This must be taken into consideration and may limit the generalisability and applicability of the study's findings to diverse patient populations and healthcare settings.

Finally, it must be noted that models should ideally be tested on an external dataset to assess generalisability. This is a broader issue in the development of ML models, especially in healthcare, as the absence of open data and code in many studies makes it difficult to reproduce and validate findings.⁴⁸,⁴⁹ This lack of transparency also limits the ability to compare results across studies using varying data labels and evaluation metrics.⁴⁸

Recommendations

R1 Further development and testing of the InterpretML model in a CDSS to aid triage decision-making processes in the ED is required before any system is deployed using the methods discussed. However, the model's predictive performance suggests that it can assist clinicians in identifying patients who require admission to a MH inpatient ward, potentially reducing wait times and improving the overall quality of care. This approach can enhance trust and acceptance among healthcare professionals, addressing the concerns¹⁵,¹⁶ raised regarding the barriers to ML adoption in clinical applications.

R2 Further research is needed to validate the performance of the CatBoost, Random Forest and InterpretML models across multiple healthcare settings and diverse patient populations. This should include continuous monitoring and refinement of the models, which are essential to ensure their reliability and validity in real-world clinical contexts.¹⁶ It should also include the development of a comprehensive implementation plan that addresses ethical considerations, data privacy, and potential biases. To ensure a transparent and inclusive process, stakeholders, such as clinicians, patients, and policymakers, must be involved throughout the development cycle.¹⁵

R3 Future research could explore the integration of multiple data sources, including EMRs, patient-reported outcomes, and social determinants of health, into the ML modelling process. By leveraging a more comprehensive dataset, the ML models may better capture the complexities of MH conditions and provide more informed and tailored predictions for admission to MH inpatient wards. Furthermore, real-time data collection and monitoring would enable the continuous evaluation and refinement of the ML models’ performance. By tracking the accuracy of predictions against actual outcomes, researchers and clinicians can identify areas for improvement and make necessary adjustments to the models, ensuring their long-term reliability and validity.

R4 Further research is required to understand the barriers clinicians face in integrating AI into practice and to explore their concerns about the interpretability and transparency of ML models. Tools such as the General Attitudes towards Artificial Intelligence Scale (GAAIS) survey⁵⁰ may help to better understand the apprehensions or negative attitudes towards AI among MH clinicians and help guide the development of targeted educational and training initiatives to address these concerns. This would address the clinician's need to understand the model's decision-making process, thereby increasing their trust in and acceptance of the ML technology.

R5 Real-time data collection and monitoring are crucial for implementing and continuously improving ML models. The models can be regularly updated and refined by capturing patient data as it becomes available, ensuring they reflect the most current trends and patterns in MH presentations and outcomes. It is recommended that a robust data collection and monitoring system that enables real-time capture of patient data in the ED and MH inpatient settings be developed. This system should integrate seamlessly with the existing EMR infrastructure, facilitating efficient and accurate data extraction for the ML models.
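The tracking of predictions against actual outcomes described in R3 and R5 could be prototyped along the following lines. This is a minimal sketch under stated assumptions, not a design for a deployed system: the class name `RollingMccMonitor`, the window size and the alert threshold are all hypothetical, and MCC is used only because it is the headline metric in this study.

```python
from collections import deque
from math import sqrt


class RollingMccMonitor:
    """Track MCC over the most recent `window` predictions (hypothetical helper)."""

    def __init__(self, window: int, alert_below: float) -> None:
        # Each entry is a (predicted_admit, was_admitted) pair; old pairs drop off.
        self.pairs = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, predicted_admit: bool, was_admitted: bool) -> None:
        """Store one prediction once the actual outcome is known."""
        self.pairs.append((predicted_admit, was_admitted))

    def mcc(self) -> float:
        tp = sum(1 for p, a in self.pairs if p and a)
        tn = sum(1 for p, a in self.pairs if not p and not a)
        fp = sum(1 for p, a in self.pairs if p and not a)
        fn = sum(1 for p, a in self.pairs if not p and a)
        denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        # MCC is undefined when a row or column of the matrix is empty; report 0.0.
        return (tp * tn - fp * fn) / denominator if denominator else 0.0

    def needs_review(self) -> bool:
        """Flag the model once the window is full and MCC falls below the threshold."""
        return len(self.pairs) == self.pairs.maxlen and self.mcc() < self.alert_below


# Toy usage with a tiny window; a real window would span many presentations.
monitor = RollingMccMonitor(window=4, alert_below=0.19)
for predicted, actual in [(True, True), (False, False), (True, False), (False, True)]:
    monitor.record(predicted, actual)

print(monitor.mcc(), monitor.needs_review())
```

A rolling window keeps the check sensitive to recent shifts in presentations, which matters for data collected across periods such as the COVID-19 lockdowns noted in the limitations.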

Conclusion

This research has demonstrated the potential of Machine Learning in predicting admissions to a MH inpatient ward based on routine data collected at the triage stage in the ED. The positive predictive performance of the models, combined with the interpretability provided by the InterpretML framework, offers a promising approach to enhancing decision-making processes and optimising resource allocation in MH care. However, the successful implementation of these ML models in clinical practice requires a multifaceted approach that addresses concerns about trust and transparency, ethical considerations, and continuous validation and refinement. By collaborating with subject matter experts and incorporating feedback from clinicians and patients, these ML innovations can be effectively integrated into existing healthcare systems, ultimately improving patient outcomes and enhancing the quality of MH care delivery.

Footnotes

Acknowledgements

The authors would like to acknowledge the support of Central Coast Local Health District.

Authorship statement

OH – concept development, project design, data collection, data analysis, manuscript preparation.

RW – concept development, contribution to manuscript, supervision of project.

SC – data analysis, project design, data collection, manuscript contribution, supervision of project.

Consent to participate

The requirement for informed consent to participate has been waived by the Hunter New England Human Research Ethics Committee, where it has been deemed that consent would be impossible or impracticable to obtain.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethics approval

Ethical approval was granted by the Hunter New England Human Research Ethics Committee (2022/ETH01597). Central Coast Local Health District granted site-specific approval (2022/STE03296).

Funding

Partial financial support was received from NSW Ministry of Health as part of the Towards Zero Suicides initiative.

Guarantor statement

OH is the guarantor who has taken full responsibility for this article, including the accuracy and appropriateness of the reference list.

ORCID iDs

Oliver Higgins

Rhonda L. Wilson

References

1. World Health Organization. World mental health report: transforming mental health for all. Geneva: World Health Organization, 2022.

2. Australian Institute of Health and Welfare. Suicide & self-harm monitoring, https://www.aihw.gov.au/suicide-self-harm-monitoring/data/data-downloads (2023, accessed 11/08/2023).

3. Higgins, Sheather-Reid, Chalup, et al. Disproportionate mental health presentations to Emergency Departments in a coastal regional community in Australia of First Nation people. Int J Ment Health Nurs 2024: 1–8. DOI: https://doi.org/10.1111/inm.13362

4. Higgins, Sheather-Reid, Chalup, et al. Sociodemographic factors and presentation features of individuals seeking MH care in EDs: a retrospective cohort study. Int J Ment Health Nurs 2024: 1–12. DOI: https://doi.org/10.1111/inm.13414

5. Clarke, Dusome, Hughes. Emergency department from the mental health client’s perspective. Int J Ment Health Nurs 2007; 16: 126–131.

6. Fatovich, Hirsch. Entry overload, emergency department overcrowding, and ambulance bypass. Emerg Med J 2003; 20: 406–409.

7. Morphet, Innes, Munro, et al. Managing people with mental health presentations in emergency departments—A service exploration of the issues surrounding responsiveness from a mental health care consumer and carer perspective. Australas Emerg Nurs J 2012; 15: 148–155.

8. Pearlmutter, Dwyer, Burke, et al. Analysis of emergency department length of stay for mental health patients at ten Massachusetts emergency departments. Ann Emerg Med 2017; 70: 193–202.e116.

9. Fernandes, Vieira, Leite, et al. Clinical decision support systems for triage in the emergency department using intelligent systems: a review. Artif Intell Med 2020; 102: 101762.

10. Hong, Haimovich, Taylor. Predicting hospital admission at emergency department triage using machine learning. PLOS ONE 2018; 13: e0201016.

11. Iorfino, Carpenter, et al. Predicting self-harm within six months after initial presentation to youth mental health services: a machine learning study. PLOS ONE 2020; 15: e0243467.

12. Raita, Goto, Faridi, et al. Emergency department triage prediction of clinical outcomes using machine learning models. Crit Care 2019; 23: 64.

13. Reale, Novak, Robinson, et al. User-Centered design of a machine learning intervention for suicide risk prediction in a military setting. AMIA Annu Symp Proc 2020; 2020: 1050–1058.

14. Gianfrancesco, Tamang, Yazdany, et al. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med 2018; 178: 1544–1547.

15. Higgins, Short, Chalup, et al. Artificial intelligence (AI) and machine learning (ML) based decision support systems in mental health: An integrative review. Int J Ment Health Nurs 2023; 32: 966–978.

16. Higgins, Chalup, Wilson. Artificial intelligence in nursing: trustworthy or reliable? J Res Nurs 2024; 29: 143–153.

17. Liamputtong. Research Methods and Evidence-Based Practice. Docklands: Oxford University Press Australia & New Zealand, 2021.

18. Kuhn. Applied Predictive Modeling. New York, NY: Springer, 2013.

19. Kang. The prevention and handling of the missing data. Korean J Anesthesiol 2013; 64: 402–406.

20. Francis, Garing. Foundations of Statistics, Pearson Original Edition. Melbourne, Australia: Pearson Education Australia, 2015.

21. Breiman. Random forests. Mach Learn 2001; 45: 5–32.

22. Liaw, Wiener. Classification and regression by randomForest. R News 2002; 2: 18–22.

23. Biau, Scornet. A random forest guided tour. Test 2016; 25: 197–227.

24. Chen, Guestrin. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp.785–794.

25. Chen, Benesty, et al. XGBoost: extreme gradient boosting. R Package Version 0.4-2 2024; 1: 1–4.

26. Prokhorenkova, Gusev, Vorobev, et al. CatBoost: unbiased boosting with categorical features. Adv Neural Inf Process Syst 2018; 31.

27. Dorogush, Ershov, Gulin. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 2018.

28. Cover, Hart. Nearest neighbor pattern classification. IEEE Trans Inf Theory 1967; 13: 21–27.

29. Hastie, Tibshirani, Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2013.

30. InterpretML Documentation. InterpretML Documentation, https://interpret.ml/docs/ (2023, accessed 06/04/2023).

31. Rai. Explainable AI: from black box to glass box. J Acad Mark Sci 2019; 48: 137–141.

32. Molnar. Interpretable Machine Learning. Leanpub, 2020.

33. Nori, Jenkins, Koch, et al. InterpretML: A unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223 2019.

34. Higgins, Chalup, Wilson. Machine learning model reveals determinators for admission to acute mental health wards from emergency department presentations. Int J Ment Health Nurs 2024: 1–16 (29 August 2024).

35. Ben-Hur, Horn, Siegelmann, et al. Support vector clustering. J Mach Learn Res 2001; 2: 125–137.

36. Schölkopf, Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press, 2002.

37. Vapnik. The nature of statistical learning theory. New York, NY: Springer Science & Business Media, 2013.

38. Dong, Liu. Feature engineering for machine learning and data analytics. First edition. Boca Raton, FL: CRC Press/Taylor & Francis Group, 2018.

39. Paper. Hands-on Scikit-Learn for Machine Learning Applications: Data Science Fundamentals with Python. 1st ed. Berkeley, CA: Apress, 2020.

40. Chen, Zhang-James, Barnett, et al. Predicting suicide attempt or suicide death following a visit to psychiatric specialty care: a machine learning study using Swedish national registry data. PLoS Med 2020; 17: e1003416.

41. Zolbanin, Crosby, et al. A predictive analytics-based decision support system for drug courts. Inf Syst Front 2019; 22: 1323–1342.

42. Chicco, Tötsch, Jurman. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min 2021; 14: 13.

43. Consoli, Recupero, Petkovic. Data science for healthcare. Switzerland: Springer, 2019.

44. InterpretML Documentation. Overall Importance, https://interpret.ml/docs/overall_importance/ (2023, accessed 06/04/2023).

45. Tabachnick, Fidell, Ullman. Using Multivariate Statistics. London: Pearson, 2018.

46. NSW Health. Public Health Orders and restrictions, https://www.health.nsw.gov.au/Infectious/covid-19/Pages/public-health-orders.aspx (2022, accessed 09/03/2024).

47. Melbourne Institute. Coping with COVID-19: rethinking Australia, Chapter 4: Heightened Mental Distress: Can Addressing Financial Stress Help. Melbourne: Melbourne Institute, 2020.

48. Malgaroli, Hull, Zech, et al. Natural language processing for mental health interventions: a systematic review and research framework. Transl Psychiatry 2023; 13: 309.

49. Chandler, Foltz, Elvevåg. Improving the applicability of AI for psychiatric applications through human-in-the-loop methodologies. Schizophr Bull 2022; 48: 949–957.

50. Schepman, Rodway. Initial validation of the general attitudes towards artificial intelligence scale. Computers in Human Behavior Reports 2020; 1: 100014.

Using machine learning to assist decision making in the assessment of mental health patients presenting to emergency departments
