Abstract

Learning from patient safety incident reports is a vital part of improving healthcare. However, the volume of reports and their largely free-text nature pose a major analytic challenge. The objective of this study was to test the capability of autonomous classification of free text within patient safety incident reports to determine incident type and the severity of harm outcome. Primary care patient safety incident reports (n=31,333) previously expert-categorised by clinicians (training data) were processed using J48, SVM and Naïve Bayes.

The SVM classifier was the highest scoring classifier for incident type (AUROC, 0.891) and severity of harm (AUROC, 0.708). Incident reports involving deaths were the most easily classified, with 72.82% of such reports correctly identified. In conclusion, supervised ML can be used to classify patient safety incident report categories. The severity classifier, whilst not accurate enough to replace manual processing, could provide a valuable screening tool for this critical aspect of patient safety.

Keywords

incident reporting machine learning natural language processing patient safety quality improvement

Background and significance

Harm associated with healthcare is the third leading cause of death in the United States.¹ It affects over 10% of patients in hospital^2,3 and 2%–3% of those seen in primary care settings.⁴ A patient safety incident is said to occur when a situation that could have resulted, or did result, in avoidable harm to a patient is observed during healthcare delivery.⁵ Many of these incidents can involve life and death moments.

Healthcare has a poor record of creating actionable learning for quality improvement from patient safety incident reports.⁶ One key reason is that the most important information is described in the free-text part of an incident report. While every incident report is read and actioned locally, it is often not until reports are aggregated that patterns become apparent. To aggregate these data, though, they must be categorised in the same manner and to the same standard. A traditional approach of establishing a classification framework, creating categories and rules for applying them and then training coding clerks is invariably defeated by the logistics. For example, in England and Wales, over 100,000 patient safety incident reports are submitted by frontline clinical staff every month.⁷ On a national level, only a small proportion of patient safety incidents a year is ever analysed for causation.^8,9 This is a remarkable and troubling failure to use data that have already been collected to protect patients from harm and inform health system improvements. Rather than focusing decisions on which small minority of incidents to prioritise for analysis,¹⁰ a potential solution is natural language processing (NLP) used in conjunction with machine learning (ML). Together, they can convert unstructured free text into structured information autonomously.^11–15 Automatically and accurately assigning incident categories to incident reports would remove a major manual component of our current patient safety strategy on a national level.

A prerequisite for the success of a supervised NLP implementation is the availability of large quantities of suitable training data from which the machine can learn¹¹ and which have been categorised by a domain expert.¹³ The recent Primary Care Patient Safety (PISA) study¹⁶ provided a unique corpus of primary care patient safety incident reports that had been read, categorised and coded by trained clinicians with expertise in patient safety and human factors.

Aim

This study aimed to test the capability of NLP/ML to classify unstructured free text within patient safety incident reports in two main themes: the incident category and harm severity. Each incident had been previously classified manually by an expert clinical and human factors team applying a classification framework that had been developed and validated by the research group.¹⁶ For each of these, the study sought to examine whether this could be achieved using just the unstructured free-text description of an incident report alone or whether the addition of structured categorical data (routinely collected as part of incident reports, such as specialty) improved the success of the autonomous classification.

Materials and methods

Classifiers

This study tested supervised ML classifiers, which use preexisting categorised data to derive learning.¹⁷ ML classifiers and techniques that are able to classify text in documents, including within patient safety incident reports, were identified through literature review. For each research question, three different ML classifiers were trained and subsequently evaluated – Naïve Bayes (NB), J48 and Support Vector Machine (SVM) with a polykernel. J48, NB and SVM were chosen since they have been successful in classifying medical incident reports in previous studies^18,19 and represent two distinct approaches to supervised ML, namely generative and discriminative models:²⁰

NB, a traditional generative classifier, has repeatedly demonstrated success in document classification tasks.¹⁵

J48’s decision tree structure provides an output that can be intuitively checked by domain experts with limited ML/NLP experience, allowing validation of the core logic of the tree.²¹

SVMs are discriminative classifiers which can cope well with training data consisting of large numbers of irrelevant features, as is the case with our text data. For this reason they have consistently outperformed other classifiers in a number of text categorisation tasks. They are also less prone to class imbalance problems.²²

Data sources

Patient safety incident reports are principally a free-text description²³ with additional categorical values such as location and time to add context. As part of the PISA study, the incidents have been categorised against a framework which was iteratively developed, validated and described in detail elsewhere (the PISA framework).⁷ Categories were applied using the Recursive Model of Incident Analysis, which ensures a chronological listing of incidents culminating in the event that directly harmed the patient.⁷ This leads to several levels of incident type – ‘primary’ denoting the incident directly impacting the patient, and then subsequent levels show the chain of events and factors that may have contributed to the incident. The PISA study¹⁶ also reclassified the severity rating. This was used in the present study.

Subset

Incidents that had been categorised as part of the PISA study and related studies were extracted from the database at Cardiff University. Those that had not been categorised by the main PISA incident and severity framework were removed, leaving 31,333 incident reports. There were 16 categorical variables and 4 free-text variables of data extracted for all incident reports (see Appendix 1). One free-text category was rarely completed and often with similar material and therefore was treated as categorical. The data were then split into two subsets – one including just the free-text ‘description of what happened’ field and another that included all the columns of data available to allow evaluation of whether the additional columns of categorical data assisted the classifier or not. The data were then converted into the Attribute-Relation File Format ready for importing into the ML software, as per previous studies in the area.^19,24

Dataset processing – characteristics

Figure 1 shows the class imbalance inherent at a high level, with 12,649/31,333 (40.4%) incidents in the ‘0 – Incorrect use of system’ category, compared to only 501 (1.6%) in the ‘10 – Other’ category. Therefore, the ‘0 – Incorrect use of system’ category and ‘6 – Medications’ categories were expanded to their second-level categories to reduce the class imbalance. Figure 2 shows the incident categories after the expansion.

Figure 1.

Number of incident reports by highest level incident categories (0–10).

Figure 2.

Number of incident reports by expanded incident categories (0.1–10).

Figure 3 shows the incident severity categories. There were 19,323 (61.7%) incidents that did not contain a severity category since they involved categories that were excluded from severity assessment during the PISA study (e.g. ‘no harm from primary care’ or ‘defensive reporting’).

Figure 3.

Number of incident reports by severity.

Software

Data were accessed and extracted through Microsoft SQL Server 2014, hosted on a secure Microsoft Windows Server 2012R2 instance at Cardiff University. Data were subsequently imported into the Waikato Environment for Knowledge Analysis (Weka) 3.8.0, an NLP and ML environment.²⁵ Weka is regularly used in healthcare document classification and has been used in previous studies into incident report classification.^18,19

Preprocessing incident reports

All free-text variables were first processed using Weka’s StringToWordVector filter to create a uniform representation for the reports. The following procedures were applied: NGram tokenisation to produce trigrams, bigrams and unigrams.¹³ Unigrams represent individual terms (e.g. ‘patient’, ‘wound’). Bigrams and trigrams are sequences of two or three terms (e.g. ‘blood form’, ‘blood group’, ‘blood result’, ‘blood request’, ‘pressure ulcer’). They were used to add an element of semantic processing, since negation (e.g. ‘not allergic’) can then be captured, which is important for producing correct classifier rules.

Lower case normalisation was used to ensure that all forms of the same word were classified together (e.g. Patient, patient, pAtient, etc.).²⁶

‘Stopword’ filtering was used to exclude common words (such as he, she, it, why, we, etc.), which hold no classification value.¹³ This technique is commonly used in information retrieval and NLP document classification implementations.²⁶ The ‘Rainbow’ stopwords list built into Weka was used.²⁷

Term frequency filtering: Previous studies have excluded words that appear infrequently in the corpus,¹⁷ and due to the large size of the corpus it was decided that a minimum term frequency of 10 should be used.

Number of words in training set – 3000 words were kept as a balance between accuracy and resource (CPU/memory) use.

Once the features to be represented were defined through the above procedures, uniform vectorial representations of each report were created where each feature was assigned a TFxIDF (Term Frequency, Inverse Document Frequency) score for that report. TFxIDF values are a function of the frequency of the term in the report, weighted according to the frequency of occurrence of the term in the dataset. Intuitively, these scores encode that the more often a term appears in a report, the more representative of that report it is, while the more reports it occurs in the less discriminative it is. TFxIDF scores can highlight relevant words when categorising large numbers of text documents.^26,28,29 For the TF to be accurate, all documents were normalised so that longer incident reports did not skew the results.³⁰
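The preprocessing steps above can be sketched in a few lines of Python. This is an illustrative approximation, not the Weka configuration used in the study: the function names, the tiny stopword list and the plain tf × log(N/df) weighting are all assumptions, and Weka's own filter may compute its weights differently. The study's settings would correspond to `min_term_freq=10` and `max_features=3000`; the defaults here are lowered so that a toy corpus produces output.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); the study used Weka's 'Rainbow' list.
STOPWORDS = {"he", "she", "it", "why", "we", "the", "a", "was", "to", "of"}


def ngrams(text, max_n=3):
    """Lower-case the text, drop stopwords, and emit unigrams, bigrams and trigrams."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(reports, min_term_freq=1, max_features=3000):
    """Return (vocabulary, length-normalised TF-IDF vectors), one vector per report."""
    docs = [Counter(ngrams(r)) for r in reports]
    corpus_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()     # number of reports containing each term
    for d in docs:
        corpus_freq.update(d)
        doc_freq.update(d.keys())
    # Keep terms that meet the minimum corpus frequency, most frequent first.
    kept = [t for t, c in corpus_freq.most_common() if c >= min_term_freq][:max_features]
    vocab = sorted(kept)
    n_docs = len(docs)
    vectors = []
    for d in docs:
        vec = [d[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
        # Normalise to unit length so longer reports do not dominate.
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm if norm else 0.0 for v in vec])
    return vocab, vectors


reports = [
    "Patient not allergic to penicillin",
    "Patient developed pressure ulcer",
    "Pressure ulcer found on patient heel",
]
vocab, vectors = tfidf_vectors(reports)
```

In this toy corpus the term ‘patient’ occurs in every report, so its inverse document frequency is log(3/3) = 0 and it carries no weight, whereas the bigram ‘pressure ulcer’ occurs in only two reports and keeps a positive weight.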

Data security

All data were stored and accessed on a designated patient safety research computing cluster at Cardiff University, which has been designed with full NHS Information Governance Toolkit assurance for secondary use of data (IG Toolkit ID: 8WG65-PISA-CAG-0182). All data were stored and accessed in accordance with a data sharing agreement between NHS England and Cardiff University.

All data were anonymised by NHS England, compliant with the highest standards of information governance regulations, before being received by Cardiff University.

Training and testing the individual classifiers

Each classifier (e.g. SVM, NB) was trained and evaluated using a stratified 10-fold cross-validation technique built into Weka, ensuring the maximum amount of training material was available for the training while also ensuring rigour and reproducibility.^15,19
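For readers unfamiliar with stratified cross-validation, the idea can be sketched as follows. This is a minimal sketch and not Weka's implementation: the function names and the round-robin assignment are assumptions chosen for clarity. Each fold receives roughly the same share of every class, so rare classes are represented in every test set.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each index to one of k folds, keeping class proportions roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing from where the previous class stopped.
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Every report is used for testing exactly once and for training in the other nine folds, which is why the approach makes the most of a limited pool of expert-coded reports.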

Statistics and analyses

Some types of incident reports are naturally reported more frequently (such as those related to medications and vaccines¹⁶) leading to a ‘class imbalance’. The area under the receiver operating characteristic curve (AUROC) was chosen as our primary outcome measure since it provides a single global measure of performance even in imbalanced data.³¹ Previous studies in ML/NLP have regarded an AUROC above approximately 0.8 as satisfactory, and the closer to 1.0 the better.^32,33 However, to allow comparability with previous NLP and ML studies in this field, percentage correct and incorrect, precision, recall and F-measure are also reported. Weighted average values, as natively produced by Weka, are reported.
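As a rough illustration of these measures, the sketch below computes AUROC for one class against the rest as the probability that a randomly chosen positive report receives a higher score than a randomly chosen negative one (ties count as half), and then averages per-class values weighted by class size. The function names are hypothetical, and the class-size weighting is our reading of how the weighted averages are formed rather than a statement of Weka's internals.

```python
def auroc(scores, is_positive):
    """Rank-based AUROC for a single class treated as positive versus the rest."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties contribute half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_average(per_class_values, class_counts):
    """Average a per-class metric, weighting each class by its number of reports."""
    total = sum(class_counts.values())
    return sum(per_class_values[c] * class_counts[c] for c in class_counts) / total
```

Because AUROC depends only on how positives rank against negatives, it does not reward a classifier for simply predicting the largest class, which is the property that makes it suitable for imbalanced data.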

Ethical research considerations

The training data used for this current study were generated as part of the NIHR HS&DR study – ‘Characterising the nature of primary care patient safety incident reports in England and Wales: mixed methods study’ – the PISA study, which analysed patient safety incident reports submitted to the National Reporting and Learning System from primary care in England and Wales between 2005 and 2013.⁷ The PISA study did not require approval from the Health Research Authority’s Research Ethics Committee (REC), and the Aneurin Bevan University Health Board research risk review committee waived the need for ethical approval (ABHB R&D Ref number: SA/410/13). Ethical approval for the current study was granted by the Swansea University REC (REF: 040816).

Results

Incident type classification – highest level incident categories (0–10)

Table 1 shows the results of the ML categorisation for the highest level incident categories. SVM had the highest AUROC, improving from 0.839 to 0.854 with the additional columns of data available (see Appendix 2).

Table 1.

Results of incident-type categorisation for the highest level incident categories.

Classifier	Correct (%)	Incorrect (%)	Cohen’s Kappa	Precision	Recall	F-Measure	AUROC
With all variables of data available
SVM	64.111	35.889	0.523	0.629	0.641	0.633	0.854
J48	58.437	41.563	0.4227	0.542	0.584	0.550	0.736
NB	16.092	83.908	0.106	0.540	0.161	0.168	0.564
With only ‘Description of Incident’ available
SVM	61.845	38.155	0.490	0.602	0.618	0.607	0.839
J48	56.643	43.357	0.421	0.539	0.566	0.550	0.717
NB	12.22	87.780	0.074	0.512	0.122	0.112	0.544

AUROC: area under the receiver operating characteristic curve, NB: Naïve Bayes, SVM: Support Vector Machine.
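The tables also report Cohen's kappa, which corrects raw accuracy for the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected from the row and column totals. A minimal sketch follows, assuming a confusion matrix stored as nested dictionaries; this data layout and the function name are assumptions made for illustration.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from confusion[actual][predicted] counts."""
    labels = set(confusion) | {p for row in confusion.values() for p in row}
    total = sum(sum(row.values()) for row in confusion.values())
    # Observed agreement: share of reports on the diagonal.
    observed = sum(confusion.get(l, {}).get(l, 0) for l in labels) / total
    # Chance agreement: product of actual and predicted totals for each label.
    expected = sum(
        sum(confusion.get(l, {}).values())
        * sum(row.get(l, 0) for row in confusion.values())
        for l in labels
    ) / (total * total)
    # Undefined (division by zero) if chance agreement is already perfect.
    return (observed - expected) / (1 - expected)
```

A classifier that assigns every report to the largest class can score a high percentage correct on imbalanced data while its kappa stays near zero, which is why kappa is a useful companion to the accuracy columns.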

Incident type classification – expanded incident categories (0.1–10)

Table 2 shows the results of the ML categorisation for the expanded incident categories. SVM had the highest AUROC in both configurations, improving from 0.870 to 0.891 when the additional columns of data were included. Neither J48 classifier completed; both runs aborted after 15 hours (see Discussion).

Table 2.

Results of incident-type categorisation for the expanded incident categories.

Classifier	Correct (%)	Incorrect (%)	Cohen’s Kappa	Precision	Recall	F-Measure	AUROC
With all variables of data available
SVM	52.558	47.442	0.493	0.515	0.526	0.516	0.891
J48	Did not complete
NB	4.270	95.730	0.037	0.318	0.043	0.061	0.520
With only ‘Description of Incident’ available
SVM	46.855	53.145	0.4313	0.462	0.469	0.462	0.870
J48	Did not complete
NB	3.20	96.799	0.0277	0.302	0.032	0.045	0.515

AUROC: area under the receiver operating characteristic curve, NB: Naïve Bayes, SVM: Support Vector Machine.

Table 3 shows the AUROC for each individual incident category, when using the SVM classifier and all variables of data. Classes that achieved AUROC >0.98 included 0.4, 0.6, 0.7, 0.9 and 6.11. An AUROC of >0.8 was achieved by 17 of the 18 medication categories.

Table 3.

AUROC for expanded incident categories, with all columns of data available using the SVM classifier.

Class	AUROC	Number of incidents	Precision	Recall	F-Measure
0 – Incorrect use of system	0.851	157	0.291	0.102	0.151
0.1 – Defensive Reporting	0.773	357	0.313	0.084	0.132
0.2 – Irrelevant	0.726	2991	0.302	0.280	0.290
0.3 – Insufficient detail	0.791	3392	0.460	0.479	0.469
0.4 – Reporting deaths	0.983	422	0.616	0.737	0.671
0.5 – Incident not related to healthcare	0.929	2015	0.608	0.594	0.601
0.6 – Pressure ulcer	0.981	2398	0.757	0.786	0.772
0.7 – Healthcare-associated infection	0.993	64	0.593	0.547	0.569
0.8 – Complaints/coroner investigation	0.857	129	0.318	0.109	0.162
0.9 – Appropriate breach of confidentiality	0.990	724	0.782	0.822	0.801
1 – Administration	0.884	2734	0.432	0.533	0.477
2 – Documentation	0.932	855	0.537	0.483	0.509
3 – Referral	0.878	1532	0.356	0.337	0.346
4 – Diagnosis and assessment	0.895	1553	0.387	0.458	0.420
5 – Treatment and procedures	0.866	1876	0.399	0.418	0.408
6 – Medications and vaccines	0.807	58	0.000	0.000	0.000
6.1 – Clinical treatment decision errors in the treatment decision-making process	0.806	238	0.291	0.067	0.109
6.2 – Wrong medication prescribed	0.948	1243	0.541	0.648	0.590
6.3 – Dispensing medication orders error	0.977	2686	0.785	0.842	0.812
6.4 – Administering medication errors	0.931	819	0.504	0.591	0.544
6.5 – Monitoring medications	0.944	152	0.478	0.283	0.355
6.6 – Adverse event (inc allergies)	0.945	321	0.533	0.505	0.518
6.7 – Drug omission	0.902	54	0.000	0.000	0.000
6.8 – Patient self-administered overdose	0.911	81	0.333	0.086	0.137
6.9 – Incorrect storage	0.967	44	0.400	0.091	0.148
6.10 – Medication timeliness	0.916	476	0.432	0.408	0.419
6.11 – Vaccines	0.988	534	0.806	0.801	0.804
6.12 – Medication unavailable	0.887	61	0.600	0.148	0.237
6.13 – Prescription handling	0.969	74	0.444	0.162	0.238
6.14 – Lost medication	0.973	40	0.450	0.225	0.300
6.15 – Inappropriate medication supply	0.911	14	0.000	0.000	0.000
6.16 – Unsuitable medication taken by patient	0.806	3	0.000	0.000	0.000
6.17 – OTC medication	0.500	1	0.000	0.000	0.000
7 – Investigations	0.977	1473	0.776	0.777	0.776
8 – Communication	0.840	510	0.198	0.159	0.176
9 – Equipment	0.899	751	0.461	0.379	0.416
10 – Other	0.870	501	0.342	0.188	0.242

AUROC: area under the receiver operating characteristic curve, SVM: Support Vector Machine.

Table 3 also shows that AUROC is not necessarily proportional to the number of incident reports in a category. For example, category 6.3 has 2686 incidents and an AUROC of 0.977, whereas category 6.14 has only 40 incidents yet an AUROC of 0.973. Conversely, some categories had many incident reports but a low AUROC, such as category 0.3, which had 3392 incident reports but an AUROC of only 0.791.

Severity classification

Table 4 shows the results for severity classification. SVM achieved the highest AUROC at 0.708 with all columns of data, although this was below the level of approximately 0.8 that we regarded as satisfactory. Figure 4 shows the confusion matrix for the SVM severity classifier, coloured to show where the classifier classified correctly and where it failed. It correctly identified 72.82% (225/309) of cases involving death, compared to only 20.95% of cases in the severe harm category.

Table 4.

Results of severity categorisation for the expanded incident categories.

Classifier	Correct (%)	Incorrect (%)	Cohen’s Kappa	Precision	Recall	F-Measure	AUROC
With all variables of data available
SVM	64.371	35.627	0.448	0.643	0.644	0.643	0.708
J48	64.355	35.645	0.420	0.644	0.644	0.629	0.694
NB	20.900	79.100	0.113	0.589	0.209	0.276	0.573
With only ‘Description of Incident’ available
SVM	61.091	38.909	0.392	0.609	0.611	0.609	0.683
J48	58.943	41.058	0.359	0.585	0.589	0.587	0.647
NB	16.728	83.272	0.088	0.595	0.167	0.226	0.561

AUROC: area under the receiver operating characteristic curve, NB: Naïve Bayes, SVM: Support Vector Machine.

Figure 4.

Confusion matrix for severity classification for SVM with all columns of data available.
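The per-category percentages read from a confusion matrix are per-class recall: the diagonal cell divided by the row total. The sketch below reproduces that calculation for the death category. Only the 225 correct classifications and the row total of 309 come from the text; the split of the remaining 84 reports across other classes, and the class names other than ‘death’ and ‘severe’, are invented for illustration.

```python
def per_class_recall(confusion):
    """Recall per true class from confusion[actual][predicted] counts."""
    recall = {}
    for actual, row in confusion.items():
        row_total = sum(row.values())
        if row_total:
            recall[actual] = row.get(actual, 0) / row_total
    return recall


# Hypothetical row: 225 of 309 is from the text, the other cells are made up.
confusion = {
    "death": {"death": 225, "severe": 20, "moderate": 30, "low": 34},
}
death_recall = per_class_recall(confusion)["death"]
print(f"{100 * death_recall:.2f}%")  # 72.82%
```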

Discussion

This study has shown great promise for automatically analysing patient safety incidents, with accurate classification achieved in several incident categories. Classification of the content of incident reports was particularly accurate for medication incidents (17/18 categories achieving an AUROC of >0.8) and pressure ulcers (AUROC 0.981). The classifier also correctly identified 72.82% of incident reports involving a patient death, which will provide a valuable safety net.

However, we have also shown that this method does not perform well when classifying the severity of harm of patient safety incident reports. While the so-called ‘bag of words’ approach yields limited success, this may be sufficient to serve as a safety net to ensure that important cases are not missed during review. This study has also highlighted the categories that need both further refining of their definitions and where additional categorised incident reports are needed to most efficiently improve and refine the classifier. For example, vaccine errors achieved an almost perfect AUROC of 0.988 – thus, further human classification would not improve this value considerably. In contrast, further training material for the category ‘8 – Communication’ (with an AUROC of 0.84 and only 510 reports) may improve its accuracy considerably.

We found that the number of incident reports is not proportional to the overall success of the categorisation. This is consistent with Ong et al. (2010).¹⁸ Potentially, once the classifier has ascertained the best words to identify an incident category, further reports do not add to its accuracy.

Certain categories were harder to classify autonomously than others. This is also true of incidents studied in the aviation industry.³⁴ This may be because certain categories have few specific terms that the algorithm can utilise to confidently discriminate. Conversely, categories with very specific words lead to highly accurate classifications, such as pressure ulcers (category 0.6), where words such as ‘pressure’ and ‘grade’ are fairly unique in medicine to this topic. This has been highlighted in previous work.³² Similarly, since healthcare professionals write reports in very high-level technical language, reports regularly contain abbreviations and acronyms, which pose a further problem for the classifier.^35,36 This is more problematic for certain categories, such as ‘7 – Investigations’, where healthcare professionals are more likely to call a ‘full blood count’ an ‘FBC’ or a ‘positron emission tomography scan’ a ‘PET scan’, than for other categories such as communication. However, this can also be seen as a positive since terms that are specific to certain domains are ideal for a classifier. Nuances and ambiguity of language can lead to confusion for the classifier and this has been highlighted as a problem in other NLP/ML applications too.¹¹ Spelling mistakes cause further issues for the classifier since it treats different spellings as different words and thus classifies them differently. This is regularly a problem in other NLP/ML studies.³⁵

However, although the number of words may not influence accuracy, when combined with our hardest task (computationally) – the expanded incident categories with 37 possible categories – it may explain why the J48 classifier failed. Decision trees are computationally expensive, needing large amounts of resource (processing and memory) and do not scale to large numbers of classes.³⁷ In this study, that led to the J48 classifier running out of memory before completing.

One key category which posed problems for the classifier is contained in the 0.2 category, 0.2.1 – no harm from primary care. This category is used where there is a patient safety incident but it was not caused by an act or omission by primary care. It is likely that the classifier correctly identifies these incidents as, for example, medication incidents but because it was caused by secondary care it is classified as ‘no harm from primary care’ by the PISA study. It is therefore seen as a misclassification, despite the classifier being technically correct.

Strengths and limitations

This study had several strengths. First, it was the first study of its kind to use UK primary care incident reports and moreover was the largest ML/NLP study of patient safety incident reports conducted that we are aware of. Second, it used more incident categories than any other study we are aware of, and it was the first of its kind to use not only the information from the reporter but also the expert-applied PISA classification system.⁷

There are several broad limiting factors for the overall performance of the study; however, these were often outside our (and any study’s) control, namely the original content of the incident reports, the PISA coding of the incident reports (and their sampling) and inherent limitations of the classifiers themselves.

As seen in other studies on incident analysis, clear definitions can be more important than the size of the training set from which the classifier has to learn.³⁴ Table 3 shows this clearly and here this study’s methodology may have limited the outcome of its classifier. The PISA classification was iteratively developed and contains over 350 different incident categories. It was decided at the outset that, because of the framework’s hierarchical structure, there were insufficient data to train a classifier on all 350 categories, and the study therefore focused on the highest level categories (0 to 10). While this seems at the outset to be simpler for the classifier, it may conversely lead to more confusion since large quantities of incident reports are now grouped by broad vague concepts such as ‘Medication incidents’, ‘Incorrect use of system’ and ‘Other’. The ‘Incorrect use of system’ category is the broadest, ranging from pressure ulcers through to defensive reporting. To assess if this had caused further confusion, the broadest categories – ‘Incorrect use of system’ and ‘Medication incidents’ – were broken down to their next level in the hierarchy. This increased the AUROC despite increasing the number of categories from 11 to 37 and at the same time reducing the number of reports available in each category from which to train.

The classifiers used in this study were trained only on the final incident that has directly led to patient harm. However, a single report may contain several interconnected incidents that led to the final outcome. The classifier may correctly identify any number of incidents contained within the report, but if it does not choose the final/primary incident, it will technically get the category wrong. This will require further research. The ultimate category applied to an incident is often subject to much debate and scrutiny, often requiring a third party to cast the final vote.⁷ This is seen in numerous studies which used expert-categorised data to train their classifiers, where disagreement between experts was seen in up to 20% of cases.³⁸ Therefore, we should not expect every incident to have been categorised in exactly the same way due to there being several (albeit highly trained) coders in the original study.¹⁶

The ‘bag of words’ strategy is a simple and effective approach; however, structure from the text is lost and thus the semantic meaning.^12,18 Negation is lost (e.g. ‘no allergies’), which poses a major problem since it treats the word ‘allergies’ the same irrespective of the preceding terms and this has been shown to be a problem in other studies.³⁹ To compensate for this, bigrams and trigrams were utilised in this study which would have attempted to identify the above example. Another solution is to use a semantic processor which can analyse sentences in their entirety.¹³ However, even with this approach, sometimes the sentences on either side can affect the meaning of the sentence in question, so-called ‘cross-sentence correlation’, which can have a similar effect as negation.⁴⁰ Recent works with paragraph vectors have shown improvements on the bag of words model by up to 30%.⁴¹
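The negation problem and the n-gram workaround can be seen in a toy comparison. This sketch uses whitespace tokenisation and a hypothetical helper name, and it assumes that negating words such as ‘no’ are not removed by the stopword filter; if the stopword list in use does drop them, the bigram can no longer capture the negation.

```python
def word_ngrams(text, n):
    """Return the set of n-word sequences in a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


with_negation = "patient has no allergies"
without_negation = "patient has allergies"

# With unigrams alone, both reports contain the feature 'allergies'.
shared_unigrams = word_ngrams(with_negation, 1) & word_ngrams(without_negation, 1)

# Bigrams add features, such as 'no allergies', that only the negated report has.
distinct_bigrams = word_ngrams(with_negation, 2) - word_ngrams(without_negation, 2)
```

Here `shared_unigrams` includes ‘allergies’ for both reports, while `distinct_bigrams` contains ‘no allergies’, giving the classifier a feature that separates the two meanings.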

Comparison with previous work

There has been little research conducted on the use of ML and NLP in automating incident report analysis in healthcare.¹⁸ There has been considerably more research and success with it in incident reports in aviation³⁴ and notable successes reported for text classification from verbal autopsies,⁴² which have several similarities with incident reports. Of those studies of safety reports in healthcare, Wong and Akiyama³² undertook a study of 227 Canadian medication incident reports and used a custom classifier based on logistic regression to achieve good accuracy in autonomously categorising incident type. Ong et al. (2010) performed a larger study of 972 incident reports in Australia by focusing on two types of patient safety incident: ‘inadequate clinical handover’ and ‘incorrect patient identification’.¹⁸ They used NB and SVM classifiers with excellent results (accuracy up to 97.98% with SVM on patient identification incidents) but noted that the topics chosen had very specific words that the classifier could easily detect, which probably led to their good results.¹⁸ Gupta and Patrick¹⁹ undertook a larger study of 5448 Australian incident reports, including 13 categories of incident type, and utilised NB, SVM and the J48 decision tree classifier. They have reported achieving good results in an online presentation; however, their detailed methodology has not been published, making further comparison difficult.¹⁹ The largest work in the field (up until now) appears to have been undertaken in Japan, where 15,000 patient safety incident reports were clustered using cluster analysis to ascertain their incident type, but they did not provide statistical or numerical results.^23,43 A recent paper by Wang et al.⁴⁴ looked at using ML and NLP to categorise Australian incident reports. Their study used fewer incident categories and a significantly smaller dataset than ours, and they too struggled to classify severity level. Wang et al. also demonstrated the difference that using balanced datasets makes to the accuracy of the task, although since real-world incident report data are inherently imbalanced we chose not to balance our dataset.

Recommendations for future work

This project is the largest attempt at classifying patient safety incident reports in primary care to date, but further research will be required to achieve the same results on secondary care data. Within the scope of the current dataset, future research could focus on examining incident reports in their entirety utilising semantic classifiers,¹² and whether sequences of incidents can be extracted, something that has been researched in airline incident report analysis.²⁹ Although the categorical data routinely collected with each report are often non-specific, as it improved our study’s performance, it would be prudent to further research how these data can be used to enhance incident report categorisation. Further work around J48, either using reduced categories or superior infrastructure, is required, since its ‘human readable’ output allows checking for plausibility by patient safety experts. Improving definitions and increased training examples of select categories will likely further improve the performance.

Conclusion

Converting unstructured data to structured data using NLP/ML is challenging across all subject domains.^13,40,45 However, the highly nuanced and technical nature of medical text adds a further dimension of complexity.⁴⁶ While this study shows that NLP/ML is not perfect and cannot yet replace manual review entirely,⁴⁷ it suggests that it can act as a safety net, identifying cases that lead to severe harm and death, which have been incorrectly classified. The ability to determine certain categories accurately can also assist reviewers in those areas to focus on cases that need manual review – saving money and time.¹⁴ It also opens up the possibility of clustering reports that are ‘near misses’ or ‘no harm’, which are currently too time-consuming to work on in healthcare, which is a key strategy used by the airline industry in their successful safety model.⁴⁸

Footnotes

Appendix 1

Appendix 2

Acknowledgements

Incident reports were originally coded for a project funded by the National Institute for Health Research Health Services and Delivery Research Programme (project number 12/64/118). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The views and opinions expressed herein are those of the authors and do not necessarily reflect those of the National Institute for Health Research Health Services and Delivery Research Programme, the National Health Service, or the Department of Health.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study was funded by the Division of Population Medicine, Cardiff University.

ORCID iD

Huw Prosser Evans

References

1. Makary, Daniel. Medical error – the third leading cause of death in the US. BMJ 2016; 353: i2139.

2. Vincent, Neale, Woloshynowych. Adverse events in British hospitals: preliminary retrospective record review. BMJ 2001; 322: 517–519.

3. Rafter, Hickey, Condell, et al. Adverse events in healthcare: learning from mistakes. QJM 2015; 108: 273–277.

4. Panesar, deSilva, Carson-Stevens, et al. How safe is primary care? A systematic review. BMJ Qual Saf 2016; 25: 544–553.

5. World Health Organization. Conceptual framework for the international classification for patient safety. Geneva: World Health Organization, 2010, pp. 1–153.

6. Stavropoulou, Doherty, Tosey. How effective are incident-reporting systems for improving patient safety? A systematic literature review. Milbank Q 2015; 93: 826–866.

7. Carson-Stevens, Hibbert, Avery, et al. A cross-sectional mixed methods study protocol to generate learning from patient safety incidents reported from general practice. BMJ Open 2015; 5: e009079.

8. National Patient Safety Agency (NPSA). Seven steps to patient safety: full reference guide, http://www.nrls.npsa.nhs.uk/resources/?entryid45=59787&q=0%c2%acseven+steps+to+patient+safety%c2%ac (2004, accessed 15 June 2016).

9. National Patient Safety Agency (NPSA). Being open: communicating patient safety incidents with patients, their families and care, http://www.nrls.npsa.nhs.uk/resources/collections/being-open/?entryid45=83726 (2009, accessed 1 August 2016).

10. Hibbert, Healey, Lamont, et al. Patient safety’s missing link: using clinical expertise to recognize, respond to and reduce risks at a population level. Int J Qual Health Care 2016; 28: 114–121.

11. Erhardt, Schneider, Blaschke. Status of text-mining techniques applied to biomedical text. Drug Discov Today 2006; 11: 315–325.

12. Hirschberg, Manning CD. Advances in natural language processing. Science 2015; 349: 261–266.

13. Kimia, Savova, Landschaft, et al. An introduction to natural language processing: how you can get more from those electronic notes you are generating. Pediatr Emerg Care 2015; 31: 536–541.

14. Melton, Hripcsak. Automated detection of adverse events using natural language processing of discharge summaries. J Am Med Inform Assoc 2005; 12: 448–457.

15. Witten, Frank, Hall MA. Data mining: practical machine learning tools and techniques. 3rd ed. Burlington, MA: Morgan Kaufmann Publishers, 2011.

16. Carson-Stevens, Hibbert, Williams, et al. Characterising the nature of primary care patient safety incident reports in the England and Wales national reporting and learning system: a mixed-methods agenda-setting study for general practice. Health Serv Deliv Res 2016; 4: 1–76.

17. Savova, Ogren, Duffy, et al. Mayo clinic NLP system for patient smoking status identification. J Am Med Inform Assoc 2008; 15: 25–28.

18. Ong, Magrabi, Coiera. Automated categorisation of clinical incident reports using statistical text classification. Qual Saf Health Care 2010; 19: e55.

19. Gupta, Patrick. Automated validation of patient safety clinical incident classification: macro analysis. Stud Health Technol Inform 2013; 188: 52–57.

20. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Advances in neural information processing systems, 2002; pp. 841–848, https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

21. Yadav, Sarioglu, Smith, et al. Automated outcome classification of emergency department computed tomography imaging reports. Acad Emerg Med 2013; 20: 848–854.

22. Japkowicz, Stephen. The class imbalance problem: a systematic study. Intell Data Anal 2002; 6: 429–449.

23. Fujita, Akiyama, Toyama, et al. Detecting effective classes of medical incident reports based on linguistic analysis for common reporting system in Japan. Stud Health Technol Inform 2013; 192: 137–141.

24. Pollettini, Panico, Daneluzzi, et al. Using machine learning classifiers to assist healthcare-related decisions: classification of electronic patient records. J Med Syst 2012; 36: 3861–3874.

25. Frank, Hall, Witten. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". 4th ed. Burlington: Morgan Kaufmann, 2016.

26. Alicante, Amato, Cozzolino, et al. A study on textual features for medical records classification. Stud Health Technol Inform 2014; 207: 370–379.

27. Kachites. Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering. 1996, http://www.cs.cmu.edu/~mccallum/bow

28. Alparslan, Karahoca, Bahşi. Classification of confidential documents by using adaptive neurofuzzy inference systems. Proced Comput Sci 2011; 3: 1412–1417.

29. Ittoo, Nguyen, van den Bosch. Text analytics in industry: challenges, desiderata and trends. Comput Ind 2016; 78: 96–107.

30. Manning, Raghavan, Schutze. Introduction to information retrieval. Cambridge: Cambridge University Press, 2009.

31. Batista, Prati, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newslett 2004; 6: 20–29.

32. Wong, Akiyama. Statistical text classifier to detect specific type of medical incidents. Stud Health Technol Inform 2013; 192: 1053.

33. Tenório, Hummel, Cohrs, et al. Artificial intelligence techniques applied to the development of a decision-support system for diagnosing celiac disease. Int J Med Inform 2011; 80: 793–802.

34. Tanguy, Tulechki, Urieli, et al. Natural language processing for aviation safety reports: from classification to interactive analysis. Comput Ind 2016; 78: 80–95.

35. Penz, Wilcox, Hurdle JF. Automated identification of adverse events related to central venous catheters. J Biomed Inform 2007; 40: 174–182.

36. Savova, Fan, et al. Discovering peripheral arterial disease cases from radiology notes using natural language processing. AMIA Annu Symp Proc 2010; 2010: 722–726.

37. Moise, Pournaras, Helbing. Classification and decision trees, https://www.ethz.ch/content/dam/ethz/special-interest/gess/computational-social-science-dam/documents/education/Spring2016/Datascience/Classification%20and%20Decision%20Trees.pdf (2016, accessed 7 December 2016).

38. Hripcsak, Austin JHM, Alderson, et al. Use of natural language processing to translate clinical information from a database of 889,921 chest radiographic reports. Radiology 2002; 224: 157–163.

39. Hou, Chang, Nguyen, et al. Automated identification of surveillance colonoscopy in inflammatory bowel disease using natural language processing. Dig Dis Sci 2013; 58: 936–941.

40. Sevenster, van Ommering, Qian. Automatically correlating clinical findings and body locations in radiology reports using MedLEE. J Digit Imaging 2012; 25: 240–249.

41. Mikolov. Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning, Beijing, China, 21–26 June 2014, p. 32. New York: ACM.

42. Danso, Atwell, Johnson O. A comparative study of machine learning methods for verbal autopsy text classification. International Journal of Computer Science Issues 2013; 10(6).

43. Fujita, Akiyama, Park, et al. Linguistic analysis of large-scale medical incident reports for patient safety. Stud Health Technol Inform 2012; 180: 250–254.

44. Wang, Coiera, Runciman, et al. Using multiclass classification to automate the identification of patient safety incident reports by type and severity. BMC Med Inform Decis Mak 2017; 17: 84.

45. Alghoson AM. Medical document classification based on MeSH. New York: IEEE, 2013, pp. 2571–2575.

46. Stanfill, Williams, Fenton, et al. A systematic literature review of automated clinical coding and classification systems. J Am Med Inform Assoc 2010; 17: 646–651.

47. Warrer, Hansen, Juhl-Jensen, et al. Using text-mining techniques in electronic patient records to identify ADRs from medicine use. Br J Clin Pharmacol 2012; 73: 674–684.

48. Oster Jr, Strong, Zorn CK. Analyzing aviation safety: problems, challenges, opportunities. Res Transp Econ 2013; 43: 148–164.

Automated classification of primary care patient safety incident report content and severity using supervised machine learning (ML) approaches
