BACKGROUND:
This cross-sectional retrospective study used natural language processing (NLP) to extract tobacco-use-associated variables from clinical notes documented in the electronic health record (EHR).
OBJECTIVE:
To develop a rule-based algorithm for determining a patient’s current tobacco-use status.
METHODS:
Clinical notes (5,371 documents) from 363 patients were mined and classified by NLP software into four classes: “Current Smoker”, “Past Smoker”, “Nonsmoker”, and “Unknown”. Two coders manually classified these documents into the same four classes, producing the document-level gold standard classification (DLGSC). The same two coders then derived one tobacco-use status per patient from the statuses of that patient’s individual documents, producing the patient-level gold standard classification (PLGSC). The DLGSC was compared with the NLP output, and the PLGSC with the output of the rule-based algorithm.
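The abstract does not spell out the rules used to derive a patient-level status from document-level labels. The sketch below is only an illustration under one simple assumption: the most recent document with an informative (non-“Unknown”) label determines the patient’s present status. The function name `patient_status`, the `(note_date, status)` tuple layout, and the rule itself are hypothetical and are not taken from the study.

```python
from datetime import date

UNKNOWN = "Unknown"


def patient_status(documents):
    """Derive one status per patient from (note_date, status) pairs.

    Assumed rule (hypothetical, not the study's rule set): the latest
    document whose status is not "Unknown" wins; if every document is
    "Unknown", the patient is "Unknown".
    """
    # Keep only documents that carry an informative label.
    informative = [(d, s) for d, s in documents if s != UNKNOWN]
    if not informative:
        return UNKNOWN
    # Pick the status attached to the most recent informative note.
    return max(informative, key=lambda item: item[0])[1]


notes = [
    (date(2012, 3, 1), "Current Smoker"),
    (date(2013, 6, 15), "Past Smoker"),
    (date(2014, 1, 9), "Unknown"),
]
status = patient_status(notes)
```

A real rule set would likely need to handle conflicting labels within the same visit and implausible transitions (for example, “Nonsmoker” following “Current Smoker”); those cases are deliberately left out of this sketch.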
RESULTS:
The initial Cohen’s kappa (1,000 documents) was 0.9448 (95% CI 0.9281–0.9615), indicating strong agreement between the two raters. Subsequently, for 371 documents, Cohen’s kappa was 0.9889 (95% CI 0.979–1.000). The F-measures for the four classes were 0.700, 0.753, 0.839, and 0.988 at the document level and 0.580, 0.771, 0.730, and 0.933 at the patient level, respectively.
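For readers unfamiliar with the two statistics reported here, the following minimal sketch shows how Cohen’s kappa (inter-rater agreement corrected for chance) and a per-class F-measure (harmonic mean of precision and recall) can be computed from parallel label lists. The function names and the toy labels are illustrative assumptions; the study’s own tooling and its confidence-interval method are not described in the abstract.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def f_measure(gold, predicted, label):
    """F-measure (F1) for one class against a gold standard."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (not study data).
coder_1 = ["Current Smoker", "Current Smoker", "Nonsmoker", "Nonsmoker"]
coder_2 = ["Current Smoker", "Nonsmoker", "Nonsmoker", "Nonsmoker"]
kappa = cohen_kappa(coder_1, coder_2)
f1_current = f_measure(coder_1, coder_2, "Current Smoker")
```

In this toy case the coders agree on three of four items, and the F-measure treats the first list as the gold standard and the second as the prediction.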
CONCLUSIONS:
NLP and the rule-based algorithm exhibited utility for deriving the present tobacco-use status of patients. Ongoing work targets further improvement in precision to enhance the tool’s translational value.