Sage Journals: Discover world-class research

Abstract

Keywords

machine learning, oropharynx, computer vision

Introduction

Current diagnosis of oropharyngeal squamous cell carcinoma (OPSCC) involves endoscopy, PET, and biopsy of suspicious lesions. Video nasopharyngolaryngoscopy (NPL) is a common, minimally invasive data source frequently used to evaluate the location of the primary OPSCC tumor. However, it is difficult for clinicians to identify OPSCC visually from Video NPL alone, especially when evaluating for recurrence in the posttreatment setting.^1,2 As a result, many patients need an invasive, expensive biopsy to confirm OPSCC. An accurate machine learning (ML) model that identifies OPSCC from Video NPL could spare many patients an invasive procedure, reduce patient anxiety, and reduce the cost of care. Advancements that improve diagnostic capacity with minimal disruption to the current clinical workflow would improve the identification of unknown primary tumors and local surveillance after treatment of OPSCC and other head and neck cancers (HNC).

Over the last decade, advancements in machine learning (ML), combined with the collection of large medical datasets, have increased research into applications of ML in medicine,³ including disease diagnosis, triage and prognostication, clinical decision-making, surgical planning, intraoperative assistance, and patient education.⁴ Between 2005 and 2019, the number of papers applying ML to medicine increased an estimated 61-fold. Evidence supporting the use of ML in cancer diagnosis has surged, particularly for image analysis methods.^5-10

Similarly, research on ML applications in HNC has also increased.³ Studies have used endoscopic data to detect cancer, but this has generally been multispectral narrow band imaging (mNBI).² No previous research has used video endoscopy. In this paper, we propose a novel machine learning application in HNC, using deep learning methods to classify whether there is evidence of OPSCC in video NPL. This pilot work lays the groundwork for a model that could be applied to video laryngoscopy as a diagnostic-assist tool, particularly in cases where the identification of a malignancy may not be readily obvious to the clinician.

Methods

Video NPL recordings were provided by The Ohio State University Wexner Medical Center for 85 patients undergoing treatment or follow-up care for OPSCC from January 2019 to April 2022 (Table 1). We included patients with a diagnosis of oropharyngeal cancer who had recordings of their NPL either prior to treatment or after the completion of definitive radiation ± chemotherapy. It is our institutional standard to attempt to record all initial and follow-up NPL examinations as part of the medical record. Of the videos classified by clinicians, 17 patients showed no evidence of disease in the posttreatment video, 65 patients showed active signs of disease in the pretreatment video, and 3 patients showed recurrence in the posttreatment video, for a total of 68 patients with evidence of disease. All posttreatment examinations took place within 6 months of the end of treatment. All patients were treated only with chemoradiation.

Table 1.

Patient Characteristics.

Characteristic	Patients with No Evidence of Disease at Scope	Patients with Evidence of Disease at Scope
Number	17	68
Age, mean (range)	60.88 (48–84)	60.15 (29–77)
Sex
Male	12	21
Female	5	47
Recurrence of disease at time of scope	0	3

A 2D convolutional neural network (CNN) model was trained using the full Video NPL data for each patient. A CNN is a type of artificial neural network that is well suited to image and video data. An important aspect of the CNN architecture is its ability to learn local features that are relevant to the classification task. Learning local features is critical for the identification of OPSCC, since evidence of malignancy will be present in video frames that also contain normal tissue, and entire frames in the Video NPL are often unimportant to the overall classification. The benefits of using the full video are as follows: (1) experts are not required to create a large, labeled dataset of relevant images from a video; (2) clinicians during practice are not required to guess which video frame is relevant; and (3) the deep learning model takes the entire video and learns which frames are relevant. Figure 1 shows an example of a frame from the Video NPL that shows a tumor in the bottom left corner.

Figure 1.

A frame from a patient’s Video NPL that has a visible tumor in the bottom left corner (green arrow added by authors for clarity). NPL, nasopharyngolaryngoscopy.

Because the CNN processes each frame independently using 2D convolutions, we add a frame-wise, dot-product, multi-head self-attention layer to capture dependencies between frames within the video. Self-attention is an alternative to a recurrent neural network or a 3D convolutional approach for extracting the frame-wise dependencies important to the classification task. One benefit of this approach is that each frame can attend to itself as well as to every other frame, which makes self-attention capable of capturing longer-range dependencies. Figure 2 details our proposed CNN architecture, which we separate into 3 phases for clarity. In Phase A, we apply 2D convolutions to each frame independently, followed by a ReLU nonlinearity, to extract relevant features from each frame. Then, every 2 frames are averaged together using average pooling before being passed through another 2D convolution + ReLU. We average frames to smooth the video and reduce its length before sending it into Phase B. Because the video is continuous, consecutive frames are highly correlated, so averaging results in minimal information loss but a large improvement in the inference time of the overall network. In Phase B, we flatten each frame and pass it through a multi-head self-attention mechanism to model frame-wise dependencies across the entire Video NPL. Then, in Phase C, each frame is reshaped back to the shape it had before the multi-head self-attention layer, every 2 frames are again averaged, and the result is passed through another 2D convolution + ReLU. The entire feature space is then flattened and passed through a fully connected layer for the final prediction of “active cancer” or “no evidence of disease” for the entire Video NPL.
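
To illustrate how self-attention lets every frame weigh every other frame regardless of distance in the video, the following is a minimal single-head sketch in plain Python. It assumes each frame has already been flattened to a feature vector and, for brevity, uses those features directly as queries, keys, and values. The learned projections and multiple heads of the actual layer are omitted, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_self_attention(frames):
    """Scaled dot-product self-attention across flattened frame features.

    frames: list of T feature vectors, each a list of d floats.
    Returns (outputs, weights), where weights[i][j] is how strongly
    frame i attends to frame j.
    Simplification (assumption): one head, no learned projections.
    """
    d = len(frames[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for query in frames:
        # Similarity of this frame to every frame, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in frames]
        w = softmax(scores)
        # Each output is a weighted mix of all frames in the video.
        mixed = [sum(w[j] * frames[j][c] for j in range(len(frames)))
                 for c in range(d)]
        outputs.append(mixed)
        weights.append(w)
    return outputs, weights


# Four toy "frames" with two features each; the first and last are identical.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = frame_self_attention(frames)
```

In this toy input the first frame attends to the last frame exactly as strongly as to itself, even though they sit at opposite ends of the sequence, which is the long-range behavior the paragraph above describes.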

Figure 2.

Our proposed CNN architecture to predict cancer or no cancer in Video NPL. We explain the architecture in 3 phases. Phase A extracts features from each frame independently using a 2D convolution followed by ReLU. Then, every 2 frames are averaged together before applying another 2D convolution + ReLU. Phase B applies our multi-head self-attention layer to capture the frame-wise dependencies. Phase C then averages every 2 frames together, followed by one more 2D convolution + ReLU. Then, the entire feature space is flattened and passed through a fully connected layer to predict the probability of cancer or no cancer. CNN, convolutional neural network; NPL, nasopharyngolaryngoscopy.
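
The frame-averaging step in Phases A and C can be sketched in a few lines of plain Python. This is an illustration only: it treats each frame (or per-frame feature map) as a flat list of numbers, and it assumes that a trailing unpaired frame is dropped, which the text does not specify. The function name is hypothetical.

```python
def average_frame_pairs(frames):
    """Average every 2 consecutive frames (temporal pooling, window 2, stride 2).

    frames: list of T frames, each a flat list of floats.
    Assumption: if T is odd, the last unpaired frame is dropped.
    """
    pooled = []
    for i in range(0, len(frames) - 1, 2):
        first, second = frames[i], frames[i + 1]
        pooled.append([(a + b) / 2 for a, b in zip(first, second)])
    return pooled


# 8 toy frames with 2 features each.
video = [[float(t), float(t) + 0.5] for t in range(8)]
after_phase_a = average_frame_pairs(video)          # 8 frames -> 4 frames
after_phase_c = average_frame_pairs(after_phase_a)  # 4 frames -> 2 frames
```

Applying the step twice, as the architecture does, shortens the sequence by a factor of 4, which is where the inference-time savings come from.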

The CNN was built in PyTorch and trained with early stopping for a maximum of 100 epochs. The objective was to minimize the cross-entropy loss; if the validation loss did not improve for 20 epochs, we stopped training. For training, the Adam optimizer was used with an initial learning rate of 0.001 and a batch size of 4. No data augmentation was used in this study. The model was trained on a single NVIDIA Volta V100 with 32 GB of GPU memory, provided by the Ohio Supercomputer Center. After training, when a new, previously unseen Video NPL is input to the model, the CNN outputs a classification of “active cancer” or “no evidence of disease.”
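
The stopping rule described above (at most 100 epochs, stop after 20 epochs without improvement in validation loss) can be sketched independently of any deep learning framework. The function below is a hypothetical illustration that replays a precomputed list of validation losses rather than training a model; how the best model weights are retained is not covered here.

```python
def simulate_early_stopping(val_losses, max_epochs=100, patience=20):
    """Replay the early-stopping rule on a sequence of validation losses.

    Returns (best_epoch, stopped_epoch), both 1-indexed.
    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss seen so far, or when max_epochs is reached.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, epoch


# Toy curve: loss improves for 10 epochs, then plateaus at a worse value.
plateau_curve = [1.0 - 0.05 * i for i in range(10)] + [0.7] * 200
best_epoch, stopped_epoch = simulate_early_stopping(plateau_curve)

# Toy curve that keeps improving: training runs to the 100-epoch cap.
improving_curve = [1.0 / (i + 1) for i in range(150)]
capped = simulate_early_stopping(improving_curve)
```

With the plateau curve, the best epoch is 10 and training stops at epoch 30, after 20 epochs without improvement.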

To validate generalization performance, we used fivefold stratified cross-validation and aggregated results across folds. For each fold, roughly 68 patients’ Video NPL were used for training and 17 patients’ Video NPL were used to assess validation performance. Cross-validation ensures that every patient in our dataset is used in a validation set exactly once. Because our pilot dataset is small (85 patients), a single train-validation-test split would not leave enough patients in one held-out set to adequately assess the model’s ability to generalize. Additionally, stratification ensured that the proportion of “active cancer” and “no evidence of disease” in each validation set was representative of that seen in the training set. This helps to ensure that we test the ability of our model to generalize on representative samples that match the imbalanced nature of the dataset (eg, there were more patients in the dataset with active signs of disease).
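
As a concrete illustration of what stratification does with this class balance (17 patients with no evidence of disease, 68 with active cancer), the sketch below assigns patient indices to 5 folds in plain Python. It is a simplified stand-in for a library routine, not the splitting code used in the study; the function name and the seed are assumptions.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign patient indices to folds while preserving class proportions.

    labels: one class label per patient.
    Returns a list of n_folds lists of patient indices.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(n_folds)]
    next_fold = 0  # running counter keeps overall fold sizes balanced
    for indices in by_class.values():
        rng.shuffle(indices)
        for idx in indices:
            folds[next_fold % n_folds].append(idx)
            next_fold += 1
    return folds


# 0 = no evidence of disease, 1 = active cancer (counts from Table 1).
labels = [0] * 17 + [1] * 68
folds = stratified_folds(labels)

for fold in folds:
    n_validation = len(fold)
    n_training = len(labels) - n_validation
    n_ned = sum(1 for idx in fold if labels[idx] == 0)
    print(n_training, n_validation, n_ned)
```

Each validation fold ends up with 17 patients, 3 or 4 of whom have no evidence of disease, so every fold reflects the roughly 1:4 class ratio of the full dataset.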

Results

Because of the class imbalance, Table 2 reports sensitivity (recall), specificity, AUC, precision, and the F1 score in addition to accuracy to give a fuller picture of performance.

Table 2.

Cross-Validated Results.

Metric	Cross-Validated Result	ML Interpretation
Accuracy	0.88 ± 0.06	Proportion of patients correctly classified
Recall (sensitivity)	0.90 ± 0.06	Proportion of “active cancer” patients classified correctly
Specificity	0.82 ± 0.2	Proportion of “no evidence of disease” patients classified correctly
AUC	0.82 ± 0.11	An aggregation of performance across all prediction thresholds
Precision	0.94 ± 0.07	Probability that a classification of “active cancer” is actually cancer
F1 Score	0.89 ± 0.10	A balanced score measuring how well the model identifies “active cancer” without sacrificing precision
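
One way to read the AUC row is as the probability that a randomly chosen “active cancer” patient receives a higher predicted score than a randomly chosen “no evidence of disease” patient. The plain-Python sketch below computes AUC through that pairwise interpretation. The labels and scores are made-up toy values, not study data, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for "active cancer", 0 for "no evidence of disease".
    scores: predicted probability of "active cancer" for each patient.
    Tied pairs count as half correct.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example: 3 positives and 2 negatives, one positive ranked too low.
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 5 of 6 pairs ranked correctly
```

Because it depends only on the ranking of scores, this quantity does not change with the classification threshold, which is why it complements the threshold-dependent metrics in the table.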

Our proposed model architecture achieved a cross-validated accuracy of 0.88, sensitivity of 0.90, specificity of 0.82, AUC of 0.82, precision of 0.94, and F1 score of 0.89. Such high precision indicates that we can have confidence when the model predicts evidence of OPSCC: in cross-validation, 94% of predictions of evidence of OPSCC were true positives. Additionally, the high AUC, sensitivity, and specificity mean that we can accurately predict both classes; we are not achieving high accuracy by only predicting evidence of OPSCC, but can also identify patients without evidence of a malignancy. In the future, we expect this method to be a useful, real-time clinical tool, as the network is lightweight and takes only 0.02 s to classify each Video NPL.
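
To make the point about class imbalance concrete, the sketch below computes the threshold-dependent metrics from confusion-matrix counts. The counts for the single fold are hypothetical and chosen only for illustration; the "always positive" case uses the dataset's class totals (68 and 17) to show what a classifier that predicts “active cancer” for everyone would score. The function assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics from confusion-matrix counts, with "active cancer" as positive.

    Assumes each denominator is nonzero.
    """
    sensitivity = tp / (tp + fn)   # recall on "active cancer"
    specificity = tn / (tn + fp)   # recall on "no evidence of disease"
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical validation fold: 14 active cancer, 3 no evidence of disease.
example_fold = classification_metrics(tp=13, fp=1, tn=2, fn=1)

# A classifier that labels all 85 patients "active cancer".
always_positive = classification_metrics(tp=68, fp=17, tn=0, fn=0)
```

The always-positive classifier reaches an accuracy of 0.80 on this dataset while its specificity is 0, which is why accuracy alone would be misleading here and specificity is reported alongside it.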

Discussion

Overall, our pilot results show excellent performance in the ability of the ML model to predict active cancer or no evidence of disease in Video NPL, especially considering the small dataset. While the results are encouraging, performance varies across folds, especially in specificity. We expect this variation to decrease and overall performance to improve with a larger dataset.

This is the first study showing potential use of standard recorded Video NPL to detect head and neck cancer through machine learning. The presented model is a pilot proof-of-concept that demonstrates very promising early capabilities but is being validated on a larger dataset. The goal was to develop a diagnostic aid usable by both ENT and non-ENT experts (eg, radiation oncologists or speech and language pathologists) who do not have the ability to biopsy, especially in posttreatment surveillance and identification of the unknown primary tumor for OPSCC. The long-term vision is to develop a suite of algorithms for numerous clinical applications including cancer detection across the larynx and pharynx, swallowing function, vocal cord paralysis, other benign laryngopharyngeal processes, and detection of clinical and subclinical toxicity from surgery and chemoradiation.

Conclusion

Our results validate that ML can be used to predict whether there is active cancer or no evidence of disease in Video NPL. After verification on a larger dataset, this method and model could be applied to video laryngoscopy as a diagnostic-assist tool to help in cases where a biopsy is not possible.

Limitations

While our pilot results indicate that ML can be used to predict evidence of disease, further validation must be performed to substantiate our findings. For this study, we only had data from The Ohio State University Wexner Medical Center; assessing the ability of the model to generalize across institutions will help us better understand its potential limitations. Additionally, larger datasets will be required to capture the variability within the patient population, preferably including non-oncology patients.

Footnotes

Acknowledgements

None.

Authors’ Contributions

RC contributed to the creation of the model architecture, model training, and model evaluation. AR contributed to the preparation of the dataset. SR, KV, and SK contributed to the conception and design of the experiment. All authors contributed to the preparation of the manuscript.

Availability of data and materials

The datasets generated during and/or analyzed during the current study are not publicly available due to containing protected health information.

Consent for publication

Not applicable.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Ethics approval and consent to participate

Not applicable.

ORCID iDs

Ryan Gifford

Abigail Reid

Sachin R. Jhawar

Kyle VanKoevering

References

1. Muto, Nakane, Katada, et al. Squamous cell carcinoma in situ at oropharyngeal and hypopharyngeal mucosal sites. Cancer. 2004;101(6):1375-1381. doi:10.1002/cncr.20482

2. Mascharak, Baird, Holsinger FC. Detecting oropharyngeal carcinoma using multispectral, narrow-band imaging and machine learning. The Laryngoscope. 2018;128(11):2514-2520. doi:10.1002/lary.27159

3. Mahmood, Shaban, Rajpoot, Khurram SA. Artificial Intelligence-based methods in head and neck cancer diagnosis: an overview. Br J Cancer. 2021;124(12):1934-1940. doi:10.1038/s41416-021-01386-x

4. Kuo RYL, Harrison, Jones, Geoghegan, Furniss. Perspectives: a surgeon’s guide to machine learning. Int J Surg. 2021;94:106133. doi:10.1016/j.ijsu.2021.106133

5. Kourou, Exarchos, Karamouzis, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015;13:8-17. doi:10.1016/j.csbj.2014.11.005

6. Ehteshami Bejnordi, Veta, Johannes van Diest, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318(22):2199-2210. doi:10.1001/jama.2017.14585

7. Bera, Schalper, Rimm, Velcheti, Madabhushi. Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology. Nat Rev Clin Oncol. 2019;16(11):703-715. doi:10.1038/s41571-019-0252-y

8. Zormpas-Petridis, Failmezger, Raza SEA, Roxanis, Jamin, Yuan. Superpixel-based conditional random fields (SuperCRF): incorporating global and local context for enhanced deep learning in melanoma histopathology. Front Oncol. 2019;9:1045. doi:10.3389/fonc.2019.01045

9. Wang, Yang, Rong, et al. Artificial intelligence in lung cancer pathology image analysis. Cancers. 2019;11(11):1673. doi:10.3390/cancers11111673

10. Sirinukunwattana, Ahmed Raza, Yee-Wah Tsang, Snead DRJ, Cree