Abstract
After a U.S. Coast Guard (USCG) search and rescue (SAR) case, USCG personnel write an after-action report containing a textual narrative of the situation and the Coast Guard response. Data analysts explored how to identify reports describing cases with a verified person in the water. Because restricted access to compute resources and restrictive policy ruled out large language models (LLMs), statistical (‘classical’, non-neural) methods were used to train a classification model that identifies SAR case outcomes from report text. The dataset was severely imbalanced toward the negative class, and the texts were extremely noisy, with many typos and abbreviations, so an extensive text-cleaning pipeline was developed and evaluated for its effect on classification performance. The Iterative Token Elimination Algorithm (iTEA) was developed to increase vocabulary differences between classes, and model improvement was further explored by augmenting the feature space with non-text data. The best model, an XGBoost classifier, achieved 0.762 recall and precision (and 0.959 accuracy). Errors on the test set are analyzed to guide future improvements until LLMs can be adopted, which is expected to improve performance and reduce text-cleaning requirements.
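The abstract does not specify iTEA's mechanics, so the stdlib-only sketch below illustrates just one plausible reading of "increasing vocabulary differences between classes": dropping tokens whose relative frequency is similar in both classes, keeping only discriminative vocabulary. The function names (`eliminate_common_tokens`, `filter_doc`) and the `ratio_band` threshold are hypothetical, not taken from the paper.

```python
from collections import Counter

def eliminate_common_tokens(pos_docs, neg_docs, ratio_band=(0.5, 2.0)):
    """Return the set of tokens whose per-class relative frequencies
    differ enough to discriminate between classes; tokens with a
    frequency ratio inside ratio_band are eliminated."""
    pos_counts = Counter(tok for d in pos_docs for tok in d.split())
    neg_counts = Counter(tok for d in neg_docs for tok in d.split())
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    keep = set()
    for tok in set(pos_counts) | set(neg_counts):
        # Add-one smoothing so tokens unseen in one class get a finite ratio.
        p = (pos_counts[tok] + 1) / (pos_total + 1)
        n = (neg_counts[tok] + 1) / (neg_total + 1)
        ratio = p / n
        if not (ratio_band[0] <= ratio <= ratio_band[1]):
            keep.add(tok)
    return keep

def filter_doc(doc, keep):
    """Remove eliminated tokens from a document before vectorization."""
    return " ".join(tok for tok in doc.split() if tok in keep)
```

On this view, the surviving vocabulary would then feed a bag-of-words or TF-IDF representation for the downstream XGBoost classifier; the actual iTEA criterion and its iteration scheme may differ.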
