Data augmentation strategies in transfer learning for large language models for enhancing clinical text analysis

Abstract

The volume and complexity of clinical research articles make it difficult for readers to access and analyze the data they contain efficiently. Artificial Intelligence (AI) and Natural Language Processing (NLP) are therefore becoming invaluable for managing this unstructured data. The primary obstacles are the scarcity of high-quality labeled data and the specialized terminology of healthcare, which differs significantly from standard English. Conventional healthcare information extraction systems have relied heavily on humans to write extraction rules by hand or to create tagged training examples. However, given the enormous amount of data available online and the extensive, ambiguous relationships of interest, information extraction needs to move away from models that depend on predefined relationships and high-quality labeled data. This study implements a framework that combines a large language model with data augmentation strategies while adhering to the semantic network of the healthcare domain. The resulting models show a substantial improvement in F1-score.

Keywords

data augmentation, clinical text, large language models, natural language processing, named entity relationship

