Abstract
Information leakage and model attacks pose risks in the analysis and exchange of medical data, because the language models used to process medical records may retain training data. Traditional de-identification models, meanwhile, are often overly complex and rely on outdated, ineffective methods for removing personal data. This can compromise the data's integrity and quality, making it less useful for downstream tasks, especially when combined with other language models. This paper introduces the Self-Decoded Model of Medical De-identification (SDM-M-DID). The model employs a secure BERT-based encoder to paraphrase sensitive data, ensuring HIPAA compliance. Unlike traditional models that only mask sensitive tokens, SDM-M-DID decodes its own embeddings to generate internal representations of these tokens. It then integrates these representations with the pre-trained BERT dictionary to rephrase the tokens, preserving their semantic role while altering their grammar to prevent re-identification. Compared with existing large language models, our model achieves a BERTScore F1 of 0.8416, striking an optimal balance between the variability and similarity of de-identified tokens. We conducted experiments on two medical datasets to demonstrate the effectiveness of the model. Metrics show that there is only a
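As an illustration of the distinction the abstract draws between masking and paraphrase-based de-identification, the sketch below contrasts the two on a toy record. The sensitive-token set and the replacement table are hypothetical stand-ins: in SDM-M-DID, surrogates come from the model's own decoded embeddings combined with the BERT dictionary, not from a fixed lookup.

```python
# Illustrative sketch only -- NOT the SDM-M-DID implementation.
# SENSITIVE and PARAPHRASE are hypothetical; the paper's model derives
# replacements from decoded embeddings plus the pre-trained BERT dictionary.

SENSITIVE = {"John", "Smith", "Boston"}

# Hypothetical surrogate table standing in for embedding-based rephrasing.
PARAPHRASE = {"John": "the patient", "Smith": "", "Boston": "an urban clinic"}

def mask_tokens(tokens):
    """Traditional masking: replace each sensitive token with a placeholder."""
    return [("[MASK]" if t in SENSITIVE else t) for t in tokens]

def rephrase_tokens(tokens):
    """Paraphrase-style de-identification: swap in a semantically similar
    surrogate so the sentence stays fluent and useful downstream."""
    out = []
    for t in tokens:
        if t in SENSITIVE:
            surrogate = PARAPHRASE.get(t, "[REDACTED]")
            if surrogate:  # empty string means the token is simply dropped
                out.append(surrogate)
        else:
            out.append(t)
    return out

record = "John Smith was admitted in Boston".split()
print(" ".join(mask_tokens(record)))      # masking destroys fluency
print(" ".join(rephrase_tokens(record)))  # paraphrasing keeps semantic role
```

Masking yields `[MASK] [MASK] was admitted in [MASK]`, whereas rephrasing keeps a readable sentence, which is the property the BERTScore comparison in the abstract is measuring.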
