Abstract
Information leakage and model attacks pose risks in the analysis and exchange of medical data, because the language models used to process medical records may retain training data. Traditional de-identification models, meanwhile, are often overly complex and rely on outdated, ineffective methods for removing personal data. This can compromise the data's integrity and quality, making it less useful for downstream tasks, especially when combined with other language models. This paper introduces the Self-Decoded Model of Medical De-identification (SDM-M-DID). The model employs a secure BERT-based encoder to paraphrase sensitive data, ensuring HIPAA compliance. Unlike traditional models that only mask sensitive tokens, SDM-M-DID decodes its own embeddings to generate internal representations of these tokens. It then integrates these representations with the pre-trained BERT dictionary to rephrase the tokens, preserving their semantic role while altering their grammar to prevent re-identification. Compared with existing large language models, our model achieves a BERTScore F1 of 0.8416, striking an optimal balance between the variability and similarity of de-identified tokens. We conducted experiments on two medical datasets to demonstrate the effectiveness of the model. Metrics show that there is only a
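As an illustration of the distinction the abstract draws between masking and paraphrase-based de-identification, the sketch below contrasts the two on a toy record. The sensitive-token set and the replacement table are hypothetical stand-ins: in SDM-M-DID, surrogates come from the model's own decoded embeddings combined with the BERT dictionary, not from a fixed lookup.

```python
# Illustrative sketch only -- NOT the SDM-M-DID implementation.
# SENSITIVE and PARAPHRASE are hypothetical; the paper's model derives
# replacements from decoded embeddings plus the pre-trained BERT dictionary.

SENSITIVE = {"John", "Smith", "Boston"}

# Hypothetical surrogate table standing in for embedding-based rephrasing.
PARAPHRASE = {"John": "the patient", "Smith": "", "Boston": "an urban clinic"}

def mask_tokens(tokens):
    """Traditional masking: replace each sensitive token with a placeholder."""
    return [("[MASK]" if t in SENSITIVE else t) for t in tokens]

def rephrase_tokens(tokens):
    """Paraphrase-style de-identification: swap in a semantically similar
    surrogate so the sentence stays fluent and useful downstream."""
    out = []
    for t in tokens:
        if t in SENSITIVE:
            surrogate = PARAPHRASE.get(t, "[REDACTED]")
            if surrogate:  # empty string means the token is simply dropped
                out.append(surrogate)
        else:
            out.append(t)
    return out

record = "John Smith was admitted in Boston".split()
print(" ".join(mask_tokens(record)))      # masking destroys fluency
print(" ".join(rephrase_tokens(record)))  # paraphrasing keeps semantic role
```

Masking yields `[MASK] [MASK] was admitted in [MASK]`, whereas rephrasing keeps a readable sentence, which is the property the BERTScore comparison in the abstract is measuring.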
