Abstract
Deep neural networks (DNNs) are vulnerable to backdoor attacks, one of the most serious security risks they face: an adversary implants hidden triggers that, when present at inference time, alter or control model predictions. Many recent backdoor mitigation techniques apply their mitigation strategy to a model without knowing whether it has actually been poisoned, and operate without regard for the target class or classes and the associated triggers, which can degrade the model's accuracy on benign inputs. In this paper, we propose a Multi-phase Hybrid Defense Framework for Backdoor Detection, Mitigation, and Fine-Tuning in DNNs. The first phase employs a combined strategy, Confidence-Weighted Scaled Prediction Consistency and Parameter-oriented Scaling Consistency (CW-SPC-PSC), to accurately detect poisoned data. In the second phase, we apply a three-step mitigation pipeline: initial training with a hybrid loss to break the association between the backdoor target label and the trigger pattern learned by the attacked model; targeted retraining on the detected clean samples to reinforce benign patterns; and backdoor unlearning on the detected poisoned data through adversarial gradient manipulation, so that the model forgets potential trigger patterns. The final phase introduces an improved teacher-student fine-tuning technique that uses Selective Neural Attention Distillation with entropy minimization to remove residual backdoor behavior from the model while preserving its performance on clean samples. Experimental results on standard benchmark datasets against three backdoor attacks show that our hybrid defense significantly reduces backdoor attack success rates while maintaining high classification accuracy, yielding an effective and generalizable defense for backdoor-compromised DNNs.
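The abstract does not spell out the CW-SPC-PSC detector, so the following is only a minimal PyTorch sketch of the scaled-prediction-consistency idea the first phase builds on, with a confidence-weighting term added as we read it. The scaling factors, the [0, 1] clamping, and the interpretation of the score are illustrative assumptions, and the parameter-oriented scaling consistency (PSC) component is omitted.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cw_spc_score(model, x, scales=(2.0, 3.0, 4.0, 5.0)):
    """Confidence-weighted scaled prediction consistency (CW-SPC) score (sketch).

    Amplifies each input's pixel values by several factors and checks whether
    the model's prediction survives the scaling; backdoored samples tend to
    keep their (target-label) prediction under pixel amplification, so a high
    score flags a sample as suspicious.
    """
    model.eval()
    base_pred = model(x).argmax(dim=1)           # predictions on the raw batch
    score = torch.zeros(x.size(0), device=x.device)
    for s in scales:
        scaled = torch.clamp(x * s, 0.0, 1.0)    # pixel amplification, kept in [0, 1]
        probs = F.softmax(model(scaled), dim=1)
        conf, pred = probs.max(dim=1)
        score += conf * (pred == base_pred).float()  # confidence-weighted agreement
    return score / len(scales)                   # in [0, 1]; larger => more suspicious
```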
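Likewise, the hybrid unlearning loss of the second phase is not specified in the abstract. One common way to realize "backdoor unlearning through adversarial gradient manipulation" is to ascend on the loss of detected poisoned samples while descending on the loss of detected clean samples; the sketch below assumes that form, with `lam` a hypothetical weighting coefficient.

```python
import torch

def hybrid_unlearning_step(model, optimizer, clean_batch, poison_batch, lam=0.5):
    """One optimization step of an assumed hybrid objective: minimize
    cross-entropy on detected clean samples while maximizing it on detected
    poisoned samples, so the trigger-to-target association is gradually
    forgotten without sacrificing benign accuracy."""
    criterion = torch.nn.CrossEntropyLoss()
    xc, yc = clean_batch    # detected clean samples and their labels
    xp, yp = poison_batch   # detected poisoned samples and their (attacked) labels

    optimizer.zero_grad()
    loss_clean = criterion(model(xc), yc)        # reinforce benign patterns
    loss_poison = criterion(model(xp), yp)       # backdoor association to erase
    (loss_clean - lam * loss_poison).backward()  # gradient ascent on the poison term
    optimizer.step()
    return loss_clean.item(), loss_poison.item()
```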
