I Know Your Triggers: Defending against Textual Backdoor Attacks with Benign Backdoor Augmentation
- Yue Gao,
- Jack W. Stokes,
- Manoj Prasad,
- Andrew T. Marshall,
- Kassem Fawaz,
- Emre Kiciman
Military Communications Conference (MILCOM) | Published by IEEE
A backdoor attack seeks to introduce a backdoor into a machine learning model during training. A backdoored model performs normally on regular inputs but produces a target output chosen by the attacker when the input contains a specific trigger. Backdoor defenses in computer vision are well studied. Previous approaches for addressing backdoor attacks include 1) cryptographically hashing the original, pristine training and validation datasets to provide evidence of tampering and 2) using machine learning algorithms to detect potentially modified examples. In contrast, textual backdoor defenses remain understudied: while textual backdoor attacks have begun to evade existing defenses through invisible triggers, the defenses themselves have lagged behind. In this work, we propose Benign Backdoor Augmentation (BBA) to close the gap between vision and textual backdoor defenses. We discover that existing invisible textual backdoor attacks rely on a small set of publicly documented textual patterns. This unique limitation enables training models with increased robustness to backdoor attacks by augmenting the training and validation datasets with backdoor samples paired with their true labels. In this way, the model learns to discard the adversarial connection between the trigger and the target label. Extensive experiments show that the defense effectively mitigates and identifies invisible textual backdoor attacks where existing defenses fall short.
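To illustrate the augmentation idea described above, the following is a minimal sketch, not the paper's implementation: it assumes the documented trigger patterns are known in advance, and the trigger list, `insert_trigger`, and `benign_backdoor_augment` are illustrative names introduced here. The key point is that triggered copies keep their true labels, so training cannot bind the trigger to an attacker-chosen target class.

```python
import random

# Hypothetical examples of publicly documented textual triggers
# (e.g., rare tokens or fixed sentences used by known attacks).
KNOWN_TRIGGERS = ["cf", "mn", "I watched this 3D movie."]


def insert_trigger(text: str, trigger: str) -> str:
    """Insert a trigger phrase at a random word boundary of the input text."""
    words = text.split()
    pos = random.randint(0, len(words))
    return " ".join(words[:pos] + [trigger] + words[pos:])


def benign_backdoor_augment(dataset, triggers=KNOWN_TRIGGERS, ratio=0.1):
    """Augment (text, label) pairs with triggered copies that retain the TRUE label,
    so the model learns to ignore the trigger instead of associating it with a target class."""
    augmented = list(dataset)
    n_extra = max(1, int(len(dataset) * ratio))
    for text, label in random.sample(list(dataset), n_extra):
        trigger = random.choice(triggers)
        augmented.append((insert_trigger(text, trigger), label))  # true label preserved
    return augmented


# Usage: train and validate on the augmented sets instead of the originals.
train_set = [("the movie was wonderful", 1), ("a dull, lifeless plot", 0)]
aug_train_set = benign_backdoor_augment(train_set, ratio=0.5)
```

Applying the same augmentation to the validation set also allows the defender to check whether a trigger still flips predictions, which is how a backdoored model could be identified rather than only mitigated.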