BELIEVE: Belief-enhanced instruction generation and augmentation for zero-shot bias mitigation
2024
Language models, pre-trained on large amounts of unmoderated content, have been shown to contain societal biases. Mitigating such biases typically requires access to model parameters and training schemas. In this work, we address bias mitigation at inference time, so that it can be applied to any black-box model. To this end, we propose a belief generation and augmentation framework, BELIEVE, that demonstrates effective bias mitigation for natural language generation by augmenting input prompts with automatically generated instruction-based beliefs. Our framework eases the bottleneck of manually crafting these instruction-based beliefs by extending a recently proposed iterative in-context learning framework (Mehrabi et al., 2023) to automatically generate beliefs via a language model. We assess the impact of this system on fairness and demonstrate effective bias mitigation on pre-trained and instruction-tuned models for both sentiment and regard with respect to multiple protected classes, including race, gender, and political ideology.
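To make the two-stage idea concrete, below is a minimal Python sketch of (1) iteratively asking a language model to propose new instruction-based beliefs from in-context seed examples, and (2) prepending a belief to an input prompt before generation. The function names, the meta-prompt wording, and the stub `generate` callable are illustrative assumptions, not the paper's actual prompts or API; the real framework's belief-generation loop and selection criteria differ in detail.

```python
# Sketch of belief generation and prompt augmentation, assuming a
# black-box text-generation callable `generate`. All names and prompt
# text here are hypothetical stand-ins for the paper's method.

from typing import Callable, List


def augment_with_belief(prompt: str, belief: str) -> str:
    """Prepend an instruction-based belief to the input prompt."""
    return f"{belief}\n\n{prompt}"


def iterative_belief_generation(
    generate: Callable[[str], str],
    seed_beliefs: List[str],
    n_rounds: int = 3,
) -> List[str]:
    """Accumulate beliefs by repeatedly prompting the model with the
    current belief list as in-context examples and asking for one more."""
    beliefs = list(seed_beliefs)
    for _ in range(n_rounds):
        meta_prompt = (
            "Here are instructions that encourage fair, unbiased text:\n"
            + "\n".join(f"- {b}" for b in beliefs)
            + "\nWrite one more such instruction:\n-"
        )
        new_belief = generate(meta_prompt).strip().splitlines()[0]
        beliefs.append(new_belief)
    return beliefs


if __name__ == "__main__":
    # Stand-in for a real black-box model call (e.g., an API client).
    def generate(prompt: str) -> str:
        return "Describe every group respectfully and without stereotypes."

    beliefs = iterative_belief_generation(
        generate, ["Treat all demographic groups equally."], n_rounds=1
    )
    print(augment_with_belief("Write a short story about a nurse.", beliefs[-1]))
```

Because only the input prompt is modified, this kind of augmentation requires no access to model weights or training procedure, which is what makes the approach applicable to black-box models.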