Secure and useful models are reasonable: Aligning code models via utility-preserving reasoning
2025

Warning: This report contains partially redacted content that may be offensive to the reader.
Large language models (LLMs) may assist users with malicious cybersecurity attacks or inadvertently generate code with critical security flaws. These failures stem from their broader inability to reliably identify safe data or generate safe outputs, despite advances in alignment research. We identify three potential contributors to this problem: (1) LLMs are expected to respond immediately, without consideration of safety implications; (2) they must infer applicable safety principles solely from training data; and (3) they lack mechanisms to reflect on and revise potentially unsafe responses. To address these challenges, we draw on dual-system theory, combining fast, intuitive responses (system 1) with slower, analytical reasoning (system 2). Building on deliberative alignment, we equip system 2 with an explicit safety specification, enabling the model to reason over concrete safety policies rather than inferring them implicitly. To support reflection and self-correction, we introduce a vulnerable-code refiner module that reviews and fixes the model’s outputs, trained with reinforcement learning guided by verifiable security signals from a static analysis tool. Our method achieves strong empirical performance, including an 86.6% defense success rate in handling malicious prompts and avoiding vulnerable code, while preserving utility. We conclude with early insights on viewing alignment as an emergent capability and propose a method for enhancing refiner robustness via adversarial reinforcement learning.
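As a rough illustration of the "verifiable security signal" mentioned above, the sketch below shows one way a reward for the refiner could be derived from static-analysis findings on a generated code sample. The choice of analyzer (Bandit, used here as a stand-in), the `security_reward` helper, and the per-finding penalty of 0.25 are assumptions made for illustration only; the paper's actual tool and reward shaping may differ.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def security_reward(code: str) -> float:
    """Verifiable security signal for RL: 1.0 if the static analyzer
    reports no findings, reduced for each reported issue.
    Bandit is used here purely as a stand-in for the paper's analyzer."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(code)
        # `bandit -f json <target>` emits its findings as JSON on stdout.
        proc = subprocess.run(
            ["bandit", "-f", "json", str(path)],
            capture_output=True,
            text=True,
        )
        try:
            findings = json.loads(proc.stdout).get("results", [])
        except json.JSONDecodeError:
            # Treat analyzer failure as an unverified sample: no reward.
            return 0.0
    # Simple shaping (assumed): each reported issue lowers the reward.
    return max(0.0, 1.0 - 0.25 * len(findings))
```

In training, such a signal would presumably be combined with utility checks (e.g., functional tests) so the refiner does not trade correctness for an empty analyzer report, in line with the utility-preserving goal stated above.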