In this technical report, we present our automated red-teaming framework, designed to jailbreak target code-generating Large Language Models (LLMs) and induce them to produce malicious and vulnerable code. As of May 13, according to the latest competition leaderboard, our solution has achieved top performance in the second tournament.
Our solution consists of three primary modules. First, we developed a strategically designed core foundation LLM capable of generating contextually relevant red-teaming prompts across multi-turn conversations. This module builds on powerful open-source LLMs and incorporates specialized methods to bypass their safety alignment and prevent refusals during jailbreak tasks. Second, we implemented a comprehensive data generation pipeline for creating diverse and strategically critical red-teaming data. The pipeline focuses on generating malicious inputs that closely resemble benign data, targeting the benign-shortcut vulnerabilities inherent in target models; it also produces rare, challenging data that naturally bypasses existing state-of-the-art (SotA) defensive models, thereby testing their out-of-distribution robustness. The third module is our target model simulation mechanism, which constructs surrogate models for effective local testing of our red-teaming strategies. Furthermore, we share detailed insights into our experimental methodologies, including a red-teaming Process Reward Model (PRM) and an in-depth analysis derived from our tournament execution logs.
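As an illustration of how these modules can interact, the sketch below outlines one possible test-time-scaled multi-turn attack loop: at each turn the attacker model proposes several candidate prompts, the surrogate target model answers each one, and the PRM selects the highest-scoring candidate to extend the conversation. The names attacker_generate, surrogate_respond, and prm_score are hypothetical placeholders for the attacker foundation model, the surrogate target, and the red-teaming PRM; this is a minimal sketch under those assumptions, not the actual implementation.

    # Hypothetical sketch of a PRM-guided multi-turn red-teaming loop.
    from typing import Callable, List, Tuple

    Turn = Tuple[str, str]  # (attacker prompt, target response)

    def red_team_episode(
        seed_task: str,
        attacker_generate: Callable[[List[Turn], str], List[str]],
        surrogate_respond: Callable[[List[Turn], str], str],
        prm_score: Callable[[List[Turn], str], float],
        max_turns: int = 5,
        candidates_per_turn: int = 4,
    ) -> List[Turn]:
        """Run one multi-turn attack: at each turn, sample candidate
        red-teaming prompts, score the resulting trajectories with the
        PRM against the surrogate target, and keep the best candidate."""
        history: List[Turn] = []
        for _ in range(max_turns):
            candidates = attacker_generate(history, seed_task)[:candidates_per_turn]
            best = None
            for prompt in candidates:
                response = surrogate_respond(history, prompt)
                score = prm_score(history + [(prompt, response)], seed_task)
                if best is None or score > best[0]:
                    best = (score, prompt, response)
            if best is None:
                break
            history.append((best[1], best[2]))
        return history

Sampling several candidates per turn and letting the PRM pick among them is one way to realize the test-time scaling referred to in the title; the loop itself is agnostic to how the three callables are implemented.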