WARNING: Contains harmful content that can be offensive in nature
Large language models (LLMs) for code generation have enhanced developer productivity while introducing new misuse vectors, as these models can generate potentially harmful code. Existing evaluation methods fail to assess such misuse scenarios, and structured red teaming pipelines for code generation remain underdeveloped. This paper presents COMET, a systematic, closed-loop red teaming framework developed by Team ASTRO for the Amazon Nova AI Challenge (1). COMET integrates five components that iteratively refine adversarial probes based on empirical feedback. At its core is a dual-pipeline adversarial generator that combines structured generation techniques targeting both vulnerable and malicious code through systematic dimensional parameterization across four methods, and utility dataset mutations that exploit utility alignment gaps through three complementary techniques to create stealthy, effective adversarial probes. These probes are processed by a prompt tuning module, which quantifies adversarial probe effectiveness against commercial code models through a three-stage evaluation process. Refined adversarial probes are deployed through an adaptive planner that probes defense mechanisms and dynamically shifts strategies based on feedback provided by the competition organizers. A surrogate model, trained on curated data, supports scalable offline optimization by simulating defense responses. Empirical results from the Amazon Nova AI Challenge demonstrate COMET's effectiveness in eliciting harmful outputs from guarded code models. We uncover persistent defense vulnerabilities, including failures to detect decomposed multi-step threats, susceptibility to subtly mutated inputs, and limited robustness against compositional adversarial probes. COMET establishes a replicable, data-driven methodology for red teaming code models, advancing safe LLM deployment.
COMET: Closed-loop orchestration for malicious elicitation techniques in code models
2025