At the 2024 Kaggle AutoML Grand Prix — a $75,000 competition featuring hundreds of teams including top AutoML practitioners and Kaggle grandmasters — our fully automated framework placed 10th, making it the only automated agent to score points in the competition. This achievement validated our answer to a question we'd been pursuing: could we eliminate not just the model selection and hyperparameter tuning typically associated with AutoML, but the coding itself?
The promise of automated machine learning has always been democratization. Yet most AutoML tools still require users to write code, prepare data structures, and understand ML workflows. For domain experts without programming backgrounds — scientists analyzing experimental data, analysts building forecasting models, or researchers working with image collections — this coding requirement creates an unnecessary barrier.
We designed AutoGluon Assistant to remove this barrier. Built on MLZero, a novel multiagent system powered by large language models, AutoGluon Assistant transforms natural-language descriptions into trained machine learning models across tabular, image, text, and time series data. The system achieved a 92% success rate on our Multimodal AutoML Agent Benchmark and 86% on the external MLE-bench Lite, with leading performance in both success rate and solution quality.
A multiagent architecture for true automation
Traditional AutoML tools assume clean, structured inputs and users capable of invoking APIs correctly. Real-world ML problems begin with messier realities: ambiguous data files, unclear task definitions, and users who may not know whether they need classification or regression. MLZero addresses this through a multiagent architecture where specialized components powered by large language models from Amazon Bedrock collaborate to transform raw inputs into working solutions.
For example, consider a medical researcher who uploads chest x-ray images with segmentation masks, describing the goal as "locate disease regions in x-rays." The perception module identifies pixel-level segmentation as the task, semantic memory selects AutoGluon's MultiModalPredictor for semantic segmentation, and the iterative coding module generates and refines code. When the initial attempt encounters mask format incompatibilities, episodic memory provides debugging context to adjust preprocessing and postprocessing, successfully training a segmentation model — all without the researcher writing any code.
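In a case like this, the code the system converges on might look roughly like the sketch below, which follows AutoGluon's MultiModalPredictor interface for semantic segmentation; the file paths and column names are illustrative assumptions, and exact arguments can vary across AutoGluon versions.

```python
import pandas as pd
from autogluon.multimodal import MultiModalPredictor

# Illustrative assumption: a table pairing each X-ray with its segmentation mask.
train_df = pd.DataFrame({
    "image": ["xrays/img_001.png", "xrays/img_002.png"],  # hypothetical image paths
    "label": ["masks/img_001.png", "masks/img_002.png"],  # hypothetical mask paths
})

# Semantic segmentation in AutoGluon Multimodal (backed by the SAM model).
predictor = MultiModalPredictor(
    problem_type="semantic_segmentation",
    label="label",
)
predictor.fit(train_data=train_df)

# Predict disease-region masks for new scans.
masks = predictor.predict(pd.DataFrame({"image": ["xrays/img_100.png"]}))
```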
The system comprises four core modules: perception, semantic memory, episodic memory, and iterative coding. The perception module interprets arbitrary data inputs, parsing file structures and content to build a structured understanding regardless of format inconsistencies or ambiguous naming. When users provide CSV files without clearly indicating the target variable, perception analyzes column distributions and semantics to infer the task structure.
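As a rough illustration of the kind of inference perception performs, the following sketch guesses a plausible target column and task type from column statistics. It is a simplified stand-in rather than MLZero's actual perception logic, and the column-name hints, threshold, and file name are assumptions made for the example.

```python
import pandas as pd

def infer_task(df: pd.DataFrame) -> dict:
    """Toy heuristic: pick a likely target column and task type from column stats."""
    # Prefer columns whose names hint at a label; otherwise fall back to the last column.
    hints = ("target", "label", "class", "outcome")
    target = next((c for c in df.columns if str(c).lower() in hints), df.columns[-1])

    # Few distinct values or a non-numeric dtype suggests classification; otherwise regression.
    col = df[target]
    task = "classification" if col.dtype == object or col.nunique() <= 20 else "regression"

    return {"target": target, "task": task,
            "features": [c for c in df.columns if c != target]}

# Example: a CSV with no documentation of which column is the label.
df = pd.read_csv("experiment_results.csv")  # hypothetical file
print(infer_task(df))
```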
The semantic-memory module enriches the system with knowledge of ML libraries, maintaining structured information about AutoGluon's capabilities, API patterns, and best practices. Rather than requiring users to know that semantic-segmentation tasks require the SAM model in AutoGluon Multimodal, semantic memory enables the system to select appropriate tools based on task characteristics.
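One way to picture semantic memory is as a structured lookup from task characteristics to tooling, along the lines of the sketch below; the entries and the matching rule are illustrative, not the system's actual knowledge base.

```python
# Hypothetical semantic-memory entries mapping task signatures to AutoGluon tools.
KNOWLEDGE_BASE = [
    {
        "task": "semantic_segmentation",
        "modality": "image",
        "tool": "autogluon.multimodal.MultiModalPredictor",
        "notes": "Set problem_type='semantic_segmentation'; backed by SAM.",
    },
    {
        "task": "classification",
        "modality": "tabular",
        "tool": "autogluon.tabular.TabularPredictor",
        "notes": "Pass the inferred target column as `label`.",
    },
]

def select_tool(task, modality):
    """Return the first knowledge entry matching the task signature, if any."""
    return next(
        (e for e in KNOWLEDGE_BASE if e["task"] == task and e["modality"] == modality),
        None,
    )

print(select_tool("semantic_segmentation", "image")["tool"])
```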
Episodic memory maintains chronological execution records, tracking what the system has attempted, what succeeded, and what failed. When code execution produces errors, this module provides debugging context by surfacing relevant previous attempts and their outcomes. This addresses the iterative nature of ML development, where solutions emerge through refinement rather than appearing fully formed.
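A minimal way to represent such records, assuming a simple append-only log rather than MLZero's actual data structures, might look like this:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Attempt:
    """One execution attempt: the code that ran and what came back."""
    code: str
    succeeded: bool
    output: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class EpisodicMemory:
    """Append-only, chronological record of attempts used for debugging context."""
    def __init__(self):
        self.attempts: list[Attempt] = []

    def record(self, attempt: Attempt) -> None:
        self.attempts.append(attempt)

    def debugging_context(self, last_n: int = 3) -> str:
        """Summarize recent failures to include in the next code-generation prompt."""
        failures = [a for a in self.attempts if not a.succeeded][-last_n:]
        return "\n\n".join(f"Attempt failed with:\n{a.output}" for a in failures)
```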
The iterative-coding module implements a refinement process with feedback loops and augmented memory. Generated code executes, produces results or errors, and informs subsequent attempts. This continues until either successful execution or a maximum iteration limit, with optional per-iteration user input for guidance when needed. The architecture maintains high automation while preserving flexibility for human oversight.
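Put together, the refinement loop behaves roughly like the sketch below, which reuses the EpisodicMemory and Attempt classes from the previous sketch. Here generate_code and execute are placeholders for the LLM-backed code generator and the sandboxed executor, and the iteration limit is illustrative rather than MLZero's actual default.

```python
MAX_ITERATIONS = 5  # illustrative budget

def iterative_coding(task_spec, generate_code, execute, semantic_memory,
                     episodic_memory, get_user_hint=None):
    """Generate, run, and refine code until it succeeds or the budget runs out."""
    for _ in range(MAX_ITERATIONS):
        hint = get_user_hint() if get_user_hint else ""  # optional expert guidance

        code = generate_code(
            task=task_spec,
            tooling=semantic_memory,                      # what the libraries offer
            history=episodic_memory.debugging_context(),  # why earlier attempts failed
            user_hint=hint,
        )
        succeeded, output = execute(code)                 # run the generated script
        episodic_memory.record(Attempt(code=code, succeeded=succeeded, output=output))

        if succeeded:
            return code, output
    raise RuntimeError("No working solution within the iteration limit")
```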
Through this comprehensive system, MLZero bridges the gap between noisy raw data and sophisticated ML solutions. The multiagent collaboration pattern proves effective across modalities because the architecture separates concerns — understanding data, knowing capabilities, tracking history, and generating code — that traditionally intertwine in single-agent systems.
Breaking down results
To validate our system against an established, external standard, we first evaluated it on MLE-bench Lite. This benchmark consists of 21 diverse challenges from previous Kaggle competitions, allowing us to directly compare our model's performance to that of other leading automated systems. Our model achieved the highest success rate, 86%, meaning it completed and submitted valid solutions for 18 of the 21 challenges. It secured the top position in overall solution quality, with an average rank of 1.43 in the standings, compared to the next-best agent's 2.36. Our agent won six gold medals and outperformed all competitors in total medal count across the benchmark's challenges.
After proving our model's capabilities on an existing benchmark, we further tested it on our own Multimodal AutoML Agent Benchmark, a more challenging suite of 25 diverse tasks with less-processed datasets, where data is closer to its raw form, with more noise, format inconsistencies, and ambiguities. The benchmark spans multiple data modalities (tabular, image, text, document), problem types (classification, regression, retrieval, semantic segmentation), and challenging data structures (multilingual, multitable, and large-scale datasets). AutoGluon Assistant (powered by MLZero) achieved a 92% success rate across all tasks. Even when implemented with a compact, eight-billion-parameter LLM, the system achieved a 45.3% success rate, proving more effective than many larger, more resource-intensive agents.
Accessible interfaces for diverse workflows
AutoGluon Assistant supports multiple interaction modes to fit different user preferences and workflows. Users can invoke the system through a command-line interface for quick automation tasks, a Python API for integration into existing data pipelines, or a Web UI for visual interaction and monitoring; they can also use the Model Context Protocol (MCP) to integrate it with other agentic tools. This flexibility ensures that users who prefer scripting, graphical interfaces, or programmatic control all access the same underlying automation capabilities.
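As a rough sketch of what programmatic use could look like, the snippet below is illustrative only: the import, function name, and arguments are placeholders rather than the published AutoGluon Assistant API, which is documented in the project's README.

```python
# Hypothetical entry point; the actual AutoGluon Assistant API may differ.
from autogluon.assistant import run_assistant  # placeholder import

result = run_assistant(
    input_dir="./my_dataset",                          # raw files, any supported modality
    instructions="Locate disease regions in X-rays.",  # natural-language task description
    max_iterations=5,                                  # refinement budget
)
print(result)  # e.g., where the trained model and evaluation summary were saved
```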
The system also supports optional per-iteration user input, allowing domain experts to inject specialized knowledge during iterative refinement while maintaining automation for routine use. When working with medical imaging data, for instance, experts might guide the system toward custom normalizations specific to their scanning protocols. Episodic memory tracks these interventions alongside system-generated attempts, creating a collaborative dynamic where automation handles mechanical complexity while users contribute strategic direction when they possess relevant insights.
The system is open source and available on GitHub, with technical details published in our NeurIPS 2025 paper.