Fine-tuned vision-language models (VLMs) have shown remarkable performance across many computer vision tasks. However, backpropagation — the standard method for adjusting model weights during fine tuning, which works backward from output error — is computationally expensive and thus impractical on resource-constrained edge devices.
An alternative is to fine-tune using only forward passes, which significantly lowers the computational requirements. Zeroth-order (ZO) estimation is one such method, but existing ZO-based VLM fine-tuning methods remain substantially inferior to backpropagation-based training in both accuracy and convergence speed.
One major challenge is ZO’s high variance, which can make estimated gradients — the directions of weight adjustment resulting from a batch of training data — inconsistent and noisy. This can lead to unstable training dynamics and make it difficult for the model to converge to an optimal solution. Additionally, ZO estimation tends to have local search dynamics, meaning that it may get stuck in locally optimal but globally suboptimal regions of the loss landscape.
In a paper we presented at this year’s Conference on Neural Information Processing Systems (NeurIPS 2025), we propose SharpZO, a hybrid sharpness-aware zeroth-order optimization approach for fine-tuning VLMs using only forward passes. SharpZO uses a two-stage optimization process: (1) a global exploration stage that uses a sharpness-aware evolutionary strategy to smooth the loss landscape and construct a strong initialization, followed by (2) a fine-grained local-search stage that uses a sparse ZO optimizer designed to suppress outlier gradient estimates.
In experiments, SharpZO improved on the accuracy of forward-only methods such as ZIP and BlackVIP by up to 7% on average, and on several tasks, its performance approached that of CoOp, a first-order method that requires backpropagation of gradients.
The loss landscape
Given a model and a set of training data, every possible setting of the model’s parameters (weights and biases) can be mapped to the corresponding loss, or error, on the training data, yielding a single point in a very-high-dimensional space. The graph of parameter settings against loss can be envisioned as a landscape with peaks (high-loss regions) and valleys (low-loss regions). The goal of training is to steer the parameter settings toward the bottom of the lowest valley in the landscape.
Computing the complete landscape is intractable, but given a particular location (set of parameter settings), it’s possible to calculate the local direction of the slope — the gradient — and nudge the loss downhill. That’s how backpropagation works.
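As a toy illustration (not taken from the paper), the snippet below performs a single gradient-descent step on a two-parameter quadratic loss; the analytic gradient used here stands in for the exact gradient that backpropagation provides in first-order fine-tuning. The loss function, learning rate, and starting point are all illustrative.

```python
import numpy as np

# Toy illustration (not from the paper): one gradient-descent step on a
# simple quadratic loss. The exact gradient plays the role that
# backpropagation plays in first-order fine-tuning.

def loss(theta):
    return float(np.sum(theta ** 2))   # a bowl-shaped loss landscape

def grad(theta):
    return 2 * theta                   # exact local slope (the gradient)

theta = np.array([3.0, -2.0])          # current position in the landscape
lr = 0.1                               # step size
theta = theta - lr * grad(theta)       # nudge the parameters downhill
print(loss(theta))                     # the loss is lower after the step
```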
ZO is a method for estimating, rather than calculating, the local gradient, by sampling the loss at various nearby points in the landscape. But the high variance of ZO’s estimates makes the landscape look more jagged — or sharper — than it really is, with more and higher peaks. This increases the chances that the optimization algorithm will get stuck in a local minimum, a local valley where the loss is actually significantly greater than at the global minimum.
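As a minimal sketch of how such an estimate can be formed, here is a generic two-point randomized ZO estimator rather than SharpZO’s exact scheme; the function names, smoothing parameter, and sample count are illustrative assumptions.

```python
import numpy as np

def zo_gradient(loss_fn, theta, mu=1e-3, num_samples=8, rng=np.random.default_rng(0)):
    """Two-point randomized ZO gradient estimate built from loss values alone."""
    grad_est = np.zeros_like(theta)
    for _ in range(num_samples):
        u = rng.standard_normal(theta.shape)                       # random probe direction
        delta = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)
        grad_est += delta * u                                      # project the loss change onto u
    return grad_est / num_samples                                  # still noisy: each probe adds variance

def toy_loss(theta):
    return float(np.sum(theta ** 2))

theta = np.array([3.0, -2.0])
print(zo_gradient(toy_loss, theta))    # noisy estimate of the true gradient [6.0, -4.0]
```

With only a handful of random probes, the estimate fluctuates noticeably from batch to batch, which is exactly the variance problem described above.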
Our approach is to use an evolutionary algorithm — specifically, a sharpness-aware covariance-matrix adaptation evolution strategy (CMA-ES) — to smooth out the sharpness of the loss landscape. Then we use a slightly modified ZO algorithm to find the global minimum.
SharpZO
Rather than estimating just the local gradient, CMA-ES maintains a probability distribution over the whole set of possible parameter values, steering it toward low-loss regions. The distribution is characterized by its mean and its covariance matrix, which describes the correlations between parameter values; both are updated after every round of training.
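For intuition, here is a heavily simplified Gaussian evolution-strategy loop in the spirit of CMA-ES, not the exact algorithm: candidates are sampled from the search distribution, and the mean and covariance are refit to the best performers. The population size, elite fraction, and toy loss are illustrative assumptions.

```python
import numpy as np

def es_step(loss_fn, mean, cov, pop_size=32, elite_frac=0.25, rng=np.random.default_rng(0)):
    """One update of a simple Gaussian evolution strategy (a simplified stand-in for CMA-ES)."""
    samples = rng.multivariate_normal(mean, cov, size=pop_size)        # candidate parameter vectors
    losses = np.array([loss_fn(s) for s in samples])
    elite = samples[np.argsort(losses)[: int(pop_size * elite_frac)]]  # best-performing candidates
    new_mean = elite.mean(axis=0)                                      # shift the distribution toward low loss
    new_cov = np.cov(elite, rowvar=False) + 1e-6 * np.eye(len(mean))   # refit correlations between parameters
    return new_mean, new_cov

mean, cov = np.zeros(2), np.eye(2)
for _ in range(20):
    mean, cov = es_step(lambda t: float(np.sum((t - 1.0) ** 2)), mean, cov)
print(mean)    # approaches the minimizer [1.0, 1.0]
```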
We modify the ordinary CMA-ES approach by including an extra term in the loss function, which accounts for the worst possible loss that the model could incur, given the current estimates of the distribution’s mean and covariance matrix. Minimizing this term helps smooth out the estimated loss landscape.
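One way such a term could look in code, as a rough sketch rather than the paper’s exact objective: a candidate’s fitness is its loss plus an approximate worst-case loss over a few perturbations drawn from the current search covariance, so flat minima are favored over sharp ones. The perturbation radius, sample count, and function names are assumptions.

```python
import numpy as np

def sharpness_aware_fitness(loss_fn, theta, cov, rho=0.05, num_perturb=4,
                            rng=np.random.default_rng(0)):
    """Fitness = loss at theta plus an approximate worst-case loss under small perturbations."""
    base = loss_fn(theta)
    perturbs = rng.multivariate_normal(np.zeros(len(theta)), cov, size=num_perturb)
    worst = max(loss_fn(theta + rho * p) for p in perturbs)   # sampled worst-case neighbor
    return base + worst                                        # sharp minima are penalized more

theta, cov = np.array([1.0, -1.0]), np.eye(2)
print(sharpness_aware_fitness(lambda t: float(np.sum(t ** 2)), theta, cov))
```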
After applying CMA-ES, we use a modified sparse ZO algorithm to do more refined local searches. Traditional sparse ZO reduces the dimensionality of the gradient estimate by tossing out low-magnitude terms. We modify this procedure by normalizing the gradient vector according to its mean and standard deviation, which again helps smooth out the loss landscape.
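A sketch of this local-search step under our own simplifying assumptions (the keep ratio, sample count, and function names are illustrative, not the paper’s implementation): estimate a ZO gradient, zero out the low-magnitude entries, then standardize the entries that survive.

```python
import numpy as np

def sparse_normalized_zo_grad(loss_fn, theta, mu=1e-3, num_samples=8,
                              keep_ratio=0.5, rng=np.random.default_rng(0)):
    """ZO gradient estimate, sparsified and then standardized over the kept entries."""
    grad_est = np.zeros_like(theta)
    for _ in range(num_samples):
        u = rng.standard_normal(theta.shape)
        grad_est += (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu) * u
    grad_est /= num_samples

    k = max(1, int(keep_ratio * theta.size))
    mask = np.zeros(theta.size, dtype=bool)
    mask[np.argsort(np.abs(grad_est))[-k:]] = True                 # keep only large-magnitude entries
    grad_est[~mask] = 0.0

    kept = grad_est[mask]
    grad_est[mask] = (kept - kept.mean()) / (kept.std() + 1e-8)    # normalize the surviving entries
    return grad_est

theta = np.array([3.0, -2.0, 0.5, -0.1])
print(sparse_normalized_zo_grad(lambda t: float(np.sum(t ** 2)), theta))
```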
Evaluation
We evaluated SharpZO on 11 diverse downstream tasks using CLIP models with various backbones. In addition to the average accuracy improvement of up to 7% over forward-only methods such as ZIP and BlackVIP and performance competitive with CoOp, our method converges significantly faster. For example, on the ImageNet dataset, SharpZO reached target accuracy in 15.3 minutes, compared to 19 minutes for ZIP and 170 minutes for BlackVIP.
SharpZO not only reduces the memory footprint by avoiding gradient storage but also ensures that this efficiency does not come at the cost of accuracy. We also found that our method is robust to distribution shifts, performing better than baselines on out-of-distribution tasks such as recognizing sketches (ImageNet-Sketch) or natural adversarial examples (ImageNet-A).
Currently, SharpZO is optimized for prompt tuning, where the number of trainable parameters is relatively small, and scaling to full-model fine-tuning remains a future challenge. Furthermore, the sharpness-aware CMA-ES warmup stage requires coordinate-wise gradient estimation (CGE), which may be computationally expensive in high-dimensional settings. This makes SharpZO a suitable candidate for parameter-efficient fine-tuning (PEFT).
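To see why CGE scales with dimensionality, consider this minimal sketch (illustrative, not the paper’s code): estimating a gradient coordinate by coordinate requires a separate forward pass for every trainable parameter.

```python
import numpy as np

def cge_gradient(loss_fn, theta, mu=1e-3):
    """Coordinate-wise finite-difference gradient: one extra forward pass per parameter."""
    grad_est = np.zeros_like(theta)
    base = loss_fn(theta)
    for i in range(theta.size):            # cost grows linearly with the number of parameters
        e = np.zeros_like(theta)
        e[i] = mu
        grad_est[i] = (loss_fn(theta + e) - base) / mu
    return grad_est

theta = np.array([3.0, -2.0])
print(cge_gradient(lambda t: float(np.sum(t ** 2)), theta))   # close to [6.0, -4.0]
```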
Acknowledgements: This work was done as part of the Amazon-UCSB collaboration. We want to thank Zheng Zhang, Jimmy Kunzmann, and Denis Filimonov for their inputs and valuable discussions.