Robust Few-Shot Vision-Language Model Adaptation

Hanxin Wang1,†, Tian Liu2,†, Shu Kong1,3,*
1University of Macau, 2Texas A&M University, 3Institute of Collaborative Innovation
†Equal contribution, *Corresponding author

Overview

Pretrained Vision-Language Models (VLMs) achieve strong performance on downstream tasks when adapted with just a few labeled examples. However, the few-shot adapted models inevitably encounter out-of-distribution (OOD) test data that deviates from the in-distribution (ID) task-specific training data.

ablation table

We propose SRAPF, Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning, a robust few-shot VLM adaptation method. It consists of two finetuning stages: (1) partial finetuning of the visual encoder using both ID and retrieved data, followed by (2) adversarial partial finetuning using few-shot ID data. Extensive experiments on ImageNet-based OOD benchmarks demonstrate that SRAPF significantly outperforms existing VLM adaptation methods in both ID and OOD accuracy.

Partial Finetuning improves both ID and OOD performance

We first study which blocks of the visual encoder to tune, using both Contrastive Tuning (CT) and Finetuning (FT). Note that CT also finetunes the textual encoder: when finetuning the top-X blocks of the visual encoder, it also updates the top-X blocks of the textual encoder. As shown in the table below, for both FT and CT, finetuning only the top few blocks yields better ID and OOD accuracy than either finetuning only the top linear layer or finetuning all blocks (i.e., full finetuning). Moreover, with the tuned blocks carefully chosen, CT shows no clear advantage over FT.
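To make this concrete, below is a minimal PyTorch sketch of partial finetuning that unfreezes only the top-X blocks of the CLIP visual encoder. It assumes an open_clip ViT-B/16 checkpoint and its attribute names (visual.transformer.resblocks, ln_post, proj), which may differ across library versions; it is an illustration, not our exact training code.

    import open_clip

    # Load a CLIP ViT-B/16 pretrained on LAION-400M (assumed checkpoint tag).
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-16", pretrained="laion400m_e32")

    def partially_unfreeze_visual(model, num_top_blocks=2):
        """Freeze everything, then unfreeze only the top-X visual blocks."""
        for p in model.parameters():
            p.requires_grad = False
        blocks = model.visual.transformer.resblocks  # 12 blocks for ViT-B/16
        for block in blocks[-num_top_blocks:]:
            for p in block.parameters():
                p.requires_grad = True
        # Keep the final layer norm and projection trainable as well.
        for p in model.visual.ln_post.parameters():
            p.requires_grad = True
        if model.visual.proj is not None:
            model.visual.proj.requires_grad = True

    partially_unfreeze_visual(model, num_top_blocks=2)
    print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")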

PFT table

Retrieval Augmentation and Adversarial Perturbation improve ID and OOD accuracy


Retrieval Augmentation

Retrieval Augmentation (RA) is an established technique that leverages publicly available data to enhance performance on downstream tasks: it retrieves task-relevant examples and uses them to adapt the pretrained model. SWAT reports that the retrieved data has domain gaps w.r.t. the task-specific training data, which can actually benefit the adapted model's OOD generalization.

We adopt a string-matching-based RA approach to retrieve images from the VLM's pretraining dataset, LAION-400M. Incorporating RA into PFT yields significant OOD accuracy gains.
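As a rough sketch of the string-matching retrieval step (not our exact pipeline), one can scan LAION-400M caption metadata and keep samples whose captions mention a downstream class name; the parquet path and column name ("TEXT") below are assumptions about the metadata layout.

    import pandas as pd

    # Assumed LAION-400M metadata shard with a caption column named "TEXT".
    metadata = pd.read_parquet("laion400m_metadata_shard_0.parquet")
    class_names = ["goldfish", "tabby cat", "school bus"]  # downstream class names

    def retrieve_by_string_matching(metadata, class_names, per_class=500):
        """Keep rows whose caption contains a class name (case-insensitive)."""
        captions = metadata["TEXT"].fillna("").str.lower()
        retrieved = {}
        for name in class_names:
            hits = metadata[captions.str.contains(name.lower(), regex=False)]
            retrieved[name] = hits.head(per_class)  # cap the per-class count
        return retrieved

    retrieved = retrieve_by_string_matching(metadata, class_names)
    for name, rows in retrieved.items():
        print(name, len(rows), "captions retrieved")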

RA figure



Adversarial Perturbation

Adversarial Perturbation (AP) perturbs input data by purposefully attacking the learned model. A common approach is projected gradient descent (PGD), which iteratively perturbs an input example in the direction that increases the loss, projecting the perturbation back into a small norm ball. Incorporating AP into training is known to improve robustness to adversarial attacks, but it remains under-explored for OOD generalization.

We apply AP to the visual features (rather than the raw pixels) during PFT for robust few-shot VLM adaptation, yielding remarkable improvements in both ID and OOD accuracy.
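Below is a minimal sketch of feature-space PGD during finetuning; the classifier head, number of steps, step size, and perturbation budget are illustrative placeholders, not our exact settings.

    import torch
    import torch.nn.functional as F

    def feature_pgd_delta(features, labels, classifier, eps=0.1, alpha=0.02, steps=3):
        """Compute an adversarial perturbation in feature space with PGD:
        ascend the loss gradient, then project back into an L-inf ball of radius eps."""
        delta = torch.zeros_like(features)
        for _ in range(steps):
            delta = delta.detach().requires_grad_(True)
            loss = F.cross_entropy(classifier(features.detach() + delta), labels)
            grad, = torch.autograd.grad(loss, delta)
            delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        return delta.detach()

    # Inside a training step (sketch):
    #   feats = model.visual(images)                         # differentiable visual features
    #   delta = feature_pgd_delta(feats, labels, classifier)
    #   loss = F.cross_entropy(classifier(feats + delta), labels)  # trains encoder + head
    #   loss.backward()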

SRAPF: Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning


ablation table

Naive combination

Our experiments reveal that naively combining RA and AP does not necessarily improve both ID and OOD accuracy. We conjecture two reasons:

  1. the retrieved examples are noisy, effectively making them OOD with respect to the task-specific data;
  2. the retrieved data is inherently class-imbalanced.

To address these issues, we propose a stage-wise adaptation pipeline, SRAPF.


Stage-wise finetuning

Through empirical evaluation of stage-wise approaches, and with computational cost in mind, we develop SRAPF (highlighted in blue) as the final solution:

Stage 1: Partial finetuning of the visual encoder using both the few-shot task-specific data and the retrieved data.

Stage 2: Adversarial partial finetuning of the visual encoder using only the few-shot task-specific data.
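Tying the pieces together, a high-level sketch of the two-stage pipeline is given below; partially_unfreeze_visual and feature_pgd_delta are the sketches above, and the model, classifier, and dataset variables are placeholders rather than our released code.

    import torch
    import torch.nn.functional as F
    from torch.utils.data import ConcatDataset, DataLoader

    def finetune_stage(model, classifier, dataset, epochs=10, lr=1e-5, adversarial=False):
        """Run one finetuning stage; only parameters with requires_grad=True are updated."""
        params = [p for p in list(model.parameters()) + list(classifier.parameters())
                  if p.requires_grad]
        optimizer = torch.optim.AdamW(params, lr=lr)
        loader = DataLoader(dataset, batch_size=64, shuffle=True)
        for _ in range(epochs):
            for images, labels in loader:
                feats = model.visual(images)
                if adversarial:
                    # perturb features with PGD, then train on the perturbed features
                    feats = feats + feature_pgd_delta(feats, labels, classifier)
                loss = F.cross_entropy(classifier(feats), labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model, classifier

    # partially_unfreeze_visual(model, num_top_blocks=2)
    # Stage 1: few-shot ID data plus retrieved data.
    # model, classifier = finetune_stage(model, classifier, ConcatDataset([id_set, retrieved_set]))
    # Stage 2: adversarial partial finetuning on the few-shot ID data only.
    # model, classifier = finetune_stage(model, classifier, id_set, adversarial=True)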

SRAPF achieves state-of-the-art performance


SOTA figure

We compare against representative VLM adaptation methods, categorized into Prompt Tuning (PT), Finetuning (FT), Adapter Learning (AL), and Linear Probing (LP). Our method SRAPF significantly outperforms existing adaptation methods in both ID and OOD accuracy.

BibTeX


      @article{wang2025robust,
        title={Robust Few-Shot Vision-Language Model Adaptation},
        author={Wang, Hanxin and Liu, Tian and Kong, Shu},
        journal={arXiv preprint arXiv:2506.04713},
        year={2025}
      }