Enabling Validation for Robust Few-Shot Recognition

Overview

Few-Shot Recognition (FSR) tackles classification tasks by training with minimal task-specific labeled data. Prevailing methods adapt or finetune a pretrained Vision-Language Model (VLM) and augment the scarce training data by retrieving task-relevant but noisy samples from open data sources. The finetuned VLM generalizes reasonably well to the task-specific in-distribution (ID) test data but struggles with out-of-distribution (OOD) test data.

The core challenge of FSR is data scarcity, extending beyond limited training data to a complete lack of validation data. We introduce a novel validation strategy that harmonizes performance gain and degradation on the few-shot ID data and the retrieved data, respectively. Our validation enables parameter selection for partial finetuning and checkpoint selection, mitigating overfitting and improving test-data generalization. We unify this strategy with robust learning into a cohesive framework: Validation-Enabled Stage-wise Tuning (VEST). Extensive experiments on the established ImageNet OOD benchmarks show that VEST significantly outperforms existing VLM adaptation methods, achieving state-of-the-art FSR performance on both ID and OOD data.

The dilemma of directly adopting retrieved data for validation

Due to data scarcity, FSR cannot provide a validation set and is especially prone to overfitting. Inspired by the recent work SWAT, which retrieves task-relevant data from open data resources to augment few-shot training images, we exploit such data for validation.

A straightforward idea is to use the retrieved data as the validation set. However, the retrieved data is OOD relative to the task-specific few-shot training data. Consequently, a model finetuned on the ID data performs worse on the retrieved data than it did before finetuning. Validating on the retrieved data alone therefore favors the pretrained model without any finetuning, which prevents finetuning from improving generalization.

Our solution: gF1 score

We present a validation strategy based on "performance gains" on the ID training data and the retrieved data. Specifically, for each checkpoint, say at epoch \(i\), we calculate its training accuracy (\(\text{acc}^i_{ID}\)) and its accuracy on the retrieved data (\(\text{acc}^i_{RT}\)). We place all the checkpoints on a 2D plane whose \(x\) and \(y\) axes specify the two accuracies. Then, for the \(t^{th}\) checkpoint, we compute its "performance gain" on the ID training data as \(\Delta_{ID}^t= \text{acc}^t_{ID} - \min_i(\text{acc}^i_{ID})\), and on the retrieved data as \(\Delta_{RT}^t=\text{acc}^t_{RT}-\min_i(\text{acc}^i_{RT})\). We measure the generalization ability of this checkpoint using the harmonic mean of the two gains (in the spirit of the F1 score), dubbed gF1:

\begin{equation} \text{gF1}^t = 2 \times \frac{\Delta_{ID}^t \times \Delta_{RT}^t}{\Delta_{ID}^t + \Delta_{RT}^t}. \end{equation}
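The gF1 computation and the checkpoint selection it enables can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `gf1_scores` and `select_checkpoint` are hypothetical, the accuracies are made up, and setting gF1 to 0 when both gains are 0 (where the formula is 0/0) is an assumption.

```python
from typing import List, Sequence


def gf1_scores(acc_id: Sequence[float], acc_rt: Sequence[float]) -> List[float]:
    """Return the gF1 score of every checkpoint.

    acc_id[i] and acc_rt[i] are the accuracies of checkpoint i on the
    ID training data and on the retrieved data, respectively.
    """
    if len(acc_id) != len(acc_rt) or not acc_id:
        raise ValueError("need two non-empty accuracy lists of equal length")
    min_id, min_rt = min(acc_id), min(acc_rt)
    scores = []
    for a_id, a_rt in zip(acc_id, acc_rt):
        gain_id = a_id - min_id  # Delta_ID^t
        gain_rt = a_rt - min_rt  # Delta_RT^t
        total = gain_id + gain_rt
        # Assumption: gF1 is taken to be 0 when both gains are 0.
        scores.append(2.0 * gain_id * gain_rt / total if total > 0 else 0.0)
    return scores


def select_checkpoint(acc_id: Sequence[float], acc_rt: Sequence[float]) -> int:
    """Return the index of the checkpoint with the highest gF1."""
    scores = gf1_scores(acc_id, acc_rt)
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up accuracies: index 0 plays the role of the pretrained model,
# indices 1-4 are successive finetuning checkpoints.
acc_id = [0.60, 0.80, 0.90, 0.95, 1.00]
acc_rt = [0.70, 0.68, 0.66, 0.60, 0.50]
scores = gf1_scores(acc_id, acc_rt)
best = select_checkpoint(acc_id, acc_rt)
print(best)  # -> 2
```

In this toy run, the checkpoint with the lowest training accuracy and the checkpoint with the lowest retrieved-data accuracy both receive a gF1 of 0, so the selection lands on an intermediate checkpoint that balances the two gains.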

gF1 enables checkpoint selection

When the VLM is fully finetuned, our validation strategy selects the checkpoint at epoch 3. We verify this selection by comparing all checkpoints with respect to their accuracies on the ID and OOD test data. The results show that the selected checkpoint generalizes robustly to the test data.

gF1 enables layer selection

We extend our validation strategy to decide how many top layers of a Transformer to partially finetune (PFT). In practice, it selects the top-4 layers. The resulting checkpoint accuracies for different \(k\) on the ID and OOD test sets demonstrate that our validation strategy effectively determines the top-\(k\) layers to PFT for better robustness and generalization.
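To make "finetuning the top-\(k\) layers" concrete, the sketch below picks out the parameters that belong to the last \(k\) Transformer blocks from a list of parameter names. It is an illustration under stated assumptions: the `resblocks.<index>.` naming pattern and the helper names `block_index` and `top_k_trainable` are hypothetical, and whether any parameters outside the blocks are also tuned is not specified in the text above.

```python
import re
from typing import Iterable, List, Optional

# Hypothetical naming scheme: "...resblocks.<index>.<rest>".
BLOCK_PATTERN = re.compile(r"resblocks\.(\d+)\.")


def block_index(param_name: str) -> Optional[int]:
    """Return the Transformer block index encoded in a parameter name, if any."""
    match = BLOCK_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def top_k_trainable(param_names: Iterable[str], k: int) -> List[str]:
    """Return the names of parameters inside the last k Transformer blocks."""
    names = list(param_names)
    block_ids = sorted({i for i in map(block_index, names) if i is not None})
    keep = set(block_ids[-k:]) if k > 0 else set()
    return [n for n in names if block_index(n) in keep]


# Toy 12-block encoder with one parameter per block plus two non-block parameters.
names = (
    ["visual.conv1.weight"]
    + [f"visual.transformer.resblocks.{i}.attn.in_proj_weight" for i in range(12)]
    + ["visual.proj"]
)
trainable = top_k_trainable(names, k=4)
```

With a deep-learning framework, the returned names would typically be used to enable gradients for those parameters and freeze all others. Candidate values of \(k\) can then be compared by the gF1 of their resulting checkpoints.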

Retrieval Augmentation and Adversarial Perturbation improve ID and OOD accuracy

Retrieval Augmentation

Retrieval Augmentation (RA) leverages publicly available data to enhance performance on downstream tasks. It retrieves task-relevant examples and uses them to adapt pretrained models.

We adopt a string-matching-based RA approach to retrieve images from LAION-400M, the VLM's pretraining dataset, and use the retrieved data to PFT the VLM. The results of combining RA with PFT show that RA yields significant OOD accuracy gains.
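String-matching retrieval can be illustrated with a toy example that scans image captions for class names. This is a sketch under assumptions, not the retrieval pipeline used in the work: the function name `retrieve_by_string_match`, the case-insensitive whole-word matching rule, and the captions are all made up for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple


def retrieve_by_string_match(
    captions: Iterable[Tuple[str, str]],
    class_names: Iterable[str],
) -> Dict[str, List[str]]:
    """Map each class name to the ids of images whose caption mentions it.

    captions is an iterable of (image_id, caption_text) pairs.
    Matching is case-insensitive and restricted to whole words (an assumption).
    """
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in class_names
    }
    retrieved: Dict[str, List[str]] = {name: [] for name in patterns}
    for image_id, caption in captions:
        for name, pattern in patterns.items():
            if pattern.search(caption):
                retrieved[name].append(image_id)
    return retrieved


toy_captions = [
    ("img_001", "A golden retriever playing in the park"),
    ("img_002", "Tabby cat sleeping on a sofa"),
    ("img_003", "Concatenate two strings in Python"),
    ("img_004", "My Golden Retriever and a tabby cat"),
]
result = retrieve_by_string_match(toy_captions, ["golden retriever", "cat"])
```

A caption that mentions a class name does not guarantee that the image shows that class, and one caption can match several classes, which is one way retrieved data ends up with noisy labels.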

Adversarial Perturbation

Adversarial Perturbation (AP) perturbs input data by purposefully attacking the learned model. One AP method uses projected gradient descent (PGD) on the negative loss function, iteratively perturbing an input example so that the loss increases.

While AP is typically applied to input images, we apply it to the features during PFT for better efficiency, yielding remarkable improvements in both ID and OOD accuracy.
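The idea of running PGD on features rather than pixels can be sketched on a toy model. The example below is an assumption-laden illustration: a binary logistic-regression head stands in for the layers above the perturbed features, the gradient is written out by hand instead of coming from automatic differentiation, and the function names, the L-infinity budget `eps`, the step size, and the number of steps are all hypothetical choices.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def predict(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Probability of the positive class under a logistic-regression head."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def bce_loss(features, weights, bias, label: int) -> float:
    """Binary cross-entropy loss of the head on one feature vector."""
    p = predict(features, weights, bias)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def pgd_on_features(features, weights, bias, label: int,
                    eps: float = 0.1, step: float = 0.02, n_steps: int = 10) -> List[float]:
    """Perturb a feature vector within an L-infinity ball to increase the loss."""
    delta = [0.0] * len(features)
    for _ in range(n_steps):
        perturbed = [f + d for f, d in zip(features, delta)]
        p = predict(perturbed, weights, bias)
        # For logistic regression, d(loss)/d(feature_j) = (p - label) * w_j.
        grad = [(p - label) * w for w in weights]
        # Signed step that increases the loss, then projection onto [-eps, eps].
        delta = [max(-eps, min(eps, d + step * sign(g))) for d, g in zip(delta, grad)]
    return [f + d for f, d in zip(features, delta)]


features = [1.0, -0.5, 2.0]
weights = [0.8, -0.3, 0.5]
bias, label, eps = 0.1, 1, 0.1
adv_features = pgd_on_features(features, weights, bias, label, eps=eps)
clean_loss = bce_loss(features, weights, bias, label)
adv_loss = bce_loss(adv_features, weights, bias, label)
```

Because the perturbation is applied to a feature vector, each PGD step only needs gradients through the layers above that feature, which is consistent with the efficiency motivation stated above.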

VEST: Validation-Enabled Stage-wise Tuning

One-stage training

Our experiments reveal that naively combining RA and AP does not yield remarkable improvements in ID and OOD accuracy. We conjecture that this is because the retrieved examples have:

label noise;
distributional shifts relative to the task-specific training images;
imbalanced class distributions.

To address these issues, we propose a stage-wise finetuning pipeline.

Stage-wise training

Through empirical evaluation of stage-wise approaches, we develop VEST, which combines RA and AP with our validation strategy to PFT the visual encoder in three stages:

Stage 0: Layer selection for PFT with validation.
Stage 1: PFT with both ID few-shot training data and the retrieved data.
Stage 2: PFT with AP on the ID few-shot training data.

VEST achieves state-of-the-art performance

We compare representative VLM adaptation methods categorized into Prompt Tuning (PT), Finetuning (FT), Adapter Learning (AL) and Linear Probing (LP). Our method VEST significantly outperforms existing adaptation methods in both ID and OOD accuracy with a moderate number of learned parameters.

BibTeX


@article{wang2025enabling,
  title={Enabling Validation for Robust Few-Shot Recognition},
  author={Wang, Hanxin and Liu, Tian and Kong, Shu},
  journal={arXiv preprint arXiv:2506.04713},
  year={2025}
}