Two-Stage Fine-tuning CLIP by Introducing Structure Knowledge for Few-shot Classification

Abstract: Abstract concepts that carry structural information, such as tangrams, are often used in cognitive psychology to study spatial reasoning and visual cognition. Inspired by this practice, we propose a simple but effective fine-tuning method called two-stage fine-tuning contrastive language-image pretraining (TSF-CLIP). In stage I, the CLIP encoders are fine-tuned on an image-text matching task built on the tangram dataset, so that both the image encoder and the text encoder capture structural prior knowledge. In stage II, to further improve accuracy, the linear head is aligned to the domain of the specific downstream task. TSF-CLIP dynamically integrates structural prior knowledge with semantic information, mitigates two shortcomings of large models (poor spatial logical reasoning and an excessive number of fine-tuning parameters), and improves the model's adaptability to different downstream tasks. Experimental results show that TSF-CLIP substantially boosts the performance of the target model and outperforms existing few-shot image classification approaches. Compared with traditional CLIP, TSF-CLIP improves average accuracy by 16.1% across 10 image recognition datasets. The code and related datasets are available at https://github.com/Patrickeroo/TSF-CLIP.
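The two stages described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). It assumes that stage I uses the standard symmetric contrastive image-text matching loss from CLIP pretraining, and that stage II trains a softmax linear head on frozen features. The function names, the temperature of 0.07, the learning rate, and the toy vectors are all hypothetical choices made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Stage I (assumed objective): symmetric image-text matching loss.

    image_emb[i] and text_emb[i] are treated as a matching pair; every other
    combination in the batch is treated as a negative.
    """
    img = [normalize(v) for v in image_emb]
    txt = [normalize(v) for v in text_emb]
    n = len(img)
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same idea over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


def linear_scores(W, b, x):
    """Class scores of a linear head: W x + b."""
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_linear_head(features, labels, num_classes, lr=0.5, epochs=200):
    """Stage II (assumed form): fit a softmax linear head on frozen features."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = [math.exp(s) for s in log_softmax(linear_scores(W, b, x))]
            for c in range(num_classes):
                # Gradient of cross-entropy with respect to the class score.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    W[c][d] -= lr * grad * x[d]
                b[c] -= lr * grad
    return W, b


def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    scores = linear_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy stand-ins for encoder outputs (hypothetical values).
aligned_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                        [[0.0, 1.0], [1.0, 0.0]])

# A few labeled examples per class, as in a few-shot setting.
few_shot_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
few_shot_labels = [0, 0, 1, 1]
W, b = train_linear_head(few_shot_features, few_shot_labels, num_classes=2)
```

In this sketch, correctly paired embeddings yield a much smaller contrastive loss than mismatched ones, which is the signal that would drive encoder updates in stage I. The linear head only sees fixed feature vectors, so stage II adds few trainable parameters.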

Springer Science and Business Media LLC

Zhe Zhang Xiang-Gui Guo Junbao Zhuo Huimin Ma

2025

Related Results

The quasi-phase matching optical parametric oscillator tuning methods, i.e. grating period tuning, temperature tuning, pumping wavelength tuning, and angle tuning are more simple a...

Progressive Layer Activation CLIP for Few-Shot and Generalizable Cassava Disease Recognition

Abstract Cassava diseases such as Cassava Mosaic Disease (CMD), Cassava Brown Streak Disease (CBSD), and Cassava Bacterial Blight (CBB) pose serious threats to glob...

EMNet: A Novel Few-Shot Image Classification Model with Enhanced Self-Correlation Attention and Multi-Branch Joint Module

In this research, inspired by the principles of biological visual attention mechanisms and swarm intelligence found in nature, we present an Enhanced Self-Correlation Attention and...

VCP-CLIP+: Stabilizing and Optimizing VCP-CLIP with Minimal Architectural Changes

Zero-shot anomaly segmentation (ZSAS) has significantly advanced with the emergence of vision–language models such as CLIP. Among recent approaches for ZSAS, VCP-CLIP introduced vi...

Depression subtype classification from social media posts: few-shot prompting vs. fine-tuning of large language models

Background Social media provides timely proxy signals of mental health, but reliable tweet-level classification of depression subtypes remains challenging due t...

ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization

Pre-trained vision-language (V-L) models such as CLIP have demonstrated impressive zero-shot performance in many downstream tasks. Since adopting contrastive video-text pairs method...

PC-CLIP: Probabilistic Calibration CLIP for Zero-Shot Semantic Segmentation

Zero-shot semantic segmentation aims to extend the ability of aligning pixels with pre-defined classes to novel unseen classes. Most previous CLIP-based methods use static prompts ...

The CLAMP-Linked Invasion Protein (CLIP) plays an essential role in Plasmodium berghei zoites

Abstract Apicomplexan parasites such as Toxoplasma and Plasmodium spp. re...
