Two-Stage Fine-tuning CLIP by Introducing Structure Knowledge for Few-shot Classification
Abstract
Abstract concepts that carry structural information, such as tangrams, are often used in cognitive psychology to explore spatial reasoning and visual cognition. Inspired by this, we propose a simple but effective fine-tuning method called two-stage fine-tuning contrastive language-image pretraining (TSF-CLIP). In stage I, CLIP is fine-tuned on an image-text matching task built on the tangram dataset, so that both of its encoders capture structural prior knowledge. In stage II, to further improve accuracy, a linear head is aligned to the domain of the specific downstream task. TSF-CLIP dynamically integrates structural prior knowledge with semantic information, mitigates two shortcomings of large models (poor spatial-logical reasoning and an excessive number of fine-tuning parameters), and improves the model's adaptability to different downstream tasks. Experimental results show that TSF-CLIP substantially boosts the performance of the target model and outperforms existing few-shot image classification approaches. Compared to the original CLIP, the average accuracy of TSF-CLIP improves by 16.1% across 10 image recognition datasets. The code and related datasets can be found at https://github.com/Patrickeroo/TSF-CLIP.
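To make the stage I objective more concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss for image-text matching. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so the symmetric cross-entropy form, the temperature of 0.07, and the function names (`l2_normalize`, `log_softmax`, `symmetric_contrastive_loss`) are all assumed here. Plain Python lists stand in for the encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Assumed CLIP-style loss: the i-th image and i-th text form the positive pair.

    The temperature value is an assumption, not taken from the abstract.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)

    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy features: matched pairs should give a lower loss than mismatched ones.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
uniform = symmetric_contrastive_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

In this sketch, minimizing the loss pulls each tangram image toward its own description and away from the other descriptions in the batch. Stage II, as the abstract describes it, would then fit a linear head on the downstream task using the encoders from stage I.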

