
Two-Stage Fine-tuning CLIP by Introducing Structure Knowledge for Few-shot Classification

Abstract: Abstract concepts containing structural information, such as tangrams, are often used in cognitive psychology to explore spatial reasoning and visual cognition. Inspired by this, we propose a simple but effective fine-tuning method called two-stage fine-tuning contrastive language-image pretraining (TSF-CLIP). In stage I, the CLIP encoders are fine-tuned on an image-text matching task built on the tangram dataset, so that both encoders of CLIP capture structural prior knowledge. In stage II, to further improve accuracy, the linear head is aligned to the domain of the specific downstream task. The proposed TSF-CLIP method not only dynamically integrates structural prior knowledge and semantic information, but also avoids two shortcomings of large models, namely poor spatial logical reasoning ability and an excessive number of fine-tuning parameters, and enhances the adaptability of the model to different downstream tasks. Experimental results demonstrate that TSF-CLIP substantially boosts the performance of the target model and outperforms existing few-shot image classification approaches. Compared to the traditional CLIP, the average accuracy of TSF-CLIP improves by 16.1% across 10 image recognition datasets. The code and related datasets can be found at https://github.com/Patrickeroo/TSF-CLIP.
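The two stages described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). It assumes that stage I uses a CLIP-style symmetric contrastive loss over paired image and text embeddings, and that stage II fits a softmax linear head on frozen image features; the abstract does not spell out either detail. All function names, the temperature of 0.07, the learning rate, and the epoch count are hypothetical choices made for illustration, and plain Python lists stand in for encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one row of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def matching_loss(image_embs, text_embs, temperature=0.07):
    """Stage I (assumed objective): symmetric image-text contrastive loss.

    Row i of image_embs is treated as the match for row i of text_embs.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text uses rows of the similarity matrix, text-to-image uses columns.
    i2t = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    t2i = sum(cross_entropy([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def train_linear_head(features, labels, num_classes, lr=0.5, epochs=200):
    """Stage II (assumed setup): fit a softmax linear head on frozen features."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logits = [sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                      for c in range(num_classes)]
            m = max(logits)
            exps = [math.exp(z - m) for z in logits]
            total = sum(exps)
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit c: softmax prob minus one-hot.
                grad = exps[c] / total - (1.0 if c == y else 0.0)
                for j in range(dim):
                    W[c][j] -= lr * grad * x[j]
                b[c] -= lr * grad
    return W, b


def predict(W, b, x):
    """Return the index of the class with the highest linear score."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + bc
              for row, bc in zip(W, b)]
    return scores.index(max(scores))
```

In this sketch, `matching_loss` is low when each image embedding is closest to its own caption embedding and high when the pairs are shuffled, which is the signal a stage I fine-tuning step would minimize. `train_linear_head` then touches only the head's weights, which reflects the abstract's point about keeping the number of fine-tuned parameters small in stage II.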
Springer Science and Business Media LLC

Related Results

Electric field tuning characteristic of multiple optical parametric oscillator based on MgO:QPLN
The quasi-phase matching optical parametric oscillator tuning methods, i.e. grating period tuning, temperature tuning, pumping wavelength tuning, and angle tuning are more simple a...
Progressive Layer Activation CLIP for Few-Shot and Generalizable Cassava Disease Recognition
Cassava diseases such as Cassava Mosaic Disease (CMD), Cassava Brown Streak Disease (CBSD), and Cassava Bacterial Blight (CBB) pose serious threats to glob...
EMNet: A Novel Few-Shot Image Classification Model with Enhanced Self-Correlation Attention and Multi-Branch Joint Module
In this research, inspired by the principles of biological visual attention mechanisms and swarm intelligence found in nature, we present an Enhanced Self-Correlation Attention and...
VCP-CLIP+: Stabilizing and Optimizing VCP-CLIP with Minimal Architectural Changes
Zero-shot anomaly segmentation (ZSAS) has significantly advanced with the emergence of vision–language models such as CLIP. Among recent approaches for ZSAS, VCP-CLIP introduced vi...
Depression subtype classification from social media posts: few-shot prompting vs. fine-tuning of large language models
Background: Social media provides timely proxy signals of mental health, but reliable tweet-level classification of depression subtypes remains challenging due t...
ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization
Pre-trained vision-language (V-L) models such as CLIP have demonstrated impressive zero-shot performance in many downstream tasks. Since adopting contrastive video-text pairs method...
PC-CLIP: Probabilistic Calibration CLIP for Zero-Shot Semantic Segmentation
Zero-shot semantic segmentation aims to extend the ability of aligning pixels with pre-defined classes to novel unseen classes. Most previous CLIP-based methods use static prompts ...
