
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

Multimodal large language models (MLLMs) demand considerable computation at inference due to their extensive parameters and the additional input tokens needed to represent visual information. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module that accelerates MLLM inference. Our approach is inspired by two phenomena we have observed: (1) the attention-sink phenomenon, prevalent in LLMs, also persists in MLLMs, meaning that the initial tokens and the most recent tokens receive the majority of attention while the middle vision tokens garner minimal attention in deep layers; (2) information migration, whereby visual information is transferred to subsequent text tokens within the first few layers of an MLLM. Based on these findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs, so we strategically withdraw them at a certain layer, allowing only text tokens to engage in subsequent layers. To pinpoint the ideal layer for withdrawal, we analyze a small set of tiny datasets and choose the first layer that meets a Kullback-Leibler divergence criterion. Our VTW approach cuts computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
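To make the withdrawal step concrete, here is a minimal PyTorch sketch of the core idea: at a chosen layer, the hidden states of the vision tokens are simply dropped, so only text tokens flow through the remaining layers. The toy block, the contiguous vision_slice, the withdraw_layer argument, and the token layout are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for one transformer decoder layer (attention omitted for brevity)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, h):
        return h + self.ff(self.norm(h))

def forward_with_vtw(blocks, hidden, vision_slice, withdraw_layer):
    """Run `blocks` over `hidden`, dropping vision tokens at `withdraw_layer`.

    hidden:         [seq_len, dim] token hidden states
    vision_slice:   slice covering the contiguous vision-token positions (assumed layout)
    withdraw_layer: index of the layer at which vision tokens are removed
    """
    for i, block in enumerate(blocks):
        if i == withdraw_layer:
            # Withdraw vision tokens: only text tokens enter the deeper layers,
            # shrinking the sequence length (and hence attention cost) from here on.
            keep = torch.ones(hidden.size(0), dtype=torch.bool)
            keep[vision_slice] = False
            hidden = hidden[keep]
        hidden = block(hidden)
    return hidden

blocks = nn.ModuleList(TinyBlock(64) for _ in range(8))
h = torch.randn(16 + 576, 64)                      # 16 text tokens + 576 vision tokens (toy sizes)
out = forward_with_vtw(blocks, h, slice(16, 16 + 576), withdraw_layer=2)
print(out.shape)                                   # torch.Size([16, 64])

Because the deep layers see a far shorter sequence, the per-layer attention and feed-forward cost drops accordingly, which is where the reported savings come from.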
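The withdrawal layer itself is found empirically. Below is a hedged sketch of that selection step, under the assumption that it compares the model's output distribution with and without withdrawal at each candidate layer on a few probe samples, taking the first layer whose average KL divergence falls below a tolerance; run_full, run_with_vtw, and the default tol are hypothetical stand-ins, not the paper's exact procedure.

import torch
import torch.nn.functional as F

def select_withdraw_layer(samples, num_layers, run_full, run_with_vtw, tol=1e-3):
    """Return the first layer at which withdrawing vision tokens
    barely perturbs the model's next-token distribution."""
    for k in range(num_layers):
        kls = []
        for x in samples:
            p = run_full(x)          # [vocab] logits from the unmodified model
            q = run_with_vtw(x, k)   # [vocab] logits with vision tokens withdrawn at layer k
            # KL(p || q) over the output vocabulary distribution
            kls.append(F.kl_div(F.log_softmax(q, dim=-1),
                                F.softmax(p, dim=-1),
                                reduction="sum"))
        if torch.stack(kls).mean() < tol:
            return k                 # first layer meeting the criterion
    return num_layers - 1            # fall back to the last layer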

Related Results

The Relationship Between Eating Pattern Behavior and the Incidence of Childhood Obesity (Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas)
Evolutionary Grammatical Inference
Grammatical Inference (also known as grammar induction) is the problem of learning a grammar for a language from a set of examples. In a broad sense, some data is presented to the ...
Prediction of screw withdrawal resistance for plywood laminated panels and sandwich panels
Sandwich panels are favorable materials for structural or non-structural components due to durability, lightness, and longevity in service life. This study aimed to predict screw w...
Regulation of Blockchain Token Sales in the United States
Abstract This chapter provides an overview of how US securities regulation applies to the sale of cryptographic tokens using a distributed ledger, so-called initial ...
AFR-BERT: Attention-based mechanism feature relevance fusion multimodal sentiment analysis model
Multimodal sentiment analysis is an essential task in natural language processing which refers to the fact that machines can analyze and recognize emotions through logical reasonin...
Dual Token Blockchains
It is standard for blockchain platforms to issue native tokens, currencies, that users must own to operate within the platform. Some blockchains however decided to issue two tokens...
