Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
Multimodal large language models (MLLMs) demand considerable computation for inference because of their extensive parameters and the additional input tokens needed to represent visual information. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module that speeds up MLLM inference. Our approach is inspired by two phenomena we have observed: (1) the attention sink phenomenon that is prevalent in LLMs also persists in MLLMs, meaning that initial tokens and nearest tokens receive the majority of attention, while middle vision tokens garner minimal attention in deep layers; (2) information migration, in which visual information is transferred to subsequent text tokens within the first few layers of MLLMs. Based on these findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs. Thus, we withdraw them at a certain layer, so that only text tokens engage in subsequent layers. To pinpoint the ideal layer for VTW, we first analyze a limited set of tiny datasets and choose the first layer that meets the Kullback-Leibler divergence criterion. Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
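The two mechanical steps described in the abstract, dropping vision tokens at a chosen layer and picking that layer with a Kullback-Leibler divergence criterion, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state the exact form of the criterion, so the sketch assumes it means "the mean KL divergence between the full model's output distribution and the output distribution with withdrawal at layer K falls below a threshold". All function names, the threshold value, and the toy distributions are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def withdraw_vision_tokens(
    hidden: List[List[float]], is_vision: Sequence[bool]
) -> List[List[float]]:
    """Keep only the hidden states of non-vision (text) tokens.

    In a real MLLM this would be applied once, at layer K, so that all
    later layers process a shorter sequence.
    """
    return [h for h, vision in zip(hidden, is_vision) if not vision]


def select_withdrawal_layer(
    full_probs: List[List[float]],
    probs_with_withdrawal_at: Callable[[int], List[List[float]]],
    num_layers: int,
    threshold: float,
) -> int:
    """Return the first layer K whose mean KL to the full model is below threshold.

    ASSUMPTION: the thresholded mean-KL rule is a guess at the criterion;
    the abstract only says that a KL divergence criterion is used.
    """
    for k in range(1, num_layers + 1):
        candidate = probs_with_withdrawal_at(k)
        kls = [kl_divergence(p, q) for p, q in zip(full_probs, candidate)]
        if sum(kls) / len(kls) < threshold:
            return k
    return num_layers  # fall back to never withdrawing early


# Toy stand-in for a small calibration set: one next-token distribution
# produced by the full model (hypothetical numbers).
full_probs = [[0.7, 0.2, 0.1]]


def toy_probs_after_withdrawal(k: int) -> List[List[float]]:
    """Hypothetical model: withdrawing later (larger K) perturbs the output less."""
    mix = 1.0 / (2 ** k)
    uniform = 1.0 / 3.0
    return [[(1 - mix) * p + mix * uniform for p in dist] for dist in full_probs]


if __name__ == "__main__":
    layer = select_withdrawal_layer(
        full_probs, toy_probs_after_withdrawal, num_layers=8, threshold=0.01
    )
    print("selected withdrawal layer:", layer)
```

In an actual model, `probs_with_withdrawal_at(k)` would run a forward pass in which `withdraw_vision_tokens` is applied to the hidden states after layer `k`, and the resulting next-token distributions would be compared with those of the unmodified model on a small calibration set.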
Association for the Advancement of Artificial Intelligence (AAAI)

