HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models
High-resolution Vision-Language Models (VLMs) are widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate an excessive number of visual tokens because they must encode multiple partitions of a high-resolution input image. Processing so many visual tokens poses significant computational challenges, particularly for resource-constrained commodity GPUs. To address this challenge, we propose High-Resolution Early Dropping (HiRED), a plug-and-play token-dropping method designed to operate within a fixed token budget. HiRED uses the attention of the CLS token in the vision transformer (ViT) to assess the visual content of the image partitions and allocate an optimal token budget to each partition accordingly. The most informative visual tokens from each partition, up to its allocated budget, are then selected and passed to the subsequent Large Language Model (LLM). We show that HiRED achieves higher accuracy and better performance than existing token-dropping methods. Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7x increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA Tesla P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving the throughput and latency benefits.
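The two steps described in the abstract, budget allocation per partition followed by token selection within each partition, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `allocate_budgets` and `select_tokens`, the proportional allocation rule, and the toy attention values are all assumptions made for illustration. The abstract only states that CLS attention guides both steps, so the exact scoring and allocation rules in HiRED may differ.

```python
from typing import List


def allocate_budgets(cls_attention: List[List[float]], total_budget: int) -> List[int]:
    """Split a fixed token budget across image partitions.

    Assumption (not from the paper): each partition receives a share
    proportional to the sum of CLS attention over its tokens.
    """
    sizes = [len(scores) for scores in cls_attention]
    total_budget = min(total_budget, sum(sizes))
    if total_budget <= 0:
        return [0] * len(sizes)

    mass = [sum(scores) for scores in cls_attention]
    if sum(mass) <= 0:
        # Fall back to splitting by partition size when attention is all zero.
        mass = [float(size) for size in sizes]
    total_mass = sum(mass)

    # Floor of the proportional share, capped by the partition size.
    budgets = [min(size, int(total_budget * m / total_mass))
               for m, size in zip(mass, sizes)]

    # Hand out leftover tokens one at a time, highest-attention partitions first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(len(mass)), key=lambda i: mass[i], reverse=True)
    while leftover > 0:
        for i in order:
            if leftover == 0:
                break
            if budgets[i] < sizes[i]:
                budgets[i] += 1
                leftover -= 1
    return budgets


def select_tokens(cls_attention: List[List[float]], budgets: List[int]) -> List[List[int]]:
    """Keep the highest-scoring token indices in each partition, in original order."""
    kept = []
    for scores, k in zip(cls_attention, budgets):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        kept.append(sorted(ranked[:k]))
    return kept


# Toy example: 3 partitions of 4 visual tokens each, with a 50% budget (6 of 12 tokens).
attention = [
    [0.05, 0.30, 0.10, 0.15],  # content-rich partition
    [0.02, 0.03, 0.01, 0.04],  # mostly background
    [0.10, 0.05, 0.10, 0.05],
]
budgets = allocate_budgets(attention, total_budget=6)
kept_indices = select_tokens(attention, budgets)
print(budgets, kept_indices)
```

In this sketch, partitions that attract little CLS attention may receive no tokens at all, which is one possible reading of allocating the budget according to visual content. Only the kept indices would be used to gather visual tokens before they are passed to the LLM.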
Association for the Advancement of Artificial Intelligence (AAAI)

