HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models
High-resolution Vision-Language Models (VLMs) are widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate an excessive number of visual tokens due to the need to encode multiple partitions of a high-resolution image input. Processing such a large number of visual tokens poses significant computational challenges, particularly for resource-constrained commodity GPUs. To address this challenge, we propose High-Resolution Early Dropping (HiRED), a plug-and-play token-dropping method designed to operate within a fixed token budget. HiRED leverages the attention of the CLS token in the vision transformer (ViT) to assess the visual content of the image partitions and allocate an optimal token budget to each partition accordingly. The most informative visual tokens from each partition, within the allocated budget, are then selected and passed to the subsequent Large Language Model (LLM). We show that HiRED achieves superior accuracy and performance compared to existing token-dropping methods. Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7x increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving throughput and latency benefits.
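The two stages described in the abstract (budget allocation across partitions, then top-token selection within each partition) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each partition is scored by its total CLS-attention mass and that the budget is split proportionally to those scores, with the function name `hired_select` and all array shapes being hypothetical.

```python
import numpy as np

def hired_select(cls_attn, total_budget):
    """Sketch of HiRED-style budget allocation and token selection.

    cls_attn: list of 1-D arrays, one per image partition; each array holds
        the ViT CLS-token attention weights over that partition's visual tokens.
    total_budget: total number of visual tokens to keep across all partitions.
    Returns a list of index arrays: the kept token indices per partition.
    """
    # Score each partition by its total CLS attention mass (an assumption;
    # the paper's exact scoring rule may differ).
    scores = np.array([a.sum() for a in cls_attn], dtype=float)
    weights = scores / scores.sum()

    # Allocate the budget proportionally, rounding down, then hand the
    # remainder to the highest-weight partitions.
    budgets = np.floor(weights * total_budget).astype(int)
    remainder = total_budget - budgets.sum()
    for i in np.argsort(-weights)[:remainder]:
        budgets[i] += 1

    # Within each partition, keep the tokens with the highest CLS attention,
    # preserving their original (spatial) order.
    kept = []
    for attn, budget in zip(cls_attn, budgets):
        budget = min(budget, len(attn))
        top = np.argsort(-attn)[:budget]
        kept.append(np.sort(top))
    return kept
```

A usage example: with two partitions and a budget of 4 tokens, the partition whose CLS attention is concentrated on one token still keeps that token, while the rest of the budget goes to the other tokens with the highest attention.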
Association for the Advancement of Artificial Intelligence (AAAI)

