
Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they remain limited in real-world applications that require complex reasoning over a large number of images. Existing benchmarks for multi-image question answering are narrow in scope: each question is paired with at most 30 images, which does not capture the demands of large-scale retrieval tasks encountered in real-world use. To narrow these gaps, we introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. We also propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that leverages a suite of multimodal vision encoders, each optimized for specific strengths, and a dedicated question-document relevance module. V-RAG sets a new standard, with 9% and 11% improvements in Recall@1 on the challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, compared to the previous best baseline models. Moreover, integrating V-RAG with LMMs enables them to operate efficiently across thousands of images, yielding significant improvements on our DocHaystack and InfoHaystack benchmarks. Our code and datasets are available at https://github.com/Vision-CAIR/dochaystacks.
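As a rough illustration of the evaluation setting described in the abstract, the sketch below combines per-document similarity scores from several encoders and computes Recall@k over the resulting rankings. It is only a sketch under stated assumptions: the abstract does not say how V-RAG fuses its encoders, so plain score averaging is an assumption made here, the question-document relevance module is not modeled, and the function names (`average_scores`, `top_k`, `recall_at_k`) and toy scores are hypothetical.

```python
from typing import List, Sequence


def average_scores(per_encoder_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average each document's similarity score across encoders.

    Assumption: simple averaging stands in for whatever fusion the
    actual framework uses; the abstract does not specify it.
    """
    n_docs = len(per_encoder_scores[0])
    if any(len(scores) != n_docs for scores in per_encoder_scores):
        raise ValueError("every encoder must score the same number of documents")
    n_encoders = len(per_encoder_scores)
    return [
        sum(scores[i] for scores in per_encoder_scores) / n_encoders
        for i in range(n_docs)
    ]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring documents, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def recall_at_k(ranked: Sequence[Sequence[int]], gold: Sequence[int], k: int) -> float:
    """Fraction of questions whose gold document appears in the top-k results."""
    hits = sum(1 for docs, target in zip(ranked, gold) if target in docs[:k])
    return hits / len(gold)


# Toy data (hypothetical): two encoders score four documents for two questions.
question_1 = [[0.2, 0.9, 0.1, 0.3], [0.4, 0.7, 0.2, 0.1]]
question_2 = [[0.8, 0.1, 0.3, 0.2], [0.1, 0.2, 0.9, 0.3]]
gold_documents = [1, 0]  # index of the correct document for each question

rankings = [top_k(average_scores(q), k=3) for q in (question_1, question_2)]
print("Recall@1:", recall_at_k(rankings, gold_documents, k=1))
print("Recall@3:", recall_at_k(rankings, gold_documents, k=3))
```

In this toy case the first question retrieves its gold document at rank 1 and the second only at rank 2, so Recall@1 is 0.5 and Recall@3 is 1.0. On a 1000-document haystack the same metric is computed per question over the full pile.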

Qeios Ltd

Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, Mohamed Elhoseiny

2024



