
Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they remain limited in real-world applications that require complex reasoning over a large number of images. Existing benchmarks for multi-image question answering are limited in scope: each question is paired with at most 30 images, which does not capture the demands of the large-scale retrieval tasks encountered in real-world use. To address these gaps, we introduce two document haystack benchmarks, DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. We also propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that combines a suite of multimodal vision encoders, each optimized for specific strengths, with a dedicated question-document relevance module. V-RAG sets a new standard, improving Recall@1 by 9% and 11% on the challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, compared to the previous best baseline models. Integrating V-RAG with LMMs further enables them to operate efficiently across thousands of images, yielding significant improvements on our DocHaystack and InfoHaystack benchmarks. Our code and datasets are available at https://github.com/Vision-CAIR/dochaystacks.
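For readers unfamiliar with the Recall@1 metric reported above, the following is a minimal sketch of how Recall@k is commonly computed for retrieval. It is not taken from the authors' code: the function name `recall_at_k`, the document identifiers, and the assumption that each question has exactly one gold document are all illustrative.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    gold_ids: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of questions whose gold document is among the top-k results.

    Assumption (not from the paper): one gold document per question.
    ranked_ids[i] is the retriever's ranking for question i, best first.
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("need one ranking per gold document id")
    if not gold_ids:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


# Hypothetical rankings for three questions over a small document pile.
rankings = [["d3", "d1"], ["d2", "d5"], ["d9", "d4"]]
gold = ["d3", "d5", "d4"]

r1 = recall_at_k(rankings, gold, k=1)  # only the first question hits at rank 1
r2 = recall_at_k(rankings, gold, k=2)  # every gold document is in the top 2
```

With k = 1, a question counts as correct only when the single top-ranked document is the gold one, which is why the metric becomes harder as the pile grows to 1000 documents.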

