
Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy

ACL 2026 Findings
Juyeon Kim, Geon Lee, Dongwon Choi, Taeuk Kim, Kijung Shin

One-Line Summary

HEAVEN is a two-stage hybrid retrieval framework that introduces Visually-Summarized Pages and query token filtering to achieve 99.87% of multi-vector Recall@1 while reducing per-query computation by 99.82%, enabling scalable retrieval over visually rich documents.

Figure 1. Efficiency-accuracy trade-off comparison showing HEAVEN's position relative to single-vector and multi-vector baselines.

Background & Motivation

Visually rich documents—PDFs containing tables, charts, figures, and complex layouts—are central to legal discovery, scientific search, and enterprise knowledge management. Traditional text-based retrieval fails on such content because crucial information is embedded in visual elements rather than extractable text. Large Vision-Language Models (LVLMs) have enabled a new paradigm of direct image-based page encoding that bypasses OCR entirely, but existing approaches present a stark trade-off: single-vector methods, which encode each page as a single embedding, are efficient but less accurate, while multi-vector methods, which match every query token against every page token, are accurate but computationally expensive.

A key observation motivates HEAVEN: the performance gap between the two paradigms shrinks dramatically when retrieving larger candidate sets. On ViMDoc, the gap is 22.5% at Recall@1 but narrows to only 0.63% at Recall@200. This suggests that single-vector methods can reliably identify a broad set of candidates, which a targeted multi-vector reranker can then refine.

Core Insight: Single-vector retrieval already captures most relevant pages at moderate recall depths. By combining efficient single-vector candidate generation with focused multi-vector reranking on only key query tokens, we can achieve near-optimal accuracy at a fraction of the cost.

Proposed Method: HEAVEN Framework

HEAVEN (Hybrid-vector retrieval for Efficient and Accurate Visual multi-documENt) is a two-stage framework with two key innovations: Visually-Summarized Pages (VS-Pages) that reduce index size while preserving visual information, and POS-based query token filtering that eliminates redundant multi-vector computation.

Figure 2. Overview of the HEAVEN pipeline: Stage 1 retrieves candidates via single-vector matching over VS-Pages, then Stage 2 reranks using multi-vector scoring with filtered query tokens.
Stage 1: Candidate Retrieval via VS-Pages
VS-Page Construction: DocLayout-YOLO extracts title regions from each document page. Title layouts are grouped (reduction factor r = min(15, |D_k|)) and assembled vertically into composite VS-Pages, each summarizing the visual content of multiple source pages. This reduces the index size while preserving informative visual elements.
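The grouping-and-stacking step above can be sketched as follows. This is a minimal illustration, assuming the title-region crops have already been extracted (e.g., by a layout detector such as DocLayout-YOLO) as NumPy image arrays; the function name and exact padding behavior are illustrative, not the paper's implementation.

```python
import numpy as np

def build_vs_pages(title_crops, max_factor=15):
    """Group per-page title-region crops into composite VS-Pages.

    title_crops: list of HxWx3 uint8 arrays, one per source page of a
    document. The reduction factor is r = min(max_factor, number of pages),
    so each VS-Page summarizes up to r source pages.
    """
    r = min(max_factor, len(title_crops))
    vs_pages, mapping = [], []  # mapping: VS-Page index -> source-page indices
    for start in range(0, len(title_crops), r):
        group = title_crops[start:start + r]
        width = max(c.shape[1] for c in group)
        # Pad crops to a common width, then assemble vertically.
        padded = [np.pad(c, ((0, 0), (0, width - c.shape[1]), (0, 0)))
                  for c in group]
        vs_pages.append(np.concatenate(padded, axis=0))
        mapping.append(list(range(start, start + len(group))))
    return vs_pages, mapping
```

With r = 15, a 20-page document yields two VS-Pages (15 + 5 source pages), shrinking the index roughly 15-fold for long documents.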

Candidate Scoring: Single-vector similarity S_SV(q, VS) = ⟨E_q, E_VS⟩ is computed over all VS-Pages. The top p1 × 100% candidates (default p1 = 0.5) are retained, expanded to their constituent pages, and refined using a combined score: S(q, P) = α · S_SV(q, Γ^{-1}(P)) + (1-α) · S_SV(q, P), with α = 0.1. The top K = 200 pages proceed to Stage 2.
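A compact sketch of this candidate-scoring step, assuming precomputed single-vector embeddings; variable names (`mapping`, `stage1_candidates`) are illustrative, and `mapping[v]` plays the role of Γ by listing the source pages summarized by VS-Page v.

```python
import numpy as np

def stage1_candidates(q_emb, vs_embs, page_embs, mapping,
                      p1=0.5, alpha=0.1, k=200):
    """Stage 1 sketch: score VS-Pages, expand the top fraction to their
    constituent pages, and refine with the combined score
    S(q,P) = alpha * S_SV(q, VS containing P) + (1-alpha) * S_SV(q, P).

    q_emb: (d,) query embedding; vs_embs: (V, d) VS-Page embeddings;
    page_embs: (P, d) page embeddings; mapping[v] lists page indices.
    """
    vs_scores = vs_embs @ q_emb                    # S_SV(q, VS) for all VS-Pages
    keep = max(1, int(np.ceil(p1 * len(vs_embs))))
    top_vs = np.argsort(-vs_scores)[:keep]         # top p1 fraction
    combined = {}
    for v in top_vs:
        for p in mapping[v]:
            combined[p] = (alpha * vs_scores[v]
                           + (1 - alpha) * (page_embs[p] @ q_emb))
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:k]                              # top K pages go to Stage 2
```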
Stage 2: Multi-Vector Reranking with Token Filtering
Key Token Filtering: POS tagging (via NLTK) identifies linguistically important tokens—nouns and named entities—which constitute approximately 30% of query tokens. Only these key tokens participate in the expensive MaxSim computation, reducing FLOPs by ~70%.
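The filtering idea can be illustrated with a small function over (token, POS-tag) pairs, e.g. as produced by `nltk.pos_tag(nltk.word_tokenize(query))` in the Penn Treebank tagset. The exact tag set kept by the paper may differ; this sketch keeps the noun tags, which cover common and proper nouns (named entities).

```python
def filter_key_tokens(tagged_tokens):
    """Keep only 'key' query tokens: nouns and proper nouns.

    tagged_tokens: list of (token, POS-tag) pairs in the Penn Treebank
    tagset. Only these tokens enter the MaxSim computation in Stage 2.
    """
    return [tok for tok, tag in tagged_tokens if tag.startswith("NN")]
```

On a typical question, this retains roughly the content-bearing third of the tokens, which is what drives the ~70% reduction in MaxSim FLOPs.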

Reranking: Filtered multi-vector scoring S_MV(q_key, P) = Σ_i max_j ⟨E_{q_key}^{(i)}, E_P^{(j)}⟩ reranks the K candidates. A final refinement step combines scores: S(q, P) = β · S_SV(q, P) + (1-β) · S_MV(q, P), with β = 0.3 by default, using the top p2 = 25% candidates scored with all query tokens for the final output.
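A minimal sketch of the two reranking passes, assuming precomputed token embeddings; function and variable names are illustrative. `maxsim` is the standard late-interaction score (sum over query tokens of the max similarity against any page token), applied first with only the key tokens and then, for the surviving top fraction, with all tokens blended with the single-vector score.

```python
import numpy as np

def maxsim(q_tok_embs, page_tok_embs):
    """ColBERT-style MaxSim: sum over query tokens of the max dot product
    against any page token."""
    sims = q_tok_embs @ page_tok_embs.T        # (Q, T) token-pair similarities
    return sims.max(axis=1).sum()

def stage2_rerank(q_key_embs, q_all_embs, cand_pages, sv_scores,
                  beta=0.3, p2=0.25):
    """Stage 2 sketch. cand_pages: page id -> (T, d) token embeddings;
    sv_scores: page id -> S_SV(q, P) from Stage 1."""
    ids = list(cand_pages)
    key_scores = {p: maxsim(q_key_embs, cand_pages[p]) for p in ids}
    ids.sort(key=key_scores.get, reverse=True)     # rerank with key tokens only
    top = ids[:max(1, int(np.ceil(p2 * len(ids))))]
    # Final refinement on the top p2 fraction, with all query tokens:
    # S(q,P) = beta * S_SV(q,P) + (1-beta) * S_MV(q,P)
    final = {p: beta * sv_scores[p] + (1 - beta) * maxsim(q_all_embs, cand_pages[p])
             for p in top}
    return sorted(final, key=final.get, reverse=True)
```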

ViMDoc Benchmark

The paper introduces ViMDoc, the first benchmark designed for visually rich, multi-document, long-document retrieval. Existing VDR benchmarks either restrict evaluation to single documents or use short documents, failing to capture the realistic challenge of retrieving across large document collections.

Experimental Results

HEAVEN uses DSE for Stage 1 single-vector retrieval and ColQwen2.5 for Stage 2 multi-vector reranking. Results are evaluated on four benchmarks using page-level Recall@{1,3} and per-query FLOPs.
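For reference, page-level Recall@k as used in the tables below can be computed as a simple set intersection; the function name is illustrative.

```python
def recall_at_k(ranked_pages, gold_pages, k):
    """Page-level Recall@k: fraction of gold (relevant) pages that appear
    in the top-k retrieved list."""
    top = set(ranked_pages[:k])
    return sum(1 for g in gold_pages if g in top) / len(gold_pages)
```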

Main Results (vs. ColQwen2.5 multi-vector baseline)

| Dataset | Method | Recall@1 | Recall@3 | FLOPs (B) |
|---|---|---|---|---|
| ViMDoc | DSE (single-vec) | 58.03 | 77.08 | 0.235 |
| | ColQwen2.5 (multi-vec) | 71.13 | 86.39 | 407.320 |
| | HEAVEN | 71.05 | 86.41 | 0.486 |
| OpenDocVQA | DSE (single-vec) | 59.38 | 75.82 | 0.247 |
| | ColQwen2.5 (multi-vec) | 72.63 | 86.38 | 482.049 |
| | HEAVEN | 71.56 | 84.53 | 0.541 |
| ViDoSeek | DSE (single-vec) | 69.53 | 87.13 | 0.017 |
| | ColQwen2.5 (multi-vec) | 75.57 | 91.94 | 41.514 |
| | HEAVEN | 75.04 | 91.33 | 0.623 |
| M3DocVQA | DSE (single-vec) | 55.14 | 71.30 | 0.126 |
| | ColQwen2.5 (multi-vec) | 57.99 | 78.73 | 288.507 |
| | HEAVEN | 59.31 | 78.66 | 0.545 |

Efficiency Analysis (ViMDoc)

| Method | Latency (sec/query) | FLOPs (B) |
|---|---|---|
| DSE (single-vec) | 0.115 | 0.235 |
| ColQwen2.5 (multi-vec) | 2006.361 | 407.320 |
| HEAVEN | 2.412 | 0.486 |

Ablation Study Highlights

Why It Matters

As enterprises manage millions of visually rich PDF documents for legal discovery, scientific search, and knowledge management, the computational cost of multi-vector retrieval (over 2,000 seconds per query) makes it impractical at production scale. HEAVEN solves this by delivering equivalent accuracy in just 2.4 seconds per query—an 832x speedup. The framework is modular: its VS-Page construction and query token filtering techniques are model-agnostic and can be applied on top of any single-vector/multi-vector model pair. Additionally, the introduced ViMDoc benchmark fills a critical gap by enabling realistic evaluation of retrieval systems across multiple long, visually complex documents—a setting that prior benchmarks did not address.
