Vectorless RAG for Finance: How PageIndex makes AI audit-ready and reduces numeric hallucinations

Why “vectorless RAG” might be the missing piece for finance AI

TL;DR

  • Vectorless RAG (aka hierarchical, vision-native RAG) preserves document layout—headers, tables, footnotes—so LLMs can reason over structure instead of matching isolated text chunks.
  • VectifyAI’s PageIndex + Mafin 2.5 combine a tree-based index with vision-aware retrieval and live market feeds. VectifyAI reports 98.7% accuracy on FinanceBench; independent validation is recommended.
  • For regulated finance teams, the key benefits are traceability and fewer hallucinations on numeric/table questions. Pilot with a small corpus to measure accuracy, latency, and cost before wide rollout.

Glossary: quick definitions for busy leaders

  • RAG (Retrieval-Augmented Generation) — lets an LLM answer questions using documents retrieved from an index.
  • Vector DB / embeddings — search by numeric vector similarity of text chunks; great for fuzzy semantic lookup but loses layout context.
  • OCR — optical character recognition that extracts text from images or PDFs; brittle on messy scans.
  • Vision-native RAG — retrieval that works from page images (charts, tables, multi-column layouts), not just OCR text.
  • Vectorless RAG / PageIndex — a hierarchical tree index that preserves document structure so an LLM can reason over pages, tables and footnotes.

The problem: shredding a filing into vectors breaks meaning

Financial filings aren’t free-text essays. They are structured artifacts: nested headers, multi-column tables, row/column relationships, footnotes that change the meaning of a number. Flattening these into fixed-size text chunks for embedding is like shredding a spreadsheet into loose cells—search becomes easy, but context evaporates.

“Building a RAG pipeline that won’t hallucinate during a 10‑K audit is extremely hard—vector chunking often strips away crucial layout context.” — VectifyAI product lead (paraphrase)

When header-to-cell links are lost, plain retrieval hits can point to the wrong table or the wrong year, or misattribute a subtotal as a consolidated total. For regulated workflows—audits, compliance, investor reports—those mistakes are unacceptable.

What vectorless RAG (vision-native RAG) actually means

Vectorless RAG swaps flat embedding hits for a semantic document tree. PageIndex, the open-source framework behind Mafin 2.5, parses a filing into nodes—document → page → section → table → row → cell → footnote—and indexes those nodes with layout information and visual features. Retrieval is driven by reasoning over that tree and page images, not only by vector similarity of isolated text snippets.

“PageIndex constructs a semantic tree so an LLM can navigate a document like a human analyst.” — VectifyAI (paraphrase)
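To make the idea concrete, here is a minimal sketch of what a hierarchical document tree and a node-path citation could look like. The node types and field names are illustrative only, not PageIndex's actual schema:

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    node_id: str
    kind: str                # "document", "page", "section", "table", "row", "cell", "footnote"
    text: str = ""           # extracted text for this node, if any
    children: list[Node] = field(default_factory=list)

def node_path(root: Node, target_id: str,
              path: Optional[list[str]] = None) -> Optional[list[str]]:
    """Depth-first search returning the id path from the root down to target_id."""
    path = (path or []) + [root.node_id]
    if root.node_id == target_id:
        return path
    for child in root.children:
        found = node_path(child, target_id, path)
        if found:
            return found
    return None

# A tiny filing: document -> page -> table -> row -> cell
doc = Node("10k-2023", "document", children=[
    Node("p-12", "page", children=[
        Node("tbl-1", "table", children=[
            Node("row-3", "row", children=[
                Node("cell-7", "cell", "1,234")])])])])

print(node_path(doc, "cell-7"))  # ['10k-2023', 'p-12', 'tbl-1', 'row-3', 'cell-7']
```

The returned path is exactly the kind of auditable citation discussed below: every answer can point back through the tree to the cell that produced it.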

How it works — a simple walkthrough

Example query: “What was FY2023 revenue for Product X?”

  1. PageIndex narrows the corpus to relevant filings and pages using metadata (report type, date, company).
  2. Vision-native models score candidate pages by layout and visual cues (table proximity, header typography), not just OCR hits.
  3. Within a selected page, the tree index identifies candidate tables by header matching, then narrows to rows/cells where the header aligns with “Product X” and the column aligns with FY2023.
  4. Linked footnotes or adjacent text are retrieved and presented alongside the cell so the LLM can interpret any parenthetical adjustments or restatements.
  5. The LLM generates an answer and cites the node path (document ID → page ID → table ID → row ID → cell ID), giving an auditable retrieval path.
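The narrowing in steps 3–5 can be sketched as a lookup over pre-built table nodes. The dictionary layout and field names here are hypothetical, chosen only to show the header-alignment logic and the node-path citation:

```python
# Hypothetical table nodes after steps 1-2 have narrowed the corpus to one page.
tables = [
    {
        "path": ["10k-2023", "p-12", "tbl-1"],
        "columns": ["FY2022", "FY2023"],          # column headers
        "rows": [
            {"header": "Product X", "cells": ["1,100", "1,234"],
             "path": ["10k-2023", "p-12", "tbl-1", "row-3"],
             "footnotes": ["(1) Restated for segment change."]},
        ],
    },
]

def answer_table_query(tables, product, fiscal_year):
    for table in tables:
        if fiscal_year not in table["columns"]:
            continue                                # step 3: column header must match the year
        col = table["columns"].index(fiscal_year)
        for row in table["rows"]:
            if row["header"] == product:            # step 3: row header must match the product
                return {
                    "value": row["cells"][col],
                    "footnotes": row.get("footnotes", []),      # step 4: linked context
                    "citation": row["path"] + [f"cell-{col}"],  # step 5: auditable node path
                }
    return None

print(answer_table_query(tables, "Product X", "FY2023"))
```

Note how the footnote travels with the cell, so a restatement can't be silently dropped from the answer.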

Evidence and claims: what’s impressive, and what to verify

VectifyAI reports that Mafin 2.5, combining PageIndex with multimodal reasoning and market feeds, attains 98.7% accuracy on FinanceBench. By contrast, general-purpose systems tested on the same benchmark scored lower (reported: GPT-4o ~31%, Perplexity ~45%). These numbers highlight how layout-aware, domain-specific systems can outperform “one-size-fits-all” LLM setups on table- and layout-heavy tasks.

That said, benchmarks can be noisy. Leaders should validate:

  • What “accuracy” measures—retrieval accuracy, end-to-end QA correctness, numeric precision or citation recall?
  • Dataset diversity—does the benchmark include scanned PDFs, foreign-language filings, and adversarial layouts?
  • Reproducibility—can your team reproduce results on your own corpus and queries?

Where this wins — and where it doesn’t (yet)

  • Wins: numeric/table reasoning, auditability (node-level citations), legal/compliance use-cases, and any workflow where positional context matters.
  • Limits: fuzzy semantic discovery across unstructured communications (emails, research notes) still benefits from vector search. Poor scan quality, extreme layout diversity, and multilingual reports can require extra tooling.
  • Hybrid reality: most production deployments will combine both—PageIndex for filings and tables, vector search retained for broader semantic discovery.

Operational considerations for leaders

Before swapping an existing stack, plan for:

  • Indexing throughput & storage: hierarchical node metadata plus images increases storage relative to embeddings-only pipelines. Expect larger object stores for page images and metadata DBs for node graphs.
  • Latency & compute: vision models add CPU/GPU cost per page. Measure query latency (p95) against your service-level needs and budget for GPU-backed inference if you need low-latency vision processing.
  • Cost metrics to track: cost per page indexed, cost per query, queries per second, and percentage of queries requiring GPU inference.
  • OCR fallback: vision-native retrieval reduces dependence on brittle OCR, but OCR remains useful for searchable text exports and downstream analytics.
  • Scans & multilingual handling: preflight sample a corpus for poor scans and foreign-language filings. Add extra preprocessing or human QC where needed.
  • Governance & compliance: ensure node-level audit logs, redaction capabilities for PII, retention policies, and change/version tracking for frequently updated filings (e.g., restatements, amended 8-Ks).
  • Interoperability: check how PageIndex integrates with your vector DB and LLM orchestration—hybrid connectors are likely the practical path forward.

Benchmarks and a reproducibility checklist

When running a pilot, capture the variables that matter:

  • Dataset version and sample size (e.g., 100 representative 10‑Ks and 50 earnings transcripts).
  • Query set composition: retrieval-only, numeric QA, table interpretation, footnote linkage.
  • OCR settings and preprocessing steps (if any).
  • Vision model and LLM versions, and hardware used (CPU/GPU).
  • Metrics: end-to-end accuracy, citation coverage (percent of answers with node-level citation), mean time to answer, cost per page indexed.
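Citation coverage, one of the metrics above, is straightforward to compute from logged answers. A sketch, assuming each answer record carries an optional `citation` field (a hypothetical field name):

```python
def citation_coverage(answers):
    """Percent of answers that carry a node-level citation."""
    if not answers:
        return 0.0
    cited = sum(1 for a in answers if a.get("citation"))
    return 100.0 * cited / len(answers)

answers = [
    {"text": "Revenue was $1,234m",
     "citation": ["10k-2023", "p-12", "tbl-1", "row-3", "cell-1"]},
    {"text": "Margins improved"},  # no node path -> counts against coverage
]
print(citation_coverage(answers))  # 50.0
```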

Pilot checklist — an actionable 6-step plan

  1. Select 100 representative filings (mix of 10‑Ks, 10‑Qs, earnings transcripts, scanned PDFs).
  2. Run your current vector RAG pipeline and PageIndex-based pipeline on the same query set.
  3. Measure: accuracy (end-to-end), percent of answers with node-level citations, mean/95th percentile latency, cost per page indexed.
  4. Audit 50 random answers for numeric correctness and traceability to the cited node path.
  5. Stress-test with poor scans and foreign-language reports; measure degradation and necessary preprocessing.
  6. Decide: full rollout, hybrid architecture, or keep existing stack based on accuracy gains vs. operational cost.

Security, governance and auditability

Traceability is a major selling point: PageIndex returns node paths that make it possible to show exactly which page, table and cell informed an answer. For regulated environments, add:

  • Immutable audit logs recording query, node citations, model versions and timestamps.
  • Redaction and PII handling workflows for filings with sensitive data.
  • Versioning strategy for re-indexing when filings are amended.
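One way to make such audit logs tamper-evident is to hash-chain each record to its predecessor, so any later edit to an earlier record breaks the chain. The field names below are illustrative, not a required schema:

```python
import hashlib
import json
import time

def append_audit_record(log, query, citations, model_version):
    """Append a tamper-evident audit record: each entry embeds the hash
    of the previous entry, making retroactive edits detectable."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {
        "timestamp": time.time(),
        "query": query,
        "citations": citations,        # node paths returned with the answer
        "model_version": model_version,
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    log.append(record)
    return record

log = []
append_audit_record(log, "FY2023 revenue for Product X?",
                    [["10k-2023", "p-12", "tbl-1", "row-3", "cell-1"]], "mafin-2.5")
append_audit_record(log, "Gross margin trend?", [], "mafin-2.5")
```

In production the log would live in append-only storage (e.g., WORM buckets) rather than a Python list, but the chaining idea is the same.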

Final pragmatic view

PageIndex and Mafin 2.5 show how preserving layout and reasoning over document structure can materially improve financial AI—particularly for table-heavy, audit-sensitive tasks. If independent pilots confirm the reported FinanceBench gains, the approach could reduce manual review and regulatory risk.

That said, vectorless RAG is not a universal replacement for vector search. Hybrid architectures that combine tree-based, vision-native retrieval for filings with embeddings-based search for unstructured communications will likely offer the best balance of precision and breadth.

Quick next steps for leaders

  • Clone the PageIndex GitHub repo and scan the README and architecture notes.
  • Run the 6-step pilot above on a small, representative corpus of your filings.
  • Measure accuracy, citation coverage and cost; audit random responses for numeric correctness.
  • Decide on a hybrid deployment pattern if you need both layout-aware precision and semantic discovery.

Vectorless RAG reframes a simple truth: if a number’s meaning depends on where it sits on a page, your retrieval layer must preserve that place. For finance teams who live and die by positional context, that’s not a neat trick—it’s a required capability for trustworthy AI automation.

Want to test this internally? Start a pilot, collect the metrics above, and share results—PageIndex is open-source, and real-world feedback will shape the next wave of finance-grade AI agents.