PDF-to-JSON in 2026: How to Choose Open-Source Schema Extraction or Document Parsing

Structured PDF-to-JSON: A practical guide to open-source extraction in 2026

Your contracts, invoices and research notes live in PDFs and slide decks. The engineering choice that will make or break a project isn’t “which model?” so much as “which problem are we solving?” Two different technical needs sit under the shorthand “PDF to JSON, ” and picking the wrong one wastes time, money and trust.

Start by splitting the problem

There are two distinct tasks that require different tooling and success metrics:

  • Schema-driven extraction: You supply a JSON schema (invoice_number, total, due_date) and the extractor returns values for those fields. Primary metric is per-field accuracy. For SLAs you also need full-document accuracy, the share of documents where every required field is correct.
  • Document parsing: The goal is faithful reconstruction of reading order, layout, tables, formulas and code blocks into Markdown, HTML or lossless JSON for indexing and analytics. Primary metric is structural fidelity, for example table correctness, reading-order preservation and chunking quality for RAG.

Think of schema extraction like a keyed lock: either the extractor opens the specified fields or it doesn’t. Parsers are more like scanning a room for objects, they give a richer scene but require more normalization to become reliable facts.

Why open weights and local models still matter

Enterprise documents often contain sensitive PII, contracts, or regulated finance data. Local hosting or open weights matter for two reasons: privacy/compliance and cost control at scale. Proprietary hosted APIs can be convenient, but sending thousands or millions of pages off-premises creates compliance headaches and potentially large recurring bills. Open-source options reduce per-page fees and give you more control, at the price of integration work, operational effort and license review.

Licenses are important operational constraints. Several projects publish model cards and license notes that include startup thresholds or additional conditions. Read the LICENSE and model-card files and get legal sign-off before commercial deployment.

Nine open-source projects to know (short, practical notes)

Grouped by primary use: schema extraction tools first, then parsers and general OCR/multimodal models. Each entry: what it does, one headline stat reported by the project, and the practical takeaway.

Schema-driven extractors

lift (Datalab)

What: A 9B vision model built for schema-constrained decoding. Feed a JSON schema, get populated JSON. Headline (reported by Datalab): 90.2% field accuracy on a 225-document internal benchmark, median latency ~9.5s, full-document accuracy 20.9% (Datalab’s internal benchmark). License note: code Apache-2.0. Weights use a modified OpenRAIL-M variant with startup thresholds reported by the project. Practical takeaway: excellent per-field performance on Datalab’s suite, but full-document success can lag. Test on your templates and verify license terms before production use.

NuExtract 3 (NuMind)

What: A ~4B multimodal model (Qwen backbone) that combines structured extraction and content parsing. Headline (reported): ~81.5% field accuracy in comparative tables. Integration: serves via vLLM with an OpenAI-compatible API and offers a Python SDK. Practical takeaway: a compact schema-capable model that’s easy to serve. Confirm the model card and license for your commercial use case.

Document parsers (layout, tables, reading order)

Docling (IBM / LF AI & Data Foundation)

What: A pipeline focused on preserving layout and reading order and producing Markdown, HTML, lossless JSON and DocTags (equations as LaTeX where relevant). Headline: MIT-licensed project with an enterprise managed offering via IBM’s watsonx. Practical takeaway: a good choice when fidelity for RAG indexing and table/formula preservation matters. Consider the managed option if you need SLA and integration help.

Granite-Docling-258M (IBM)

What: A lightweight 258M VLM used inside Docling for one-shot conversions. Headline (reported): ~0.35s per page on an A100. Practical takeaway: fast and low-cost parser for many native PDFs, ideal when throughput and latency matter.

MinerU / MinerU2.5-Pro (OpenDataLab / Shanghai AI Laboratory)

What: ~1.2B VLM focused on high-resolution parsing, cross-page tables and charts. Headline: MinerU’s changelog documents a license change from AGPL-3.0 to a custom “MinerU Open Source License” based on Apache-2.0 to ease commercial deployment. Practical takeaway: strong at complex, multi-page table merges. Review the custom license closely despite the Apache lineage.

OCR-specialized and multimodal generalists

Marker (Datalab)

What: A conversion pipeline that accepts PDF/image/PPTX/DOCX/XLSX/HTML/EPUB with an optional LLM enhancement flag (–use_llm). Headline (reported): ~76.1 on the olmOCR-Bench for the baseline Marker pipeline; code uses GPL-3.0 while published weights use a modified OpenRAIL-M variant with startup thresholds. Practical takeaway: broad-format support and a good option if you want an all-in-one pipeline. License constraints merit legal review.

olmOCR 2 (Allen Institute for AI / Ai2)

What: A 7B OCR-specialized VLM trained with reinforcement learning from verifiable rewards. Headline (reported by Ai2): 82.4 on olmOCR-Bench; Ai2 published a rough hosting cost estimate of about $178 per million pages on your own GPUs. Practical takeaway: strong OCR accuracy (English-focused). Use Ai2’s cost estimate as a planning baseline and validate multilingual behavior for your locales.

DeepSeek-OCR / DeepSeek-OCR2 (DeepSeek)

What: 3B mixture-of-experts OCR models (Oct 2025 and Jan 2026 releases) introducing a technique they call “contexts optical compression” to compactly represent text-rich pages. Headline (reported): MoE decoder activates large expert subsets per token (project-reported figures describe hundreds of millions of active parameters) and the project claims 100+ language support. Practical takeaway: promising for high-density multilingual pages, but expect MoE serving complexity and verify tail-latency behavior for your throughput needs.

Qwen3-VL (Alibaba)

What: General multimodal series (sizes from a few billion to hundreds of billions of parameters). Headline: many sizes released under Apache-2.0 licensing; the VL models can be prompted to return Markdown/JSON but require more prompt engineering than purpose-built extractors. Practical takeaway: a flexible fallback when you need one model to handle many formats, but expect more post-processing to reach production-grade structure.

Benchmarks: numbers are directional, use your own data

Benchmarks are useful signals but rarely directly comparable. Differences in datasets, like native PDF versus scanned images, metrics such as per-field versus full-document, languages and preprocessing all matter. A few reported numbers illustrate the spread and show that context matters:

  • lift (Datalab): 90.2% field accuracy on Datalab’s 225-document internal benchmark; median latency ~9.5s; full-document accuracy 20.9% (Datalab’s internal suite).
  • NuExtract 3: ~81.5% field accuracy in comparative tables reported by the project.
  • olmOCR 2 (Ai2): 82.4 on olmOCR-Bench (English-focused).
  • Marker: ~76.1 on olmOCR-Bench for the baseline pipeline.
  • Qwen3.5-9B and other generalists report lower schema-extraction performance unless guided with careful prompting; proprietary models can report higher scores on certain benchmarks.

Operational rule: treat public numbers as directional and run a short, representative evaluation on your documents. A model’s headline score often hides high variance across templates, languages and scanned-quality levels.

Integration, deployment and operational realities

  • Serving stacks: vLLM, Hugging Face runtimes and containerized gRPC services are common. Several projects explicitly recommend vLLM for lower-latency local serving.
  • MoE trade-offs: Mixture-of-experts models reduce average per-token compute, but they introduce routing complexity and variable tail latencies. Plan capacity accordingly.
  • Licensing action item: Capture the model-card, LICENSE file and any vendor statements before downloading weights. Modified OpenRAIL-M variants and custom licenses can include startup revenue thresholds or “no-competitive-hosting” clauses. Have legal review them.
  • Cost planning: Use Ai2’s published $178 per million pages as a ballpark when evaluating GPU hosting economics, but adjust for OCR pre/post-processing, batching, template complexity and human-in-loop overhead.
  • Hybrid pipelines win: A robust production flow is usually OCR → parser → schema extractor → human-in-the-loop for low-confidence docs. This minimizes false positives and creates a feedback loop for model tuning.
  • Data controls: Even with local weights, treat parsed outputs as sensitive, encrypt storage, audit access, and document retention rules for regulated data.

A concise evaluation recipe (what to run this week)

  1. Assemble a representative sample of your documents (200 is a pragmatic default for mid-market, 100-1, 000 if you can). Include language, scan quality, tables and the main templates you see in production.
  2. Pick two models per category (one schema extractor, one parser) and run end-to-end tests. Measure per-field precision/recall, full-document accuracy, table fidelity, confidence scores, latency percentiles and cost per page on your hardware.
  3. Log common failure modes, such as mis-located fields, split tables and mis-ordered reading flow. Use those to design post-processing rules and human-review thresholds.
  4. Define an automation gate: set a confidence threshold that determines which docs can be auto-committed and which go to human review. Track these over time and shrink the human queue with targeted rules or fine-tuning.
  5. Legal checkpoint: confirm license terms for weights and model cards and get legal sign-off if thresholds or “no-competitive-hosting” clauses exist.

Decision checklist for leaders (3 quick bullets)

  • Identify the task first: fixed-field extraction for invoices and forms → schema extractor. Faithful layout, tables and RAG → parser.
  • Run a 200-document A/B pilot: one schema extractor + one parser, measure field/full-document accuracy and per-page cost on your hardware.
  • Define human-review SLAs and a roadmap for reducing human effort (rules, fine-tuning, template detection).

Key takeaways, questions you should ask (and short, honest answers)

  • Which task do I actually need: schema extraction or document parsing?

    Schema extraction when you need fixed fields (invoices, receipts, contract clauses). Document parsing when you need faithful structure for indexing, tables or formulas for downstream RAG/analytics.

  • Are open-source models production-ready?

    Yes for many use cases when combined with validation, human-in-the-loop controls and license review. Success depends on picking the right model class and testing on your documents.

  • Do benchmark scores give me the final answer?

    No. Benchmarks are heterogeneous. Run a short, representative evaluation on your own documents and report per-field precision/recall, full-document accuracy and latency percentiles.

  • What license red flags should I watch for?

    Custom licenses and modified OpenRAIL-M variants can include startup revenue thresholds or “no-competitive-hosting” clauses. Capture LICENSE and model-card files and escalate to legal before commercial deployment.

  • How should I balance cost and privacy?

    Host locally or in private clouds if compliance or sustained volume makes hosted APIs costly. Use published hosting estimates (e.g., Ai2’s rough $178 per million pages) as planning baselines, and include engineering and human-review costs in TCO.

Next practical step

Sample 200 of your real documents, run two schema extractors and two parsers locally, measure field and full-document accuracy, and choose the combination that minimizes human review while meeting compliance and license constraints. The numbers will show where to invest engineering time, template rules, fine-tuning or an incremental human-in-loop program, and get you from prototype to repeatable automation.