HTML extractor choice rewires LLM training—use multi-extractors to preserve tables and code

How a tiny HTML preprocessing choice rewires what LLMs learn

TL;DR: A new study from researchers at Apple, Stanford, and the University of Washington shows that the choice of HTML extractor—the tool that turns Common Crawl pages into plain text—dramatically changes which web pages and structured artifacts end up in LLM training data. Using the union of three popular extractors can boost token yield by roughly 70% (≈58% after deduplication) and preserves tables and code that single-extractor pipelines often lose. For product teams and executives, that means small preprocessing choices can create large capability and risk differences in your models.

Why HTML extraction matters (think of it as a lens)

HTML extraction is the step that converts raw web pages into plain text your models can read. It sounds like plumbing, but it’s actually the lens you use to look at the web. Change the lens and you change what the model sees—sometimes drastically.

“A seemingly trivial preprocessing step — choosing how to pull text from HTML — can substantially change which parts of the web a model actually learns from.” — (Apple / Stanford / UW, arXiv)

Most large datasets start from Common Crawl, then run an extractor to remove navigation, ads, or boilerplate. That clean text is filtered, deduplicated, and fed into training. The new study shows that which extractor you choose—not just what filters you apply afterward—determines large swathes of the dataset.

What the researchers measured

They compared three widely used HTML extractors:

  • Resiliparse — speed-optimized; preserved tables and code best.
  • Trafilatura — balanced; attempts Markdown-like conversions but sometimes loses important whitespace or cell content.
  • JusText — stopword-heuristic approach that often strips boilerplate but frequently removes tables and code blocks.

Using Common Crawl as the input corpus, the team measured page overlap, token yield before and after deduplication, and downstream benchmark performance (including WikiTableQuestions for table tasks and HumanEval for code generation). They also ran training experiments at the 7B scale to show how extractor choice changes real model behavior. Where pipeline details were left unspecified, the authors recommend that teams report extractor versions and configurations for reproducibility.
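The overlap and yield measurements can be sketched as a small script. This is an illustrative reconstruction, not the paper's actual code: the per-extractor outputs below are made-up stand-ins, tokens are approximated by whitespace splitting, and dedup is exact-match only.

```python
import hashlib

# Hypothetical per-extractor outputs: URL -> extracted plain text.
outputs = {
    "resiliparse": {"a.com": "price | qty\n1 | 2", "b.com": "hello world"},
    "trafilatura": {"a.com": "price qty 1 2", "c.com": "other page"},
    "justext":     {"b.com": "hello world"},
}

# Page overlap: fraction of URLs kept by more than one extractor.
all_urls = set().union(*(o.keys() for o in outputs.values()))
multi = [u for u in all_urls if sum(u in o for o in outputs.values()) > 1]
overlap = len(multi) / len(all_urls)

# Token yield before/after exact dedup (whitespace tokens; SHA-256 of
# whitespace-normalized text as the dedup key).
def norm_hash(text: str) -> str:
    return hashlib.sha256(" ".join(text.split()).encode()).hexdigest()

docs = [t for o in outputs.values() for t in o.values()]
raw_tokens = sum(len(t.split()) for t in docs)
seen, deduped_tokens = set(), 0
for t in docs:
    h = norm_hash(t)
    if h not in seen:
        seen.add(h)
        deduped_tokens += len(t.split())

print(overlap, raw_tokens, deduped_tokens)
```

At scale the same three quantities—overlap fraction, raw token yield, and post-dedup token yield—are what the study reports for the real extractors.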

Headline findings

  • Only 39% of pages were captured by more than one extractor; 61% of pages appeared in only a single extractor’s output.
  • Combining all three extractors increased raw token yield by about +71%. After deduplication the union still contained roughly +58% more tokens.
  • Concrete example: a 7B-model dataset grew from 193 billion tokens (Resiliparse-only) to 283 billion tokens when all three were combined.
  • Structured-content performance varied dramatically: on WikiTableQuestions, a 7B model trained on Resiliparse data scored 11.9, Trafilatura 3.7, and JusText 1.6.
  • Code generation suffered when code blocks were removed: JusText caused up to a 3.6 percentage-point drop on HumanEval versus Resiliparse.

“Using the union of multiple extractors can boost available tokens by roughly 70% while keeping benchmark performance intact.” — (Apple / Stanford / UW, arXiv)

Why structured content breaks—and why it matters

Tables, code blocks, and other formatted snippets need precise spacing, delimiters, and cell boundaries to remain useful. Extractors use very different heuristics:

  • Resiliparse prioritizes speed and rule-based retention of table and code markers, so it preserves structure better.
  • Trafilatura tries to convert layouts into readable Markdown-like text, which can be useful but sometimes scrubs or collapses cells and whitespace.
  • JusText uses stopword-density heuristics to strip boilerplate; that often removes the markers that indicate a table or code block altogether.
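The stopword-density idea is easy to see in a toy classifier. This is a simplified sketch of the heuristic, not jusText's actual algorithm, and the tiny stopword list is invented for the example, but it shows why code and tables get stripped: they contain almost no stopwords.

```python
STOPWORDS = {"the", "a", "of", "to", "and", "is", "in", "that", "for", "it"}

def looks_like_boilerplate(block: str, threshold: float = 0.25) -> bool:
    """Flag a text block as boilerplate when too few of its words are
    common stopwords. Code and table cells score near zero stopword
    density, so this heuristic discards them along with nav and ads."""
    words = block.lower().split()
    if not words:
        return True
    density = sum(w in STOPWORDS for w in words) / len(words)
    return density < threshold

prose = "The quick brown fox is one of the animals that appears in a pangram."
code = "def add(x, y):\n    return x + y"

print(looks_like_boilerplate(prose))  # many stopwords -> kept (False)
print(looks_like_boilerplate(code))   # no stopwords -> stripped (True)
```

The same density test that correctly drops a cookie banner also drops a Python function, which is exactly the failure mode the study measured on HumanEval.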

When your training data loses structure, models lose practice on parsing tables and code. That gap shows up in product features: data extraction agents miss columns, code assistants fail on examples that require exact indentation, and analytics models produce noisier answers when source tables are mangled.

“Structured artifacts like tables and code may be lost or degraded depending on your extractor; that materially affects downstream capabilities in table and code tasks.” — (Apple / Stanford / UW, arXiv)

Business trade-offs and risks

Choosing a multi-extractor pipeline increases coverage and preserves structure, but it comes with costs:

  • Compute & storage: More raw tokens and more intermediate outputs mean higher storage bills and longer preprocessing runs.
  • Complexity: More moving parts in the pipeline increase engineering maintenance and the risk of silently divergent behavior across versions.
  • Legal & safety exposure: Broader extraction can pull in more copyrighted or harmful content, increasing moderation and compliance costs.
  • Reproducibility: Two teams both claiming “Common Crawl” can end up training on materially different corpora if they use different extractors or configs.

Practical checklist: what to ask your vendors and teams

  • Which extractor(s), versions, and configs were used?
    Ask for explicit extractor names, version numbers, and configuration thresholds (e.g., minimum content length, language filters).
  • Token counts before and after deduplication?
    Request raw token yield, token yield after dedupe, and the deduplication method (hashing, embedding similarity, thresholds).
  • Structured-content proportion and samples?
    Ask what fraction of the corpus contains tables, code blocks, or similar artifacts and request representative samples.
  • Benchmarks run and targeted metrics?
    Verify if they ran WikiTableQuestions, HumanEval, or product-specific tests. Ask for side-by-side results with different extraction pipelines.
  • Safety and license filtering?
    Which takedown lists, license filters, or human-in-the-loop reviews are applied? How are copyright and PII handled?
  • Pipeline manifest and reproducibility log?
    Insist on a manifest containing extractor names/versions, dedupe hashes, sampling seeds, and timestamps for auditability.
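A minimal manifest covering the checklist items might look like the following. Every field name, version string, and crawl label here is illustrative—there is no standard schema, so adapt the shape to your own pipeline.

```python
import json
from datetime import datetime, timezone

# Illustrative dataset manifest for auditability. All values below are
# examples, not real versions or a standardized format.
manifest = {
    "source": "Common Crawl (hypothetical snapshot label)",
    "extractors": [
        {"name": "resiliparse", "version": "<pin exact version>", "config": {"preserve_formatting": True}},
        {"name": "trafilatura", "version": "<pin exact version>", "config": {"include_tables": True}},
        {"name": "justext", "version": "<pin exact version>", "config": {"stoplist": "English"}},
    ],
    "filters": {"min_content_length": 200, "languages": ["en"]},
    "dedup": {"method": "sha256 + embedding similarity", "similarity_threshold": 0.9},
    "sampling_seed": 42,
    "created_at": datetime.now(timezone.utc).isoformat(),
}

print(json.dumps(manifest, indent=2))
```

Checking a manifest like this into version control alongside the dataset gives auditors and downstream teams a single artifact that answers most of the checklist above.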

Concrete mitigation and architecture patterns

Based on the findings, the recommended pattern is clear: aggregate multiple complementary extractors, apply strict per-page filters, deduplicate aggressively, and then run targeted audits. The paper found that combining strict filters across multiple extractors outperformed simply loosening thresholds on a single extractor—especially when data is scarce.

Operationally:

  • Run two to three extractors in parallel and take the union before dedupe.
  • Deduplicate using content hashes plus embedding similarity to catch paraphrase-level duplicates.
  • Label and sample pages with structured content for human review; build classifiers to detect tables and code automatically.
  • Maintain a dataset manifest so your training data is auditable and reproducible.

A quick vignette: the sales AI that misread invoices

A mid-market CRM vendor deployed an automated assistant to read uploaded invoices and extract line items for accounting. The model had been trained on a Common Crawl–derived corpus where the extractor had stripped table cells and collapsed spacing. In production the assistant consistently merged columns and misaligned prices with item names, producing reconciliation errors and missed invoices. A dataset audit revealed the extractor choice had removed the very structure the product relied on. Re-training with a multi-extractor pipeline that preserved tables fixed the issue—at the cost of extra preprocessing and legal review for newly included pages.

Data governance and final recommendations

Dataset curation is now a strategic lever for product differentiation. For leaders building or buying models—especially for code assistants, analytics, or any feature that depends on tables or code—don’t treat extraction as plumbing. Treat it as signal selection.

  • Audit training pipelines and ask for extractor manifests.
  • Run targeted benchmarks that reflect your product (e.g., WikiTableQuestions for table tasks, HumanEval for code assistants).
  • Adopt a multi-extractor + strict-filtering + deduplication pattern if you need broader, higher-quality coverage of structured content.
  • Budget for the added compute, storage, and legal/audit work—extra tokens aren’t free.

If you want help turning this into a one-page vendor questionnaire or a short benchmarking plan tailored to a product (sales AI, code assistant, analytics), I can draft that for your team.