BigSet: Turn Plain-English Requests into Live, Structured Datasets
TL;DR: BigSet uses LLMs and constrained AI agents to infer a table schema from plain-English instructions, discover matching web targets, and run parallel, budgeted sub-agents to extract, dedupe, and export live datasets—no hand-coded scrapers required. It’s open-source (AGPL-3.0), self-hosted via Docker, and best suited for rapid prototyping of prospecting, monitoring, and competitive-intel workflows.
Why this matters for business leaders
Sales teams, product managers, and competitive-intel squads spend hours stitching lists together: startup directories, hiring signals, pricing pages, supplier inventories. Building reliable scrapers is slow and brittle. BigSet reframes that work by letting you describe the table you want in plain English and returning a downloadable dataset populated from live web sources. The upside is speed and repeatability; the catch is operationalizing model and scraping costs, plus governance for legal and privacy risks.
Scenario: your SDR team needs an up-to-date list of 1,000 AI startups hiring in the U.S., with careers pages and LinkedIn headcounts. Instead of shipping a feature request to engineering, you type a sentence, validate a proposed schema, and let BigSet fetch and refresh that table on a cadence. That’s the user story BigSet targets.
What BigSet actually does
- Natural-language → schema: You describe the table in plain English and BigSet proposes a schema (column names and types).
- Discovery: An orchestrator finds candidate entities on the web (company pages, GitHub, Crunchbase, news articles).
- Parallel extraction: The system fans out sub-agents (one per row) to fetch pages, extract fields, and return results.
- Deduplication & provenance: Rows are deduped by primary key and each row includes a source URL for auditing.
- Scheduled refreshes & exports: Refresh cadences range from 30 minutes to weekly, and results export as CSV/XLSX (SQL and agent-native APIs are planned).
Describe what you want in plain English, and BigSet will infer the table layout, discover matching entities, and populate rows with sourced data.
How it works — a simple metaphor and flow
Think of BigSet as a small field operation:
- The planner (schema-inference agent) draws the blueprint for the table.
- The scout (orchestrator) finds promising targets on the web.
- Field-workers (parallel sub-agents) each visit one target, gather the fields, and return a row.
- A supervisor (the infrastructure layer) consolidates rows, dedupes, attaches source URLs, and writes only via a safe API.
Architecturally, BigSet uses a two-tier multi-agent pattern: first a schema-inference agent decides the columns, then an orchestrator discovers candidates and fans out many budget-limited sub-agents to extract values. Separating planning from execution reduces unnecessary crawling and focuses the work on the fields the schema needs.
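The fan-out pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BigSet's actual implementation: `ToolBudget`, `fake_fetch`, `sub_agent`, and `orchestrate` are hypothetical names, the fetch is a stub that does no network I/O, and the only detail taken from the article is the cap of six tool calls per sub-agent.

```python
import asyncio
from dataclasses import dataclass

MAX_TOOL_CALLS = 6  # per-sub-agent cap described in the article


class ToolBudgetExceeded(RuntimeError):
    """Raised when a sub-agent tries to exceed its tool-call budget."""


@dataclass
class ToolBudget:
    limit: int = MAX_TOOL_CALLS
    used: int = 0

    def spend(self) -> None:
        # Refuse the call once the budget is exhausted.
        if self.used >= self.limit:
            raise ToolBudgetExceeded(f"tool-call limit of {self.limit} reached")
        self.used += 1


async def fake_fetch(url: str, budget: ToolBudget) -> str:
    """Hypothetical tool: stands in for a real page fetch."""
    budget.spend()
    await asyncio.sleep(0)  # yield control, as real I/O would
    return f"<html>content of {url}</html>"


async def sub_agent(target_url: str, columns: list[str]) -> dict:
    """One sub-agent per candidate: produce one row for the planned columns."""
    budget = ToolBudget()
    page = await fake_fetch(target_url, budget)
    row = {col: None for col in columns}  # real extraction would fill these from `page`
    row["source_url"] = target_url
    row["_page_chars"] = len(page)
    row["_tool_calls"] = budget.used
    return row


async def orchestrate(candidates: list[str], columns: list[str]) -> list[dict]:
    """Fan out one sub-agent per candidate and gather the rows in order."""
    return await asyncio.gather(*(sub_agent(url, columns) for url in candidates))


rows = asyncio.run(
    orchestrate(["https://a.example", "https://b.example"], ["company_name"])
)
```

Because each sub-agent owns its own budget object, one runaway row cannot consume another row's allowance, and the planner's column list is the only thing a sub-agent is asked to fill.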
Key technical facts
- License: AGPL-3.0 (repo: tinyfish-io/bigset).
- Defaults: schema inference uses anthropic/claude-sonnet-4.6 (via OpenRouter); orchestration defaults to qwen/qwen3.7-max (via OpenRouter). Models and routing are configurable via environment variables.
- Discovery and page fetching are powered by TinyFish Search & Fetch; the backend uses Postgres and Convex; Clerk handles authentication.
- Sub-agent constraints: hard cap of 6 tool calls per sub-agent to limit runaway browsing and costs.
- Every row includes a source URL; primary-key deduplication collapses duplicate entities.
- Typical initial dataset generation: ~2–5 minutes for moderate-sized tables (time scales with rows and rate limits).
- Refresh cadences supported: 30 minutes, 6 hours, 12 hours, daily, weekly.
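To make the primary-key deduplication and per-row provenance concrete, here is a small Python sketch. It is an assumption-laden illustration rather than BigSet's merge logic: the function name `dedupe_rows`, the case-insensitive key normalization, the "first non-empty value wins" rule, and the `source_urls` list are all hypothetical choices.

```python
def dedupe_rows(rows, primary_key):
    """Collapse rows that share a primary key, keeping every source URL."""
    merged = {}
    for row in rows:
        # Assumed normalization: trim whitespace and ignore case.
        key = str(row[primary_key]).strip().lower()
        if key not in merged:
            merged[key] = {**row, "source_urls": [row["source_url"]]}
            continue
        kept = merged[key]
        if row["source_url"] not in kept["source_urls"]:
            kept["source_urls"].append(row["source_url"])
        # Fill gaps in the kept row; never overwrite an existing value.
        for field, value in row.items():
            if kept.get(field) in (None, "") and value not in (None, ""):
                kept[field] = value
    return list(merged.values())


raw = [
    {"company_name": "Acme", "hq_city": None, "source_url": "https://a.example"},
    {"company_name": "acme ", "hq_city": "Austin", "source_url": "https://b.example"},
    {"company_name": "Beta", "hq_city": "Boston", "source_url": "https://c.example"},
]
deduped = dedupe_rows(raw, "company_name")
```

Keeping all contributing URLs, rather than only the first, makes it easier to audit where a merged value came from.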
Example prompt and an inferred schema
Example plain-English request:
“Create a table of US-based AI startups founded after 2018 with columns: company name, HQ city, founding year, latest funding round, estimated headcount, careers page URL.”
BigSet might infer this schema:
- company_name (string)
- hq_city (string)
- founding_year (integer)
- latest_funding_round (string)
- estimated_headcount (string)
- careers_page_url (url)
- source_url (url)
Seeing the proposed schema is a useful approval step: it prevents wasted crawling and surfaces ambiguous fields (e.g., “estimated headcount” — scraped from LinkedIn vs. company pages?).
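One way to picture the inferred schema is as a typed record with a few sanity checks. The sketch below is illustrative only: the `StartupRow` dataclass mirrors the column list above, but the class name, the `validate_row` helper, and the specific checks are assumptions, not part of BigSet.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class StartupRow:
    company_name: str
    hq_city: str
    founding_year: int
    latest_funding_round: str
    estimated_headcount: str
    careers_page_url: str
    source_url: str


URL_FIELDS = ("careers_page_url", "source_url")


def is_http_url(value: str) -> bool:
    """True when the value parses as an http(s) URL with a host."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def validate_row(row: StartupRow) -> list[str]:
    """Return a list of human-readable problems; empty means the row passes."""
    problems = []
    # The example request asks for companies founded after 2018.
    if not isinstance(row.founding_year, int) or row.founding_year <= 2018:
        problems.append("founding_year must be an integer after 2018")
    for name in URL_FIELDS:
        if not is_http_url(getattr(row, name)):
            problems.append(f"{name} is not an http(s) URL")
    return problems


good = StartupRow(
    company_name="Acme AI",
    hq_city="Austin",
    founding_year=2021,
    latest_funding_round="Series A",
    estimated_headcount="51-200",
    careers_page_url="https://acme.example/careers",
    source_url="https://acme.example/about",
)
```

Checks like these are cheap to run on a pilot batch and tend to surface ambiguous columns before a large crawl.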
Running BigSet: requirements, costs, and latencies
Minimum setup: Docker, Make, and three API keys (TinyFish, OpenRouter, Clerk). The repo includes a Makefile to boot a dev stack (Postgres, Convex, frontend/backend, Mastra).
OpenRouter is pay-as-you-go; the README recommends provisioning $5–10 in credits to get started. Row operations and model calls consume credits and may be subject to quotas (an example quota noted was 2,500 rows/month).
Example cost pro-forma (illustrative)
Cost = orchestration costs + per-row extraction costs + TinyFish fetch costs + OpenRouter routing overhead.
Plug-in assumptions you can adjust:
- Average LLM calls per row: 3 (one extraction call plus two verification/cleanup calls)
- Average tokens per call: 1,000
- Model price: $0.03 per 1k tokens (example; actual rates vary by model)
Then approximate per-row LLM cost = 3 calls × 1,000 tokens per call × $0.03 per 1k tokens = $0.09 per row.
So, 1,000 rows ≈ $90 in LLM compute (plus TinyFish fetch costs and OpenRouter fees). If you reduce tokens-per-call or calls-per-row (via tighter schemas or lighter parsing), costs drop proportionally. These numbers are illustrative—run a small pilot and measure token usage per row to refine your estimate.
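The pro-forma above is easy to turn into a small calculator for trying different assumptions. This Python sketch covers only the LLM portion; the function name `estimate_llm_cost` is hypothetical, and the default values are the illustrative figures from the text, not measured BigSet numbers.

```python
def estimate_llm_cost(rows, calls_per_row=3, tokens_per_call=1000,
                      price_per_1k_tokens=0.03):
    """Return (per-row cost, total cost) in dollars for LLM calls only.

    Excludes TinyFish fetch costs and OpenRouter fees, as in the pro-forma.
    """
    per_row = calls_per_row * (tokens_per_call / 1000) * price_per_1k_tokens
    return per_row, per_row * rows


per_row, total = estimate_llm_cost(rows=1000)
print(f"per row: ${per_row:.2f}, 1,000 rows: ${total:.2f}")

# Halving tokens per call halves the estimate.
_, leaner_total = estimate_llm_cost(rows=1000, tokens_per_call=500)
print(f"with 500 tokens per call: ${leaner_total:.2f}")
```

After a pilot run, replace the defaults with the measured calls per row and tokens per call.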
Latency guidance: simple 50–100 row jobs commonly finish in a few minutes. For hundreds to thousands of rows, runtime grows roughly linearly with row count and is bounded by the available parallelism and external rate limits. Expect longer runtimes when pages involve complex navigation, paywalls, or captchas.
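For rough planning, the latency guidance can be modeled as work done in parallel "waves", capped by an external rate limit. This is a deliberately simple back-of-the-envelope model, not a description of BigSet's scheduler: the function name and all parameters (`parallelism`, `seconds_per_row`, `max_rows_per_minute`) are assumptions you would replace with measured values.

```python
import math


def estimate_runtime_seconds(rows, parallelism, seconds_per_row,
                             max_rows_per_minute=None):
    """Estimate wall-clock seconds for a job under a simple wave model."""
    # Rows are processed in waves of `parallelism` concurrent sub-agents.
    waves = math.ceil(rows / parallelism)
    compute_bound = waves * seconds_per_row
    if max_rows_per_minute is None:
        return compute_bound
    # An external rate limit sets a floor on total runtime.
    rate_bound = rows / max_rows_per_minute * 60
    return max(compute_bound, rate_bound)


unthrottled = estimate_runtime_seconds(rows=100, parallelism=10, seconds_per_row=20)
throttled = estimate_runtime_seconds(rows=100, parallelism=10, seconds_per_row=20,
                                     max_rows_per_minute=20)
```

In this model, adding parallelism only helps until the rate limit becomes the binding constraint.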
Security, safety, and compliance
BigSet builds practical defenses into its design:
- Tool-call caps: Sub-agents are limited to six tool calls to reduce runaway browsing and limit attack surface.
- Infrastructure-enforced writes: Dataset write permissions are held outside the model process (closed-over dataset identifiers), so an agent cannot directly mutate arbitrary datasets via prompt-injection.
- Per-row provenance: Each row stores a source URL, making it easier to audit, debug, and verify extraction results.
The system separates schema planning from web access and keeps dataset write rights in infrastructure, preventing models from arbitrarily modifying tables.
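The "closed-over dataset identifier" idea can be illustrated with a plain Python closure. This is a sketch of the general technique, not BigSet's code: `DATASETS` is an in-memory stand-in for a real store, and `make_row_writer` and its column allow-list are hypothetical.

```python
from typing import Callable

# In-memory stand-in for real dataset storage (assumption for illustration).
DATASETS: dict[str, list[dict]] = {"ds_sales": [], "ds_private": []}


def make_row_writer(dataset_id: str, allowed_columns: set[str]) -> Callable[[dict], None]:
    """Build a write function bound to exactly one dataset.

    The agent receives only the returned function, so it has no parameter
    through which it could name a different dataset.
    """
    def write_row(row: dict) -> None:
        unknown = set(row) - allowed_columns
        if unknown:
            raise ValueError(f"unexpected columns: {sorted(unknown)}")
        DATASETS[dataset_id].append(dict(row))

    return write_row


# Infrastructure code creates the writer; the agent only ever sees `write_row`.
write_row = make_row_writer("ds_sales", {"company_name", "source_url"})
write_row({"company_name": "Acme AI", "source_url": "https://acme.example"})
```

Even if injected page text persuades the model to "write to ds_private", the only tool it holds cannot express that request.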
Legal and compliance checklist (practical items to evaluate before production use):
- Confirm target sites’ terms of service and robots.txt policies; get legal sign-off for high-risk sources.
- Watch for PII—mask or exclude personally identifiable information unless you have a lawful basis and controls.
- Consider rate-limiting and polite crawling policies to avoid IP bans; use IP rotation and throttles if necessary.
- Plan for monitoring and alerts on extraction failures, spikes in token usage, and suspicious agent behavior.
Limitations and open questions
BigSet is powerful for prototyping and many production use cases, but some gaps remain for strictly regulated environments:
- Quality: extracted values depend on page structure and signal quality—discrepancies across sources require reconciliation logic.
- Scale: costs and rate limits matter at tens or hundreds of thousands of rows—expect to budget accordingly and consider hybrid strategies (managed APIs for high-volume sources).
- Enterprise features: RBAC, fine-grained audit logs, SQL querying, lineage views, and SLAs are not yet first-class—teams will likely build these on top.
- Paywalled or authenticated content: additional tooling is required to access authenticated sources securely and ethically.
Where BigSet delivers the most ROI (use cases)
- AI for sales / prospecting: Build and refresh target lists (companies hiring for “research engineer” roles, firms raising seed rounds) and push exports into CRMs or outreach tools.
- Competitive intelligence: Track product pricing, feature announcements, or job postings across competitors.
- Retail & inventory monitoring: Monitor GPU prices, stock levels, or reseller listings for price arbitrage and procurement signals.
- Supplier and vendor discovery: Aggregate supplier directories and validate contact info for sourcing teams.
Business impact is mostly time savings and reduced engineering overhead. Instead of three sprints to build and maintain scrapers, a small team can prototype a dataset in hours and iterate based on source quality and schema adjustments.
Integrations and practical next steps
- Exports: CSV/XLSX for direct ingestion into CRMs (Salesforce/HubSpot), BI tools (Looker/Tableau), or data lakes (S3).
- Automation: schedule refreshes and wire the export into an ETL job or a webhook to trigger downstream workflows.
- Monitoring & ops: log token usage, failed row counts, and source-level error rates; export provenance to simplify triage.
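As one example of the monitoring item above, source-level error rates can be computed directly from row provenance. The sketch assumes each row carries a `source_url` (as described in the article) plus a hypothetical `error` field set when extraction failed; that field and the function name are illustrative, not part of BigSet's export format.

```python
from collections import Counter
from urllib.parse import urlparse


def source_error_rates(rows):
    """Return {host: failed_rows / total_rows} grouped by source host."""
    totals, failures = Counter(), Counter()
    for row in rows:
        host = urlparse(row["source_url"]).netloc
        totals[host] += 1
        if row.get("error"):  # hypothetical failure marker
            failures[host] += 1
    return {host: failures[host] / totals[host] for host in totals}


sample = [
    {"source_url": "https://a.example/1", "error": None},
    {"source_url": "https://a.example/2", "error": "timeout"},
    {"source_url": "https://b.example/1"},
]
rates = source_error_rates(sample)
```

Grouping by host makes it easier to spot a single source that has changed its layout or started blocking requests.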
Quick start checklist
- Clone the repo: tinyfish-io/bigset.
- Provision API keys: TinyFish, OpenRouter, and Clerk; add initial OpenRouter credits ($5–10 recommended for a pilot).
- Run the dev stack via Make and Docker; try one of the nine sample datasets to see end-to-end behavior.
- Measure token usage per row on a 50–100 row run to produce an accurate cost estimate.
- Validate legal/compliance checklist for your target sources before scaling.
Final takeaways
BigSet demonstrates a useful pattern for AI automation: use LLMs as planners to define structure, then execute extraction with constrained, observable agents and enforce writes at the infrastructure layer. For teams evaluating AI for business or AI for sales applications, BigSet is a strong prototype platform: it reduces bespoke scraping work, surfaces provenance by design, and gives teams a quick path to live datasets. Expect to pair it with governance, monitoring, and cost controls before you push it into regulated production.