How Pushpay and AWS Built Production-Ready Agentic AI with Amazon Bedrock

TL;DR
  • Pushpay transformed a 60–70% accuracy prompt prototype into a production AI search that delivers sub-4s time-to-insight (≈15× faster) and ~95% accuracy for high-impact domains by combining a golden dataset, domain-level observability, and Bedrock features like prompt caching.
  • Core ingredients: a natural-language UI, an AI search agent (system prompt + dynamic prompt constructor using semantic search), Claude Sonnet 4.5 on Amazon Bedrock, an LLM-as-a-judge closed loop, and dashboards with domain metrics and latency percentiles (p50–p90).
  • Practical lesson: production-ready AI agents are a product-engineering problem—measurement, controlled rollouts, and governance matter more than endless prompt tinkering.

The problem: prototype plateau, not a prompt puzzle

Pushpay—serving churches and faith-based organizations—built a natural-language query layer to surface donor and engagement insights. Early prototypes relied on prompt engineering and hit a plateau: roughly 60–70% “accuracy” on business queries. Instead of continuing the prompt loop, Pushpay and AWS reframed the effort as product engineering: add measurement, observability, and a controlled rollout strategy.

“A production-ready agent demands more than clever prompts — it needs a scientific, data-driven evaluation and observability foundation.”

What “production-ready agent” meant for Pushpay

  • Fast, deterministic responses to plain-English queries with structured outputs (JSON) that downstream systems can consume.
  • High accuracy for customer-facing domains—Pushpay targeted ~95% for high-impact areas—while degrading or suppressing weaker categories during rollout.
  • Safety and governance: PII and pastoral data are protected and never used to train external models under the AWS Shared Responsibility Model and Pushpay’s internal standards.

Key definitions (short and practical)

  • Agentic AI: An AI that executes multi-step tasks or queries on behalf of a user, often coordinating search, reasoning, and actions.
  • Semantic search: Finding relevant context (documents, filters) based on meaning, not keyword matches.
  • Prompt caching: Reusing large, stable prompt context across calls so the model does not reprocess the same tokens every time, lowering latency and cost.
  • LLM-as-a-judge: Using an LLM to compare model outputs to expected answers and flag mismatches for human review.
  • Wilson score interval: A way to put a confidence interval around an observed proportion (such as accuracy) that stays reliable at small sample sizes (a short Python sketch follows this list).
  • Latency p50–p90: Median (p50) to 90th-percentile (p90) response-time measurements—used to decide UX trade-offs.
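
To make the last two definitions concrete, here is a minimal, self-contained Python sketch of a 95% Wilson score interval and p50/p90 latency percentiles. It is illustrative only; the article does not describe Pushpay's actual metric code.

import math
from statistics import quantiles

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed accuracy of correct/total."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50 and p90 from a list of response times in milliseconds."""
    cuts = quantiles(latencies_ms, n=10, method="inclusive")  # nine decile cut points
    return {"p50": cuts[4], "p90": cuts[8]}

# 19 correct out of 20 looks like 95% accuracy, but the interval is still wide:
print(wilson_interval(19, 20))   # roughly (0.76, 0.99)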

Architecture — simple flow

High-level components:

  • Natural-language UI accepting plain-English queries
  • Semantic search that selects relevant filters and context from 100+ configurable filters
  • Dynamic prompt constructor that builds a compact, structured prompt for the LLM
  • Amazon Bedrock running Claude Sonnet 4.5 to produce structured JSON outputs
  • Prompt caching layer to reuse stable context and reduce tokens
  • Closed-loop evaluator: LLM-as-a-judge + human validation feeding back into the golden dataset
  • Domain-level dashboards showing accuracy with Wilson intervals and latency p50–p90

Request flow:

User UI
  ↓
Semantic search → selects filters/context (from 100+)
  ↓
Dynamic prompt constructor (+ cached prompt fragments)
  ↓
Amazon Bedrock (Claude Sonnet 4.5) → Structured JSON
  ↓
LLM-as-a-judge → compare vs golden dataset → human review if flagged
  ↓
Dashboards & rollout controls (suppress weak domains)
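
To ground the middle of this flow, here is a hedged sketch of how a dynamic prompt constructor might call Claude on Amazon Bedrock through the Converse API, using a cachePoint block to mark the stable schema context for prompt caching. The model ID, schema text, and prompt assembly are placeholders, not Pushpay's actual configuration.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

# Placeholder: substitute the Claude Sonnet 4.5 model ID or inference profile
# available in your account and region.
MODEL_ID = "<claude-sonnet-4-5-model-id>"

# Large, stable context (output schema, filter catalog excerpts) goes before the
# cache point so repeated calls can reuse it; only the per-query text changes.
STABLE_CONTEXT = "You translate plain-English questions into this JSON query schema: ..."

def run_search_agent(user_query: str, selected_filters: list[str]) -> dict:
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[
            {"text": STABLE_CONTEXT},
            {"cachePoint": {"type": "default"}},  # prompt-caching marker
        ],
        messages=[{
            "role": "user",
            "content": [{"text": f"Filters in scope: {selected_filters}\nQuestion: {user_query}"}],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0},
    )
    # The agent contract expects structured JSON in the model's text output.
    return json.loads(response["output"]["message"]["content"][0]["text"])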

A concrete example (query → structured JSON)

Plain-English query:

“Show me members who missed the last two Sunday services but made a donation in the last 90 days.”

Sample structured JSON response (trimmed):

{
  "query_id": "q-20260127-001",
  "filters_applied": {
    "attendance_missing": 2,
    "donation_window_days": 90
  },
  "results": [
    {"member_id":"m-102", "name":"A. Johnson", "last_donation":"2026-01-04", "missed_services":2},
    {"member_id":"m-208", "name":"R. Lee", "last_donation":"2025-12-18", "missed_services":2}
  ],
  "confidence": 0.92,
  "latency_ms": 380
}

How accuracy was judged: the output fields are compared to expected values in Pushpay’s golden dataset; the LLM-as-a-judge and human validators determine whether each returned field matches the ground truth (field-level correctness).
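
As a rough illustration (not Pushpay's code), a deterministic field-level comparison of an agent response against the golden dataset's expected values could look like this; the field names follow the sample output above.

def field_level_score(actual: dict, expected: dict) -> dict:
    """Compare each expected field against the agent output so partial correctness is visible."""
    per_field = {field: actual.get(field) == value for field, value in expected.items()}
    matched = sum(per_field.values())
    return {
        "per_field": per_field,
        "field_accuracy": matched / len(expected) if expected else 1.0,
        "exact_match": matched == len(expected),
    }

# Expected values come from the golden dataset entry for this query.
expected = {"attendance_missing": 2, "donation_window_days": 90}
actual = {"attendance_missing": 2, "donation_window_days": 60}
print(field_level_score(actual, expected))
# -> one field wrong: field_accuracy 0.5, exact_match False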

Golden dataset and the closed-loop evaluator

Pushpay started with 300+ curated, representative queries and expected structured outputs. That dataset is continuously expanded with validated, real user queries. Key practices:

  • Seed the dataset with high-value, domain-representative queries (donor lookups, attendance joins, engagement cohorts).
  • Use an LLM-as-a-judge to run fast automated comparisons; flag mismatches for human review rather than auto-accepting the judge’s call (a minimal loop is sketched after this list).
  • Track field-level correctness rather than only exact-match strings so partial correctness is visible and actionable.
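
A minimal sketch of that closed loop follows. The run_agent and llm_judge_verdict helpers are assumptions standing in for the real agent call and judge prompt; the point is the routing rule: nothing is auto-accepted on the judge's word alone.

def evaluate_golden_dataset(golden_cases, run_agent, llm_judge_verdict):
    """Run each golden query through the agent and route anything doubtful to humans.

    golden_cases: iterable of {"query": str, "expected": dict}
    run_agent(query) -> dict                     # hypothetical agent call
    llm_judge_verdict(actual, expected) -> str   # hypothetical judge: "match" / "mismatch"
    """
    auto_accepted, human_review_queue = [], []
    for case in golden_cases:
        actual = run_agent(case["query"])
        exact = all(actual.get(k) == v for k, v in case["expected"].items())
        verdict = llm_judge_verdict(actual, case["expected"])
        if exact and verdict == "match":
            auto_accepted.append(case)
        else:
            # Flag for human review; validated cases later expand the golden dataset.
            human_review_queue.append({"case": case, "actual": actual, "verdict": verdict})
    return auto_accepted, human_review_queue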

Observability: domain-level metrics and rollout controls

Aggregate accuracy hides failure modes. Pushpay instrumented domain-level dashboards that display:

  • Per-domain accuracy with Wilson score intervals (95% confidence)
  • Sample size and recent trend
  • Latency distribution (p50–p90)
  • Examples of flagged failures for quick triage

“Domain-level visibility uncovered weaknesses that aggregate accuracy scores had hidden, enabling targeted fixes and safer rollouts.”

When a domain’s Wilson interval indicates low confidence (small sample + low accuracy), the team can suppress that category from customer-facing flows until fixes are implemented—reducing user exposure while the team iterates.
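
A hedged sketch of that gating rule: suppress a domain from customer-facing flows when its Wilson lower bound misses the target, or when the sample is too small to trust. The thresholds are illustrative, not Pushpay's.

import math

def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower edge of the 95% Wilson interval (same math as the earlier sketch)."""
    if total == 0:
        return 0.0
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - margin)

def should_suppress(correct: int, total: int,
                    target: float = 0.90, min_samples: int = 25) -> bool:
    """Hide the domain until we are confident its true accuracy clears the target."""
    return total < min_samples or wilson_lower_bound(correct, total) < target

# 18/20 looks like 90% raw accuracy, but the sample is small and the
# interval's lower bound is near 0.70, so the domain stays suppressed.
print(should_suppress(18, 20))  # True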

Production levers that moved metrics

  • Prompt caching: avoided reprocessing large, stable prompt fragments on every call, cutting token cost and latency, which matters when prompts include multi-field schemas and context.
  • Semantic search: selected a compact, relevant set of filters for each query so the prompt stayed small and precise (a selection sketch follows this list).
  • Domain suppression: hide underperforming categories from users during incremental rollout instead of launching an all-or-nothing feature.
  • Human-in-the-loop validation: used for edge cases, ambiguous outputs, and to expand the golden dataset.
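
The article does not describe Pushpay's retrieval stack, but as a loose sketch of the semantic-search lever, filter selection can be as simple as ranking filter descriptions by cosine similarity to the query and keeping a small top-k. The embed callable is an assumed placeholder for whichever embedding model is in use; in practice the filter embeddings would be precomputed rather than re-embedded per query.

import math
from typing import Callable

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_filters(query: str,
                   filter_descriptions: dict[str, str],
                   embed: Callable[[str], list[float]],
                   top_k: int = 5,
                   min_similarity: float = 0.3) -> list[str]:
    """Keep only the handful of filters (out of 100+) most relevant to this query."""
    q_vec = embed(query)
    scored = [(name, cosine(q_vec, embed(desc)))
              for name, desc in filter_descriptions.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, score in scored[:top_k] if score >= min_similarity]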

Measured impact

  • Time-to-insight improved from ~120 seconds of manual navigation to under 4 seconds with AI search, a speedup of at least 15×.
  • High-impact domains reached ~95% production accuracy by suppressing low-performing categories and prioritizing fixes.
  • Faster engineering cycles and safer rollouts due to clear, domain-level observability and a closed feedback loop.

One failure vignette: how dashboards prevented a bad rollout

Scenario: A complex “volunteer-impact” domain produced plausible-looking outputs but with incorrect mappings between volunteer IDs and event dates. Aggregate accuracy looked acceptable, but the domain dashboard showed a low Wilson interval and an uptick in flagged mismatches. The team suppressed that domain from the public flow, fixed the data-join logic in the prompt constructor, revalidated against the golden dataset, and re-enabled the feature at higher confidence—avoiding customer confusion and potential privacy exposure.

Governance, security, and operational costs

Pushpay kept PII and pastoral data out of external training signals and applied standard cloud controls (encryption, RBAC, logging and audit trails). Important trade-offs:

  • Human validation and dataset curation impose operational costs—expect labeling, reviewer time, and periodic revalidation as models change.
  • Judge-LLMs are fast but noisy; mitigate with rule-based checks and spot human audits.
  • Upgrading underlying LLMs requires regression testing against the golden dataset and controlled rollouts to detect behavior regressions early.

Limits and counterpoints

  • Not every application maps neatly to CRM-like, filterable domains. Less-structured domains (free-form reasoning, creative synthesis) require different evaluation strategies and may not benefit as directly from prompt caching and semantic filter selection.
  • LLM-as-a-judge simplifies automated evaluation but can inherit biases or be overconfident; combine it with deterministic checks and human validation for critical paths.
  • There’s an ongoing operational tax: maintaining the golden dataset, monitoring dashboards, and running human reviews are continuous investments, not one-time setup costs.

Quick playbook for building production-ready AI agents

  • Start with a representative golden dataset (seed ~300 queries) and expand with validated user queries.
  • Measure at the domain level and use Wilson score intervals to avoid overconfidence from small samples.
  • Use semantic search to pull minimal, relevant context instead of dumping all context into every prompt.
  • Implement prompt caching for stable prompt fragments to save tokens and reduce latency.
  • Run an LLM-as-a-judge for fast checks, but require human validation for flagged mismatches and critical domains.
  • Suppress underperforming categories in customer flows and roll out incrementally with clear rollback criteria.
  • Plan for regression tests and revalidation whenever you upgrade the underlying model.

Practical FAQ

How did Pushpay raise accuracy from ~60–70% to ~95% in production domains?

They combined a curated golden dataset, domain-level evaluation dashboards, and a controlled rollout strategy that suppressed low-performing categories while engineers prioritized fixes. Accuracy here is measured at the field level—structured outputs compared to expected values via an LLM judge plus human validation.

What role did Amazon Bedrock and Claude Sonnet 4.5 play?

Bedrock hosted Claude Sonnet 4.5 to generate structured JSON outputs and provided production features like prompt caching that cut tokens and lowered latency, enabling real-time AI search.

Why focus on domain-level visibility instead of aggregate scores?

Aggregate accuracy can mask critical failure modes. Domain-level metrics with confidence intervals reveal where errors cluster, so teams can prioritize work and control user exposure.

How is sensitive pastoral data protected?

PII and sensitive data are kept secure, excluded from external training, and governed under the AWS Shared Responsibility Model and Pushpay’s internal standards. Practical controls include encryption, RBAC, logging, and redaction of sensitive fields before model calls.

How do you keep improving after launch?

Continuously expand and validate the golden dataset with real queries, use the LLM-as-a-judge loop for automated checks, monitor domain-level metrics, and run controlled rollouts and regression tests when models or prompts change.

Final takeaway for product and engineering leaders

Agentic AI is not just a modeling exercise. Reliable systems ship when product-engineering disciplines (measurement, observability, governance, and controlled rollouts) meet modern LLM capabilities. Pushpay’s work with AWS on Amazon Bedrock shows that combining semantic search, prompt caching, a golden dataset, and domain-level dashboards can turn an unreliable prototype into a production-ready AI agent that’s fast, accurate, and safe for sensitive use cases.

If you want a copy of the sample JSON schema, a short architecture sketch, or a checklist to adapt this pattern to your domain, reach out to your AWS or internal AI engineering team to start a controlled pilot.