Amazon Self-Learning Generative AI for Product Catalogs: Disagreement-as-Signal at Scale

When a system receives millions of product listings every day, static AI models quickly become brittle. Amazon’s Catalog team flipped the script by treating model disagreements not as failures but as the richest learning signal, creating a self‑learning generative AI pipeline that improves accuracy while cutting operational cost. The elevator summary: many small models handle routine cases; disagreements trigger a smarter supervisor agent that harvests context and writes reusable rules into a compact knowledge base; those rules are injected back into prompts so the system improves without retraining large models.

The pattern at a glance

  • Multi-model architecture: many lightweight worker models (generators + evaluators) run in parallel to reach cheap consensus.
  • Disagreement-as-signal: worker-to-worker disagreements flag cases likely to need deeper reasoning or human review.
  • Supervisor agent: a higher-capability AI (or hybrid human+AI) resolves disputes, pulls broader context, and writes generalized learnings.
  • Hierarchical knowledge base: learnings are stored with provenance and metadata, then surfaced to worker prompts via prompt injection for future inferences.
  • Continuous improvement loop: declining disagreement rate becomes the north-star metric showing learning effectiveness.

Think of it like a factory: small machines (worker models) assemble most parts quickly; when a part doesn’t fit, a skilled inspector (supervisor) examines the whole context, writes a fix, and adds that fix to the instruction manual so the assembly line handles it correctly next time.
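
A minimal sketch of that consensus check, assuming each worker proposes a single value for an attribute; the function name and threshold here are illustrative, not Amazon's implementation:

  from collections import Counter

  def check_consensus(candidates, min_agreement=0.75):
      """Decide whether worker outputs agree strongly enough to accept cheaply.

      candidates: attribute values proposed by worker models, e.g. ["navy", "navy", "blue"].
      Returns (accepted_value, needs_escalation).
      """
      counts = Counter(candidates)
      top_value, top_count = counts.most_common(1)[0]
      agreement = top_count / len(candidates)
      if agreement >= min_agreement:
          return top_value, False   # consensus: accept the cheap answer at scale
      return None, True             # disagreement: route the case to the supervisor

  # Example: two workers say "navy", one says "blue" -> 0.67 agreement, escalate.
  value, escalate = check_consensus(["navy", "navy", "blue"])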

Step-by-step: from disagreement to learning

  1. Workers generate and evaluate: multiple small models produce candidate attributes (e.g., title, color, size) and paired evaluators score each candidate.
  2. Consensus accepted: when workers agree, the output is accepted cheaply and at scale.
  3. Disagreement triggers escalation: when outputs diverge beyond a threshold, the case is routed to a supervisor agent.
  4. Supervisor gathers richer context: the supervisor fetches seller inputs, reviews, return history, images, and appeals to determine the correct result.
  5. Supervisor synthesizes a learning: the supervisor converts its judgment into a short, general rule or prompt refinement, with provenance and a confidence score (see the sketch after this list).
  6. Knowledge base stores the learning: hierarchical storage keeps the learning, metadata (who approved it, when, and on what evidence), and expiry/retirement rules.
  7. Prompt injection for reuse: relevant learnings are injected into worker prompts or retrieval cues so small models behave as if specialized—without retraining.
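
A sketch of steps 3 through 6, assuming a supervisor client that returns a judgment plus a generalized rule; the record fields, the supervisor.resolve call, and the knowledge_base.store call are assumptions for illustration:

  from dataclasses import dataclass, field
  from datetime import datetime, timedelta

  @dataclass
  class Learning:
      rule_text: str      # short, general rule the supervisor wrote
      trigger: str        # conditions under which workers should apply it
      category: str       # scope, e.g. a product category
      confidence: float   # supervisor's confidence in the rule
      evidence: list      # listing ids, images, return history consulted
      created_at: datetime = field(default_factory=datetime.utcnow)
      expires_at: datetime = field(default_factory=lambda: datetime.utcnow() + timedelta(days=180))

  def escalate(item, candidates, supervisor, knowledge_base):
      """Resolve a worker disagreement and capture the learning for reuse."""
      context = {
          "seller_inputs": item["seller_inputs"],
          "reviews": item.get("reviews", []),
          "returns": item.get("returns", []),
          "images": item.get("images", []),
      }
      judgment = supervisor.resolve(item, candidates, context)   # assumed supervisor API
      learning = Learning(
          rule_text=judgment["rule_text"],
          trigger=judgment["trigger"],
          category=item["category"],
          confidence=judgment["confidence"],
          evidence=[item["listing_id"]],
      )
      if learning.confidence >= 0.8:       # only persist high-confidence rules
          knowledge_base.store(learning)   # assumed store that keeps provenance
      return judgment["resolved_value"]

Persisting only high-confidence learnings keeps the knowledge base clean; lower-confidence judgments can still resolve the single item without becoming a reusable rule.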

Worked example: a color conflict that became a rule

Scenario: A seller lists a shirt as “navy” while the title says “blue” and the product images appear dark. Two worker models disagree—one maps to “blue”, another to “navy”. Disagreement flags escalation.

The supervisor inspects the listing, images, and prior returns for color mismatch complaints. It decides the correct attribute is “navy blue” for search relevance. The supervisor writes a compact learning such as:

If the seller-supplied color is “navy” but the title contains “blue” and the image histogram indicates dark blue, normalize to “navy blue”. Trigger only in the men’s shirts category and require confidence >= 0.8.

The learning is stored with provenance (listing id, date, evidence), routed to the knowledge base, and injected into worker prompts for future men’s shirt listings. Over time that rule prevents the same disagreement from recurring.
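
Expressed as a knowledge-base entry, that learning might look like the sketch below; the field names, placeholder listing id, and trigger check are illustrative, not Amazon's actual schema:

  color_rule = {
      "rule_text": 'Normalize seller color "navy" to "navy blue" when the title says "blue" '
                   "and the image histogram indicates dark blue.",
      "trigger": {
          "seller_color": "navy",
          "title_contains": "blue",
          "image_histogram": "dark_blue",
          "category": "mens_shirts",
      },
      "min_confidence": 0.8,
      "provenance": {
          "listing_id": "example-listing-id",   # hypothetical; real entries link the actual listing
          "created": "2024-06-01",              # illustrative date
          "evidence": ["seller_input", "images", "return_history"],
      },
  }

  def rule_applies(rule, listing):
      """Cheap trigger check a worker can run before applying the normalization."""
      t = rule["trigger"]
      return (
          listing["seller_color"] == t["seller_color"]
          and t["title_contains"] in listing["title"].lower()
          and listing["category"] == t["category"]
      )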

Why disagreement is the high-value signal

Disagreements concentrate edge cases and evolving language. Rather than waste cycles labeling every possible edge case, the system focuses human or higher-capability compute only where worker models disagree. That makes learning targeted, auditable, and cost‑efficient.

“Rather than treating model disagreements as failures, we treat them as the highest-value learning signals.”

Measuring success: the metrics that matter

Primary health metric:

  • Disagreement rate: fraction of items that require supervisor escalation. A declining trend indicates successful knowledge capture and prompt injection (a computation sketch follows the metrics lists).

Additional operational metrics:

  • Supervisor invocation rate (per million items)
  • Manual review volume and average review time
  • Downstream quality: returns attributable to incorrect attributes, conversion lift from improved titles, and search discovery metrics
  • Cost per inference and cost per corrected error
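
A minimal sketch of how these metrics can be computed from per-window telemetry counters; the counter names and example numbers are illustrative:

  def health_metrics(telemetry):
      """Compute the core learning-loop metrics from simple counters.

      telemetry: counts collected over a reporting window, e.g.
      {"items": 1_000_000, "escalations": 12_000, "supervisor_calls": 11_500,
       "manual_reviews": 900, "inference_cost": 4200.0, "corrected_errors": 10_300}
      """
      items = telemetry["items"]
      return {
          # primary health metric: share of items that needed escalation
          "disagreement_rate": telemetry["escalations"] / items,
          # supervisor invocations per million items
          "supervisor_per_million": telemetry["supervisor_calls"] / items * 1_000_000,
          "manual_review_volume": telemetry["manual_reviews"],
          "cost_per_corrected_error": telemetry["inference_cost"] / max(telemetry["corrected_errors"], 1),
      }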

Representative production example (anonymized): disagreement rate fell from ~4% to ~1% over several months after learnings were injected; supervisor invocations dropped by roughly 70%, concentrating human reviews on the rarest, highest-value cases. Those numbers will vary by catalog complexity and initial model performance, but the pattern—fewer escalations as learnings accumulate—holds.

When to use this pattern (and when not to)

  • Ideal: high-volume, rapidly evolving domains where language and seller behavior drift (product catalogs, marketplaces, user-generated content moderation).
  • Less ideal: low-volume tasks, systems governed exclusively by static rules, or high-stakes domains (medical or legal) that require rigorous certification—unless additional governance layers are added.

Implementation options: vendor-neutral blueprint

Core components and roles:

  • Worker models: many small generator models plus lightweight evaluators. Keep these cheap and fast for scale.
  • Supervisor agent: a larger model or hybrid human+AI workflow that resolves disagreements and authoritatively creates learnings.
  • Knowledge base: hierarchical, versioned store with provenance, confidence, and TTL for each learning.
  • Retrieval & prompt injection: mechanism to surface relevant learnings at inference time and merge them safely into prompts (sketched after this list).
  • Human-in-the-loop: review queues for high-risk learnings or a staged rollout process.
  • Observability: telemetry for disagreement rates, supervisor load, and downstream KPIs, plus alerts for unusual drift.
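
A sketch of the retrieval-and-injection component, assuming learnings are tagged by category and carry a confidence score and an expiry timestamp; the filtering logic and prompt template are illustrative:

  from datetime import datetime

  def select_learnings(learnings_by_category, category, max_rules=5):
      """Pull the active, highest-confidence learnings scoped to this category."""
      now = datetime.utcnow()
      active = [
          entry for entry in learnings_by_category.get(category, [])
          if entry["expires_at"] > now and entry["confidence"] >= 0.8
      ]
      return sorted(active, key=lambda e: e["confidence"], reverse=True)[:max_rules]

  def build_worker_prompt(base_prompt, listing_text, learnings):
      """Merge retrieved learnings into the worker prompt as guidance."""
      guidance = "\n".join(f"- {entry['rule_text']}" for entry in learnings)
      return (
          f"{base_prompt}\n\n"
          f"Apply these catalog rules when relevant:\n{guidance}\n\n"
          f"Listing:\n{listing_text}"
      )

Capping the number of injected rules keeps worker prompts short and cheap; ranking by confidence decides which learnings earn the space.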

Operational playbook highlights:

  • Start with conservative disagreement thresholds to avoid over-escalation.
  • Require a minimum supervisor confidence and at least one human audit before wide injection in high-risk categories.
  • Track provenance to allow rapid rollback of any problematic learning.
  • Automate retirement of stale rules and periodic revalidation against sampled live traffic (one approach is sketched below).
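
One way the retirement step could be automated, assuming each entry carries an expiry timestamp and a revalidation hook that re-checks the rule against sampled live traffic (both are assumptions here):

  from datetime import datetime, timedelta

  def sweep_stale_rules(entries, revalidate, grace_days=30):
      """Retire expired rules; extend rules that still hold on sampled live traffic.

      entries: knowledge-base entries with an "expires_at" datetime field.
      revalidate: callable that re-checks a rule against sampled listings
                  and returns True if it still holds (assumed hook).
      """
      now = datetime.utcnow()
      kept, retired = [], []
      for entry in entries:
          if entry["expires_at"] > now:
              kept.append(entry)
          elif revalidate(entry):
              entry["expires_at"] = now + timedelta(days=grace_days)  # extend on success
              kept.append(entry)
          else:
              retired.append(entry)   # drop from injection, keep provenance for audit
      return kept, retired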

AWS-specific stack (how Amazon’s Catalog team implemented it)

  • Amazon Bedrock: multi-model access to run both efficient and larger foundation models.
  • Bedrock AgentCore: agent runtime, memory management, and observability for supervisor logic and injected memory.
  • EC2 (GPU, CPU instances): hosting for smaller open-source worker models and throughput scaling.
  • DynamoDB: hierarchical knowledge store to persist learnings, provenance, and routing metadata.
  • SQS: human-review queues and workflow integration.
  • CloudWatch: observability, dashboards, and alarms for disagreement and supervisor metrics.

These tools make it practical to run a continuous improvement loop at web scale while keeping learnings auditable and traceable; a minimal wiring sketch follows.
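
A sketch of the storage and review hooks on this stack using boto3; the table name, queue URL, and metric namespace are placeholders, not Amazon's internal resources:

  import json
  from decimal import Decimal

  import boto3

  dynamodb = boto3.resource("dynamodb")
  sqs = boto3.client("sqs")
  cloudwatch = boto3.client("cloudwatch")

  LEARNINGS_TABLE = dynamodb.Table("catalog-learnings")  # placeholder table name
  REVIEW_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/human-review"  # placeholder

  def persist_learning(learning: dict, needs_human_review: bool) -> None:
      """Write a learning with its provenance to DynamoDB; route risky ones to human review."""
      # DynamoDB rejects floats, so store confidence as Decimal; dates are assumed ISO strings.
      item = {**learning, "confidence": Decimal(str(learning["confidence"]))}
      LEARNINGS_TABLE.put_item(Item=item)
      if needs_human_review:
          sqs.send_message(QueueUrl=REVIEW_QUEUE_URL, MessageBody=json.dumps(learning, default=str))

  def emit_disagreement_rate(rate_percent: float) -> None:
      """Publish the disagreement rate so CloudWatch dashboards and alarms can track the trend."""
      cloudwatch.put_metric_data(
          Namespace="CatalogLearning",  # placeholder namespace
          MetricData=[{"MetricName": "DisagreementRate", "Value": rate_percent, "Unit": "Percent"}],
      )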

Governance, adversarial risk, and failure modes

Turning production signals into rules introduces risks: noisy or biased feedback, adversarial manipulation by bad actors, and supervisor hallucinations. Mitigations include:

  • Provenance and confidence: store evidence links, timestamps, and supervisor confidence for every learning.
  • Guardrails for ingestion: require multiple independent confirmations or human sign-off for learnings that affect pricing, safety, or regulated categories.
  • Rate limits and provenance-based throttles: prevent a single seller or cohort from poisoning the knowledge base.
  • Staged rollouts: test new rules in a canary subset, measure downstream impact, and delay broad injection until validated.
  • Automated drift detection: alerts for sudden spikes in disagreement rates or changes in rule invocation patterns (a simple check is sketched below).
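
A simple version of that drift check, comparing the current window's disagreement rate against a trailing baseline; the threshold is illustrative:

  from statistics import mean

  def disagreement_spike(history, current, max_ratio=1.5, min_baseline=1e-6):
      """Flag a sudden spike in disagreement rate.

      history: recent per-window disagreement rates (the baseline).
      current: this window's disagreement rate.
      Returns True when the current rate exceeds the baseline by max_ratio.
      """
      baseline = max(mean(history), min_baseline)
      return current / baseline > max_ratio

  # Example: a jump from ~1% to 2.2% trips the alert.
  alert = disagreement_spike([0.011, 0.010, 0.009], 0.022)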

Periodic audits should be scheduled (weekly for high-risk categories, monthly for general catalog rules) and every learning should carry a human-readable rationale so audits are efficient.

Practical one-page implementation checklist

  • Roles: ML engineers, MLOps, applied scientists, product owners, governance reviewers, and a human review team.
  • Data sources: listings, images, reviews, returns, appeals, historical manual labels.
  • Initial thresholds: disagreement threshold (start conservative), supervisor confidence cutoff (e.g., 0.8), and rule expiry policy (e.g., 180 days unless revalidated).
  • Knowledge base schema: rule text, trigger conditions, evidence links, provenance (who/what produced it), confidence, category tags, creation date, expiry date.
  • Audit workflow: staging environment, human sample audits, approval gates for high-risk categories, rollback process.
  • Observability: dashboards for disagreement rate, supervisor load, rule invocation counts, conversion/return deltas.
  • Security: rate limits on rule creation, provenance checks, and access controls for editing knowledge base entries.

Common questions

How can generative AI improve over time without full model retraining?

By extracting generalized learnings from supervisor-resolved escalations, storing them with provenance in a hierarchical knowledge base, and injecting the relevant entries into worker prompts, the system improves continuously without retraining heavyweight models.

How do you balance cost and capability at scale?

Run many small, efficient worker models for routine consensus and invoke a more expensive supervisor agent selectively when disagreements indicate complex cases—saving compute while capturing high-value signals.

What operational metric signals the system is learning?

The declining worker-to-worker disagreement rate is the primary health metric—it shows fewer escalations and that learnings are taking hold.

How do you keep the knowledge base trustworthy?

Introduce governance: human audits, staged “learn-then-deploy” flows for high-risk categories, provenance tracking, confidence thresholds, and safeguards against noisy or adversarial inputs.

Final note and next step

This pattern turns production disagreements into a profitable learning engine: a scalable way to specialize general-purpose models for messy, high-volume domains without constant retraining. For teams building AI automation and AI agents, it’s a pragmatic architecture that converts operational friction into institutional knowledge.

If a one-page implementation checklist or a tailored rollout playbook would help your team get started, a compact, actionable version can be provided on request, mapping roles, thresholds, telemetry, and governance controls to your catalog and risk profile.