MACROS: Generative AI for Scalable Medical Content Review
Executive summary
- Problem: Thousands of medical articles need continual verification; manual review is slow, inconsistent, and risky for a consumer health product.
- Solution: MACROS (Medical Automated Content Review and Revision Optimization Solution) uses Amazon Bedrock and AWS services to extract rules from guidelines, scan content, flag non‑adherent text, and propose edits in Flo’s voice—while keeping clinicians in control.
- PoC results: processing speed exceeded the 10x target, recall for identifying content needing updates topped 90% (recall = proportion of true issues the system finds), and overall accuracy on PoC test sets was ~80% (accuracy = correct decisions ÷ total decisions).
- Reality check: AI reduced workload and improved consistency, but human validation, provenance, and regulatory safeguards remain essential before production rollout.
The problem, plain and simple
When a health content library spans thousands of articles and guidelines, keeping every sentence up to date is a full-time job. Manual review is expensive, slow, and inconsistent: different reviewers interpret guideline nuance differently, and updates can lag behind the latest evidence or regulatory guidance. That delay creates risk for users and liability for product teams.
How it works (in plain English)
- Read the source materials: MACROS pulls articles and the latest clinical guidelines into a central store.
- Turn rules into machine language: a Rule Optimizer extracts actionable rules (for medical correctness and editorial style) and ranks them by priority.
- Scan and split: content is filtered and chunked into manageable pieces so models can reason reliably.
- Review and propose fixes: foundation models check each chunk against the rules, flag issues, and suggest phrasing aligned with the brand voice.
- Human verification: editors and clinicians review flagged items and approve, edit, or reject suggested changes before they go live.
The MACROS approach
Flo Health and the AWS Generative AI Innovation Center built MACROS to automate the heavy lifting while preserving clinical oversight. At its core is the Rule Optimizer: it reads unstructured guidelines and converts them into prioritized, machine-actionable rules—tagged as medical or style-related, and given a high/medium/low priority. These rules guide the content review engine and feed into a structured output that downstream systems can consume.
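For illustration, a machine-actionable rule produced by a Rule Optimizer could be represented along these lines. This is a minimal sketch: the field names, identifiers, and example values are assumptions for readability, not Flo Health's production schema.

```python
from dataclasses import dataclass
from enum import Enum

class RuleType(str, Enum):
    MEDICAL = "medical"
    STYLE = "style"

class Priority(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class Rule:
    """One machine-actionable rule extracted from a guideline document."""
    rule_id: str          # stable identifier (hypothetical naming scheme)
    rule_type: RuleType   # medical correctness vs. editorial style
    priority: Priority    # drives routing: high-priority medical rules go to clinicians
    statement: str        # the actionable requirement, in plain language
    source_document: str  # guideline the rule was derived from, for provenance

example_rule = Rule(
    rule_id="SUPPLEMENT-EVIDENCE-01",  # hypothetical ID
    rule_type=RuleType.MEDICAL,
    priority=Priority.HIGH,
    statement="Claims about supplements must cite evidence and note safety caveats.",
    source_document="s3://guidelines/supplements-2024.pdf",  # placeholder path
)
```

Keeping the rule type and priority as explicit fields is what lets downstream steps route high-priority medical rules to clinicians while letting style rules flow through lighter review.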
“Flo Health used generative AI to make verification and maintenance of its extensive health library feasible at scale, because manual review is too slow and error-prone.” — Flo Health / AWS team (paraphrased)
Output is designed to be machine-readable and auditable: each flagged section returns a structured record (for example, section text, adherence flag, rule name, and reason). That makes it straightforward to integrate with a CMS API, show suggested edits in an editor UI, or maintain audit logs for compliance.
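Concretely, a flagged section might serialize to a record like the sketch below. The fields follow the ones named above (section text, adherence flag, rule name, reason); extras such as article_id and suggested_revision are illustrative assumptions.

```python
import json

# Sketch of the structured review output described above.
# Fields beyond those named in the text (e.g. "article_id",
# "suggested_revision") are illustrative assumptions.
flagged_section = {
    "article_id": "article-1234",
    "section_text": "Some studies suggest Supplement A helps with condition X, ...",
    "adherent": False,                      # adherence flag
    "rule_name": "SUPPLEMENT-EVIDENCE-01",  # rule the section failed
    "reason": "Directive claim without citation or safety caveat.",
    "suggested_revision": "Limited studies report benefits of Supplement A ...",
}

# Machine-readable output is what makes CMS integration, editor UIs,
# and audit logging straightforward.
print(json.dumps(flagged_section, indent=2))
```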
Architecture and tech stack
The system combines foundation models with a modular AWS stack that engineers can follow and extend; a minimal invocation sketch follows the component list. Key components include:
- Amazon S3 — canonical storage for articles, guideline PDFs, and results.
- Amazon Textract — extract text from scanned PDFs during ingestion.
- AWS Lambda — pre-processing, Rule Optimizer, and modular review/revision steps.
- AWS Step Functions — orchestrates the multi-step pipeline (current PoC); Bedrock Flows is being evaluated as a simpler orchestration path.
- Amazon Bedrock — hosts foundation models used for rule extraction, review, and revision.
- Amazon ECS (Streamlit UI) — editor-facing interface where medical experts validate suggestions.
- Amazon API Gateway and AWS IAM — secure programmatic access and authentication.
- Amazon CloudWatch — observability, logging, and metrics.
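To show how these pieces connect, here is a minimal sketch of one review step: a Lambda-style handler that pulls a content chunk from S3 and asks a Bedrock model to check it against a rule via the Converse API. The event shape, bucket and key names, model ID, prompt wording, and expected JSON reply are placeholders, not the actual MACROS implementation.

```python
import json
import boto3

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # placeholder; choose per tier

def handler(event, context):
    # The event carries the chunk location and the rule to check (shape assumed).
    obj = s3.get_object(Bucket=event["bucket"], Key=event["chunk_key"])
    chunk_text = obj["Body"].read().decode("utf-8")

    prompt = (
        "Check the following health-content excerpt against this rule.\n"
        f"Rule: {event['rule_statement']}\n"
        f"Excerpt: {chunk_text}\n"
        "Reply with JSON: {\"adherent\": bool, \"reason\": str}."
    )

    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0},
    )
    answer = response["output"]["message"]["content"][0]["text"]
    # Assumes the model returns bare JSON; a stricter prompt or output parser
    # would harden this in practice.
    return {"chunk_key": event["chunk_key"], "review": json.loads(answer)}
```

In a Step Functions pipeline, a handler like this would sit behind a Map state that fans out over chunks, with results written back to S3 for the reviewer UI.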
Model tiering and cost control
One pragmatic lever in MACROS is model tiering: lightweight models handle parsing and chunking while larger, higher-capability models perform reasoning and rewrite tasks. This mix reduces cost without sacrificing the quality of the most important decisions.
Concrete approach used in the PoC:
- Small models (for example, Haiku-family models on Bedrock) for token-efficient tasks like chunking and simple rule matching.
- Larger models (Sonnet/Opus families or equivalent) for complex clinical reasoning and style-preserving rewrites.
- Token and prompt optimization, caching, and routing to reduce unnecessary inference calls.
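A routing layer for this tiering can be as simple as a lookup from task type to model ID. The task names below are assumptions, and the model IDs are examples of the families mentioned rather than the exact models used in the PoC.

```python
# Minimal sketch of model tiering: cheap models for mechanical steps,
# stronger models for clinical reasoning and rewrites. IDs are examples only.
MODEL_TIERS = {
    "chunk": "anthropic.claude-3-haiku-20240307-v1:0",       # light parsing/chunking
    "rule_match": "anthropic.claude-3-haiku-20240307-v1:0",  # simple rule matching
    "review": "anthropic.claude-3-5-sonnet-20240620-v1:0",   # clinical reasoning
    "revise": "anthropic.claude-3-5-sonnet-20240620-v1:0",   # style-preserving rewrites
}

def model_for(task: str) -> str:
    """Pick the cheapest model that is adequate for the task."""
    return MODEL_TIERS[task]
```

Routing every inference call through one function keeps the cost levers (and future model swaps) in a single place; a cache in front of it can skip repeat calls on unchanged chunks.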
Proof-of-concept results — what the team measured
- Speed: Processing time per guideline dropped from hours to minutes in many cases, exceeding the 10x throughput target.
- Recall: Over 90% on PoC test sets (recall = proportion of true issues the system finds).
- Overall accuracy: ~80% on validation sets (accuracy = correct decisions divided by total decisions).
- Consistency: AI applied guidelines more consistently than many human reviewers, which reduces variance across content updates.
Those numbers signal strong promise, but they also expose the gap between a PoC and production-ready tooling: the dataset sizes, edge cases, and real-world distribution of content will determine final performance. Expect to iterate on test coverage, prompt engineering, and manual adjudication workflows to push accuracy higher.
Example: before and after (anonymized)
Original: “Some studies suggest Supplement A helps with condition X, so users should consider it.”
Flagged reason: The guideline requires a citation and safety cautions; the original language is directive and offers no supporting evidence.
Suggested revision: “Limited studies report benefits of Supplement A for condition X; discuss potential risks with a clinician and consult current guidelines [link to source].”
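For a sense of how a flag and a suggested revision like the ones above could be produced in one pass, here is a hedged prompt-template sketch. The wording, brand-voice description, and output keys are illustrative assumptions, not the production MACROS prompt.

```python
# Illustrative prompt template for a combined review-and-revise step
# (not the production prompt). The model is asked for a flag, a reason,
# and a rewrite that keeps the brand voice.
REVIEW_AND_REVISE_TEMPLATE = """\
You are reviewing consumer health content for guideline adherence.

Rule ({priority} priority, {rule_type}): {rule_statement}
Brand voice: supportive, plain language, no directive medical advice.

Text to review:
{section_text}

Return JSON with keys: adherent (bool), reason (str), suggested_revision (str).
If the text is adherent, suggested_revision should repeat it unchanged.
"""

prompt = REVIEW_AND_REVISE_TEMPLATE.format(
    priority="high",
    rule_type="medical",
    rule_statement="Claims about supplements must cite evidence and note safety caveats.",
    section_text="Some studies suggest Supplement A helps with condition X, "
                 "so users should consider it.",
)
```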
Lessons learned and best practices
- Standardize inputs early — canonicalize formats, normalize headings, and centralize guideline sources to reduce ingestion noise.
- Build diverse test sets — include long-form articles, short tips, and edge-case language to surface failure modes.
- Instrument everything — log rule IDs, model prompts, confidence scores, and reviewer decisions to create an audit trail; a sample audit record follows this list.
- Prioritize human-in-the-loop — use automatic edits only for low-risk style fixes; route high-priority medical changes to clinicians.
- Version control your rules and guidelines — track changes to the rule base and to the foundation models used for evaluation.
- Optimize for cost from day one — model tiering, prompt length control, and batching reduce token spend dramatically.
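To make "instrument everything" concrete, a per-decision audit record might capture something like the sketch below. The exact fields, IDs, and version tags are assumptions; the point is that every published edit can be traced back to the rule, prompt, model, and reviewer involved.

```python
from datetime import datetime, timezone

# Sketch of a per-decision audit record (fields are illustrative).
audit_record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "article_id": "article-1234",
    "rule_id": "SUPPLEMENT-EVIDENCE-01",
    "model_id": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # example ID
    "prompt_version": "review-and-revise/v3",                 # hypothetical version tag
    "model_decision": "non_adherent",
    "model_confidence": 0.87,           # if the pipeline produces a score
    "reviewer": "clinician-42",
    "reviewer_decision": "approved_with_edits",
}
```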
Risks, governance and mitigation
Generative AI introduces specific risks in healthcare contexts: hallucinations (confident but incorrect statements), false positives that waste reviewer time, and false negatives that miss clinically important errors. Regulatory and compliance concerns — HIPAA, consumer health claims, and auditability — require explicit controls.
- Provenance: capture the original source and rule ID for every suggested change so editors and auditors can trace decisions.
- Explainability: surface the reasoning behind a flag (rule matched, citation mismatch, outdated statistic) rather than opaque model output.
- Human sign-off: require clinician validation for any high-priority medical edits; use staged rollouts and A/B validation for lower-risk changes.
- Monitoring: track drift by comparing model outputs against newly labeled data over time, and re-run validations after every model update.
Checklist for teams starting a similar PoC
- Identify canonical guideline sources and obtain machine-readable copies.
- Assemble a labeled validation set (sample of articles + clinician annotations).
- Define acceptance metrics up front (recall target, acceptable precision, throughput goals); a small evaluation sketch follows this checklist.
- Decide which edits can be auto-applied vs. which require human review.
- Plan for logging, versioning, and audit trails from day one.
- Allocate multidisciplinary owners: content editors, an ML engineer, compliance/legal, and a product manager.
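To make the acceptance-metrics item concrete, the recall and accuracy figures quoted earlier reduce to simple counts over a labeled validation set. A minimal sketch, assuming each item carries a clinician label (needs update or not) and a system prediction:

```python
def evaluate(labels: list[bool], predictions: list[bool]) -> dict[str, float]:
    """Recall, precision, and accuracy over a labeled validation set.

    labels[i] is True if clinicians say the item needs an update;
    predictions[i] is True if the system flagged it.
    """
    tp = sum(l and p for l, p in zip(labels, predictions))
    fp = sum((not l) and p for l, p in zip(labels, predictions))
    fn = sum(l and (not p) for l, p in zip(labels, predictions))
    tn = sum((not l) and (not p) for l, p in zip(labels, predictions))
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # true issues found
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # flags that were real issues
        "accuracy": (tp + tn) / len(labels),              # correct decisions / total
    }

# Example: 10 validation items, clinician labels vs. system flags.
labels =      [True, True, True, False, False, True, False, True, False, False]
predictions = [True, True, False, False, True, True, False, True, False, False]
print(evaluate(labels, predictions))  # recall 0.8, precision 0.8, accuracy 0.8
```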
Key questions and short answers
Can generative AI verify and update medical content at scale?
Yes—PoC results show dramatically faster processing and high recall for issues that need updates, while human review remains essential for final validation.
What architecture supports such a system?
A modular AWS stack—S3, Textract, Lambda, Step Functions (or Bedrock Flows), Bedrock models, ECS UI, API Gateway, IAM and CloudWatch—lets teams combine generative AI with existing pipelines and observability.
How are models chosen and tasks decomposed?
Tasks follow filter → chunk → review → revise → post-process. Use smaller models for parsing and larger models for deep reasoning and rewrites to balance cost and performance.
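As an illustration of the filter → chunk steps, a simple chunker keeps each piece small enough for reliable model reasoning. The paragraph-based splitting and character budget below are assumptions standing in for the actual MACROS logic, which would typically budget by tokens.

```python
def chunk_article(text: str, max_chars: int = 4000) -> list[str]:
    """Split an article into paragraph-aligned chunks under a size limit.

    Small chunks make model reasoning more reliable and let the cheaper
    model tier handle the mechanical pass; max_chars is a stand-in for a
    token budget.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # filter: drop empty blocks
        if len(current) + len(paragraph) + 2 > max_chars and current:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks
```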
What’s next (and what to expect in Part 2)
- Production scaling: deployment topology, autoscaling, and cost modeling for sustained throughput.
- Operational controls: drift detection, model update policies, and rollback strategies.
- Auditability and regulatory alignment: building robust provenance, explainability artifacts, and compliance workflows.
- Orchestration choices: evaluating Bedrock Flows vs. Step Functions for simpler AI-native workflows.
MACROS demonstrates that generative AI can move the needle on medical content maintenance: faster updates, more consistent application of guidelines, and reduced expert workload. The payoff isn’t automatic—success hinges on careful rules engineering, model selection, auditability, and a culture that keeps clinicians central to final decisions. If your organization publishes health content at scale, start by defining your acceptance metrics and rule taxonomy; the rest is engineering and governance.
Want Part 2? Follow for the next installment covering production rollout, cost modeling, monitoring strategy, and real-world lessons from scaling MACROS.