From Docs to Answers: How PDI Built an Enterprise RAG on AWS with Serverless and Bedrock

Turning corporate memory into value: How PDI built an enterprise RAG system on AWS

TL;DR

  • PDI Intelligence Query (PDIQ) is an enterprise Retrieval‑Augmented Generation (RAG) system that turns scattered documents, images and wikis into searchable vectors so LLMs can answer real business questions.
  • Built with a serverless stack on AWS and Amazon Bedrock models, PDIQ uses targeted models for captioning, summarization, embeddings and generation, plus a 70/10/20 chunking heuristic and cached image captions to boost relevance and cut costs.
  • Operational wins: faster support resolution, improved customer satisfaction signals, and lower operational overhead via automation. Next steps include agentic flows, table extraction, multilingual indexing and hybrid retrieval.

Why an enterprise RAG system matters for AI for business

When knowledge is distributed across Confluence pages, SharePoint folders, Azure DevOps tickets, PDFs and the heads of senior engineers, teams spend time searching instead of solving. A Retrieval‑Augmented Generation (RAG) system makes that buried knowledge usable by LLMs: it finds relevant evidence, surfaces it with context, and lets a generative model answer questions grounded in company data. For leaders focused on AI automation and measurable business outcomes, a production RAG system is rarely about flashy generation demos; it’s about searchable trust, repeatable workflows and secure scale.

How PDI approached the problem

PDI Technologies—40 years in convenience retail and fuel distribution—built PDI Intelligence Query (PDIQ) to give employees a chat interface that returns LLM‑driven answers supported by internal documentation. The goal was pragmatic: improve time‑to‑resolution for support teams, increase CSAT/NPS trends, and automate repetitive knowledge lookups without exposing sensitive data.

“PDIQ gives employees access to company knowledge through an easy-to-use chat interface, powered by a custom RAG system on AWS.”

PDIQ: high-level architecture (serverless RAG on AWS)

The design favors modular, serverless primitives so each piece can scale independently and be swapped later. Key components:

  • Ingestion & scheduling: Amazon EventBridge schedules crawls; Amazon ECS runs crawlers (Puppeteer + turndown for web scraping; REST connectors for Confluence, SharePoint, Azure DevOps).
  • Storage & events: Amazon S3 stores raw and transformed content; Amazon SNS and SQS carry events between stages and decouple ingestion from processing.
  • Orchestration: AWS Lambda runs chunking, summarization orchestration and embedding jobs (a minimal embedding sketch follows this list).
  • Metadata & caching: Amazon DynamoDB stores crawler configurations, image metadata and cached image captions to avoid repeat model calls.
  • Vector store: Amazon Aurora PostgreSQL‑Compatible Serverless is used as the vector store for similarity search and metadata.
  • Models: Amazon Bedrock supplies foundation models—Nova Lite for image captions, Nova Micro for summarization, Titan Text Embeddings V2 for embeddings, and Nova Pro for generation.
  • Security: Zero‑trust with role‑based access (Cognito + enterprise SSO), encrypted crawler credentials via AWS KMS, and application‑layer permission checks.
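To make the flow concrete, here is a minimal sketch of the kind of processing job Lambda would run, assuming an SQS message that points at a transformed document in S3 and the Titan Text Embeddings V2 model on Bedrock. Bucket, key and helper names are illustrative, not PDIQ's actual code.

```python
import json
import boto3

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")

def handler(event, context):
    """Triggered by SQS: fetch transformed content from S3, embed it, and hand it off for storage."""
    for record in event["Records"]:
        msg = json.loads(record["body"])                  # e.g. {"bucket": "...", "key": "..."}
        obj = s3.get_object(Bucket=msg["bucket"], Key=msg["key"])
        text = obj["Body"].read().decode("utf-8")

        # Embed with Titan Text Embeddings V2 on Bedrock.
        resp = bedrock.invoke_model(
            modelId="amazon.titan-embed-text-v2:0",
            body=json.dumps({"inputText": text[:8000]}),  # crude character cap to stay within the input limit
        )
        embedding = json.loads(resp["body"].read())["embedding"]

        store_vector(msg["key"], text, embedding)         # illustrative: write into the Aurora vector table

def store_vector(doc_id: str, text: str, embedding: list[float]) -> None:
    ...  # omitted: SQL INSERT into Aurora PostgreSQL
```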

Key engineering decisions that moved the needle

Three practical choices produced outsized returns on relevance, latency and cost.

1) Make images searchable, and cache the captions

Images are often the missing context in documentation. PDIQ generates captions (Nova Lite), injects them into the corresponding markdown, and stores the captions in DynamoDB. The benefit is twofold: images become first‑class text assets for retrieval, and repeated image requests don’t trigger fresh model calls—saving inference cost and cutting latency.
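A minimal sketch of that cache-then-caption flow, assuming a DynamoDB table keyed on a hash of the image bytes and the Bedrock Converse API for Nova Lite. The table name and prompt are illustrative.

```python
import hashlib
import boto3

dynamodb = boto3.resource("dynamodb")
bedrock = boto3.client("bedrock-runtime")
captions = dynamodb.Table("image-captions")   # hypothetical table, keyed on image_hash

def caption_image(image_bytes: bytes, fmt: str = "png") -> str:
    """Return a caption for the image, reusing the cached one when the bytes are unchanged."""
    image_hash = hashlib.sha256(image_bytes).hexdigest()

    cached = captions.get_item(Key={"image_hash": image_hash}).get("Item")
    if cached:
        return cached["caption"]              # cache hit: no model call

    # Cache miss: ask Nova Lite for a concise caption via the Converse API.
    resp = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": fmt, "source": {"bytes": image_bytes}}},
                {"text": "Describe this image in one or two sentences for documentation search."},
            ],
        }],
    )
    caption = resp["output"]["message"]["content"][0]["text"]

    captions.put_item(Item={"image_hash": image_hash, "caption": caption})
    return caption
```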

2) Chunking with summary‑prepend: the 70/10/20 heuristic

Think of chunking like index cards: each card contains local detail, a little overlap with the next card, and a short summary at the top so the reader (the LLM) understands the bigger picture before diving into specifics. PDIQ allocates ~70% of tokens to chunk content, 10% to overlap between chunks, and 20% to a document summary that’s prepended to every chunk. That prepended summary supplies document‑level context while keeping each chunk concise enough for similarity search and generation.

“Chunking uses a 70% content, 10% overlap, 20% summary token split; summaries are prepended to each chunk so the LLM has document-level context.”
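A minimal sketch of the 70/10/20 split, using a whitespace tokenizer stand-in and a pre-computed document summary; a production version would count model tokens rather than words.

```python
def chunk_with_summary(document: str, summary: str, budget: int = 1000) -> list[str]:
    """Split a document so ~70% of the token budget is chunk content, ~10% overlaps the
    previous chunk, and ~20% is a document summary prepended to every chunk."""
    content_tokens = int(budget * 0.7)
    overlap_tokens = int(budget * 0.1)
    summary_tokens = int(budget * 0.2)

    summary = " ".join(summary.split()[:summary_tokens])  # trim the summary to its share of the budget
    words = document.split()                              # crude whitespace tokenizer stand-in

    chunks, start = [], 0
    while start < len(words):
        end = min(start + content_tokens, len(words))
        chunks.append(summary + "\n\n" + " ".join(words[start:end]))  # document-level context first
        if end == len(words):
            break
        start = end - overlap_tokens                      # carry a small overlap into the next chunk
    return chunks
```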

3) Task‑specific models for cost and quality

Rather than using one large model for every step, PDIQ matches models to microtasks: a lightweight model for captions, a summarization model tuned for concise context, a dedicated embeddings model, and a generation model for answers. This segmentation keeps costs predictable and keeps tuning localized to each task.
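One lightweight way to express that routing is a plain task-to-model map. The structure itself is only an illustration; the values are the public Bedrock identifiers for the models named above.

```python
# Each pipeline step calls Bedrock with the model sized for that microtask.
MODEL_FOR_TASK = {
    "caption":   "amazon.nova-lite-v1:0",         # image captions: cheap, high volume
    "summarize": "amazon.nova-micro-v1:0",        # concise document summaries
    "embed":     "amazon.titan-embed-text-v2:0",  # vector embeddings for retrieval
    "generate":  "amazon.nova-pro-v1:0",          # final grounded answer generation
}
```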

Measured impact and how improvement was validated

During pilot and early rollout, PDIQ tracked both human QA and live user approval signals. On retrieved answers, approval/accuracy rose from roughly 60% to 79% after implementing caption caching and the summary‑prepend chunking strategy. That uplift came from A/B testing with pilot users plus human‑in‑the‑loop validation on a representative QA set—the kind of mixed evaluation that balances real user feedback with controlled quality checks.

Business outcomes observed include faster support resolution times, improved customer satisfaction signals, and lower operational load thanks to automation and a serverless footprint. Each new connector added incremental value because the same pipeline indexed the content without significant reengineering.

Operational considerations: cost, latency, governance and monitoring

Building a production RAG system is half architecture and half operations. Here are practical concerns and mitigation strategies.

Cost control

  • Cache repeatable LLM outputs (image captions, stable summaries) in DynamoDB or object metadata to reduce inference calls.
  • Use task‑appropriate models: smaller models for high‑volume microtasks (captions, summaries), reserving larger models for final generation.
  • Prefer serverless components (Lambda, Aurora Serverless) for variable workloads; evaluate reserved capacity for steady production loads.

Latency & freshness

  • EventBridge schedules periodic crawls; use incremental crawls for high‑change sources and event‑triggered ingestion for critical updates.
  • Measure latency budgets for embedding generation and Bedrock calls; cache embeddings for unchanged documents where feasible (a content‑hash check is sketched below).
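One simple way to detect unchanged documents is to store a content hash next to each embedding and compare it on every crawl; a minimal sketch:

```python
import hashlib

def needs_reembedding(text: str, stored_hash: str | None) -> tuple[bool, str]:
    """Compare the document's current content hash with the one stored when it was last embedded."""
    current_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return stored_hash != current_hash, current_hash
```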

Monitoring, evaluation & drift

  • Track retrieval precision/recall, user approval rates, hallucination incidents, and model latency over time (see the metrics sketch after this list).
  • Implement human‑in‑the‑loop triage for low‑confidence answers and a feedback loop that reindexes corrected content.
  • Automate periodic re‑embedding and re‑summarization as documents change to reduce drift and staleness.
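Custom CloudWatch metrics are one low-friction way to track those signals; a minimal sketch, with a hypothetical namespace and metric names:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_answer_feedback(approved: bool, latency_ms: float) -> None:
    """Push per-answer approval and latency as custom metrics for dashboards and drift alarms."""
    cloudwatch.put_metric_data(
        Namespace="PDIQ/RAG",  # hypothetical namespace
        MetricData=[
            {"MetricName": "AnswerApproved", "Value": 1.0 if approved else 0.0, "Unit": "Count"},
            {"MetricName": "AnswerLatency", "Value": latency_ms, "Unit": "Milliseconds"},
        ],
    )
```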

Security & compliance

  • Enforce zero‑trust: authenticate with Cognito + corporate SSO, authorize at the app layer before returning any content snippets.
  • Encrypt crawler credentials and PII at rest using KMS; maintain audit trails for content access and model queries for regulatory needs (a brief KMS sketch follows this list).
  • Segment index scopes by team or data classification to avoid accidental data leakage between business units.
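Encrypting a crawler credential with a customer-managed KMS key is a couple of calls; a minimal sketch, where the key alias is hypothetical:

```python
import base64
import boto3

kms = boto3.client("kms")

def encrypt_credential(plaintext: str) -> str:
    """Encrypt a crawler secret with KMS so only the ingestion role can decrypt it."""
    resp = kms.encrypt(
        KeyId="alias/pdiq-crawler-credentials",  # hypothetical key alias
        Plaintext=plaintext.encode("utf-8"),
    )
    return base64.b64encode(resp["CiphertextBlob"]).decode("ascii")

def decrypt_credential(ciphertext_b64: str) -> str:
    """Decrypt a stored credential at crawl time."""
    resp = kms.decrypt(CiphertextBlob=base64.b64decode(ciphertext_b64))
    return resp["Plaintext"].decode("utf-8")
```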

Aurora Serverless as a vector store: pros, cons and when to switch

Using Amazon Aurora Serverless for vector storage and similarity search has benefits: familiarity for SQL teams, transactional consistency, and predictable integration with existing AWS tooling. However, purpose‑built vector databases (Milvus, Pinecone, Qdrant, Weaviate) often offer faster nearest neighbor search at scale, advanced approximate nearest neighbor (ANN) algorithms, and simpler horizontal scaling for billions of vectors.
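On the Aurora side, similarity search with the pgvector extension is plain SQL; a minimal sketch, assuming a `chunks` table with `content` and `embedding vector` columns and the psycopg driver:

```python
import psycopg

def top_k_chunks(conn: psycopg.Connection, query_embedding: list[float], k: int = 5):
    """Return the k chunks nearest to the query embedding by cosine distance (pgvector's <=> operator)."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, embedding <=> %s::vector AS distance
            FROM chunks
            ORDER BY distance
            LIMIT %s
            """,
            (vector_literal, k),
        )
        return cur.fetchall()
```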

Consider staying on Aurora when:

  • Vector volume is moderate and teams prefer consolidated tooling.
  • Transactional joins between vectors and relational metadata are common.

Consider migrating when:

  • Latency and CPU cost of ANN queries grow as vector counts climb.
  • Advanced vector features (index snapshots, dynamic reindexing, hybrid retrieval plugins) become necessary.

Roadmap highlights: where to invest next

  • Agentic AI/AI agents: enable multi‑step flows that can fetch, validate and act on data—add guardrails, step limits and explainability logs.
  • Table extraction & structure preservation: preserve tabular structure to answer spreadsheet or invoice queries accurately.
  • Multilingual indexing: add language detection and translation layers for global teams.
  • Hybrid retrieval: combine embeddings with lexical signals (BM25) or learned re‑rankers for better precision on short queries (a fusion sketch follows this list).
  • Event‑triggered updates: push critical changes into the index immediately instead of waiting for scheduled crawls.
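As one example of the hybrid retrieval item above, reciprocal rank fusion (RRF) merges a lexical ranking and a vector ranking without having to reconcile their score scales; a minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs (e.g. BM25 and vector search)
    by summing 1 / (k + rank) for each document across the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```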

Common questions leaders ask

  • What is a RAG system?

    RAG (Retrieval‑Augmented Generation) combines a retrieval layer (searching documents and embeddings) with a generative model that produces answers grounded in retrieved content.

  • Why cache image captions?

    Caching prevents repeated inference on identical images. That saves money and ensures faster retrieval by turning images into searchable text only once.

  • When should we use Aurora vs a vector DB?

    Start with Aurora for moderate scale and tight integration with existing SQL workloads; move to a purpose‑built vector DB when vector count, latency needs, or ANN features exceed Aurora’s sweet spot.

  • How do you measure improvement?

    Combine human QA on a representative test set with live user approval signals, time‑to‑resolution metrics, and CSAT/NPS trends to capture quality and business outcomes.

Checklist for CIOs and CTOs evaluating RAG readiness

  • Map high‑value knowledge sources (support docs, runbooks, FAQs) and prioritize one or two for a 6–12 week pilot.
  • Assess data sensitivity and classification—design access scopes before indexing starts.
  • Define success metrics: time‑to‑resolution, QA approval rate, cost per query, CSAT change.
  • Choose model and cost guardrails: which tasks use small vs large models; where to cache outputs.
  • Plan human‑in‑the‑loop workflows for low‑confidence queries and governance escalation.

Actionable next steps

  1. Run a 6‑week pilot on a single high‑impact source (support knowledge base or runbooks) and measure before/after time‑to‑resolution and user approval rates.
  2. Implement caption caching for images and the 70/10/20 chunking heuristic to validate relevance improvements quickly.
  3. Instrument monitoring for precision/recall, inference cost, latency and hallucination incidents; configure alerts for drift.
  4. Plan for governance: SSO integration, KMS key management, audit logs and scoped indexing by data classification.
  5. Design an agentic workflow pilot with explicit safety checks and explainability traces before broad rollout.

Modern enterprise RAG systems are engineering exercises married to operational discipline. Small, well‑chosen engineering moves—task‑specific models, caching, and summary‑prepend chunking—unlock large improvements in relevance and cost. For leaders, the strategic play is simple: start focused, measure impact, bake governance in from day one, and expand with agentic automation where the ROI is clear.

Ready to assess a pilot? Start with one high‑value knowledge source, instrument the right metrics, and run a 6‑week cycle to prove value and refine governance before scaling.