Hapag‑Lloyd uses Amazon Bedrock to automate customer feedback and speed product decisions

How Hapag‑Lloyd turned customer feedback into actionable product wins with Amazon Bedrock

Short version: Hapag‑Lloyd replaced a slow, manual feedback process with a generative AI automation pipeline built on Amazon Bedrock, OpenSearch, LangChain and serverless AWS. The system processes about 15,000 feedback items per month, classifies sentiment with ~95% accuracy on a labeled test set, and turns weeks‑long decision cycles into days, driving prioritized product changes and measurable reductions in negative feedback.

From CSVs to conversational insight: the business problem

Hapag‑Lloyd runs 313 container ships, operates across 140 countries and receives a steady stream of unstructured customer comments. Until recently, that voice‑of‑customer pipeline arrived as biweekly CSV exports. Product teams spent hours poring over the text to find recurring pain points, time that could have gone to strategy and product fixes.

That human bottleneck created two problems: slow time to insight and low bandwidth for continuous improvement. The Digital Customer Experience and Engineering team set out to automate the heavy lifting so people could act on higher‑value work.

“We’re shifting from delivery‑focused work to becoming AI‑native—treating AI as a core capability so we can build smarter products faster.”

— Anna Rysicka, Software Engineer Team Leader, Hapag‑Lloyd

Solution overview: how the pipeline creates decisions in days

Plain English: daily serverless jobs collect raw feedback, foundation models in Amazon Bedrock classify sentiment and summarize text, numeric embeddings let the system find semantically similar reports, and a searchable index plus an internal chatbot lets product teams explore issues instantly.

The pipeline, step by step

  1. Ingestion: AWS Lambda pulls feedback every day and stages it in Amazon S3.
  2. Classification & summarization: Amazon Bedrock invokes foundation models (Claude Sonnet 4.6 was selected for agentic and multi‑turn tasks) to label sentiment and produce concise summaries (a minimal invocation sketch follows this list).
  3. Embedding generation: The same Bedrock layer generates embeddings—numeric vectors that act like content fingerprints, so the system can measure semantic similarity.
  4. Indexing & search: Amazon OpenSearch Service stores text and vectors for full‑text and semantic retrieval (serving dashboards and a conversational agent).
  5. Orchestration & agents: LangChain coordinates the pipeline steps; LangGraph runs a multi‑agent architecture for the internal chatbot that answers complex stakeholder queries.
  6. Delivery & reporting: Product teams get OpenSearch Dashboards and a natural‑language chatbot; automated biweekly reports summarize trends and priorities.
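To make step 2 concrete, here is a minimal sketch of how a daily job might call Amazon Bedrock to label sentiment and summarize one feedback item. The Region, model identifier, prompt wording and JSON output contract are illustrative assumptions, not Hapag‑Lloyd's actual configuration.

```python
import json
import boto3

# Region, model ID, prompt wording and output contract are illustrative assumptions.
bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")
MODEL_ID = "anthropic.claude-sonnet-example"  # placeholder: substitute the model chosen for the workload

SYSTEM_PROMPT = (
    "You classify shipping-customer feedback. Respond with JSON containing "
    "'sentiment' (positive|neutral|negative) and 'summary' (one sentence)."
)

def classify_feedback(text: str) -> dict:
    """Label sentiment and produce a concise summary for a single feedback item."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": text}]}],
        inferenceConfig={"maxTokens": 300, "temperature": 0},
    )
    raw = response["output"]["message"]["content"][0]["text"]
    return json.loads(raw)  # e.g. {"sentiment": "negative", "summary": "..."}

if __name__ == "__main__":
    print(classify_feedback("The Excel upload keeps timing out during peak hours."))
```

In the pipeline described above, a Lambda function would loop over the day's staged items in S3 and write the labeled results back for indexing.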

Some engineering details matter because they drive business outcomes: Amazon Bedrock cross‑Region inference (CRIS) spans EU Regions to handle traffic bursts and data‑residency requirements; guardrails are deployed as infrastructure as code using CloudFormation; and monitoring and audit trails come from CloudWatch and CloudTrail, so model invocations and changes are observable.
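As a rough illustration of how those details show up in code, the sketch below routes a request through an EU cross‑Region inference profile and attaches a guardrail at invocation time. The profile ID and guardrail identifiers are placeholders; in practice they would come from the CloudFormation stack that provisions the guardrail.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")

# Placeholders: in practice these IDs come from the CloudFormation stack outputs.
MODEL_ID = "eu.anthropic.claude-sonnet-example"  # "eu." prefix denotes a cross-Region inference profile
GUARDRAIL_ID = "example-guardrail-id"
GUARDRAIL_VERSION = "1"

def guarded_invoke(prompt: str) -> str:
    """Invoke the model via the EU inference profile with the guardrail applied to input and output."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        guardrailConfig={
            "guardrailIdentifier": GUARDRAIL_ID,
            "guardrailVersion": GUARDRAIL_VERSION,
        },
    )
    return response["output"]["message"]["content"][0]["text"]
```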

Why these choices?

  • Amazon Bedrock gives a managed foundation‑model layer so teams don’t run their own inference stack while still getting modern LLM capabilities for classification, summaries and embeddings.
  • OpenSearch combines full‑text search with vector search capability, making it a single place to store both raw text and embeddings for semantic retrieval (a minimal indexing and query sketch follows this list).
  • LangChain & LangGraph provide reliable orchestration and multi‑agent flows so the system behaves like an intelligent internal assistant—rather than a brittle script.
  • Serverless (Lambda + S3) keeps ingestion simple and inexpensive at variable scale.
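The sketch below illustrates the embedding and retrieval pattern referenced above: the feedback text and its vector live in the same OpenSearch index, so one store serves both keyword search and semantic similarity. The domain endpoint, index name, embedding model and vector dimension are assumptions, and authentication is omitted for brevity.

```python
import json
import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")
search = OpenSearch(hosts=[{"host": "feedback-domain.example.com", "port": 443}], use_ssl=True)  # auth omitted

INDEX = "customer-feedback"  # hypothetical index name

def embed(text: str) -> list[float]:
    """Turn feedback text into a vector; Titan Text Embeddings V2 is an assumption, not the confirmed model."""
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=json.dumps({"inputText": text}))
    return json.loads(resp["body"].read())["embedding"]

# One-time setup: a full-text field and a knn_vector field in the same index.
search.indices.create(
    index=INDEX,
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "sentiment": {"type": "keyword"},
                "embedding": {"type": "knn_vector", "dimension": 1024},
            }
        },
    },
)

# Index one classified item, then retrieve semantically similar feedback for a dashboard or the chatbot.
doc = {"text": "Upload validation is too slow", "sentiment": "negative"}
doc["embedding"] = embed(doc["text"])
search.index(index=INDEX, body=doc)

similar = search.search(
    index=INDEX,
    body={"size": 5, "query": {"knn": {"embedding": {"vector": embed("Excel import errors"), "k": 5}}}},
)
```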

Measurable outcomes and product impact

Operational results:

  • Volume: ~15,000 feedback items processed per month.
  • Accuracy: ~95% sentiment‑classification accuracy on a labeled test set (a small evaluation sketch follows this list).
  • Speed: Structured summaries produced in seconds; decision cycles shortened from weeks to days.
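A figure like 95% only means something if it is measured the same way over time. Below is a minimal sketch of such a check, assuming a CSV export in which each row holds the model's label next to a human label; the column names are hypothetical.

```python
import csv

def sentiment_accuracy(path: str) -> float:
    """Share of items where the model's sentiment label matches the human label."""
    total = correct = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # assumed columns: model_sentiment, human_sentiment
            total += 1
            correct += row["model_sentiment"] == row["human_sentiment"]
    return correct / total if total else 0.0

print(f"Sentiment accuracy: {sentiment_accuracy('labeled_test_set.csv'):.1%}")
```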

Concrete product wins followed quickly. Issues surfaced by the system led to prioritizing features like a preview experience and a streamlined Excel upload flow. After implementation, Hapag‑Lloyd saw measurable reductions in negative feedback in those areas.

“The solution now produces structured summaries in seconds, enabling decisions within days rather than weeks.”

— Grzegorz Kaczor, Cloud Architect, Hapag‑Lloyd

Example stakeholder interaction (paraphrased):

Query: “Top complaints about Excel upload in the last 30 days”
Answer: “Primary issues: slow validation, unclear error messages, and format mismatch during peak uploads. Suggested priorities: clearer client‑side validation, sample templates, and server‑side timeout handling.”
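A heavily stripped‑down sketch of how such a query could flow through a LangGraph graph is shown below: one node retrieves matching feedback, another condenses it into an answer. The node bodies are stubs; in the real system they would call OpenSearch and Bedrock rather than return canned data.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ChatState(TypedDict, total=False):
    question: str
    hits: list[dict]
    answer: str

def retrieve(state: ChatState) -> dict:
    # Stub: the production node would run a semantic query against OpenSearch.
    return {"hits": [{"text": "Slow validation during peak uploads", "sentiment": "negative"}]}

def summarize(state: ChatState) -> dict:
    # Stub: the production node would ask a Bedrock model to condense the retrieved snippets.
    return {"answer": "Primary issue in retrieved feedback: " + state["hits"][0]["text"]}

graph = StateGraph(ChatState)
graph.add_node("retrieve", retrieve)
graph.add_node("summarize", summarize)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "summarize")
graph.add_edge("summarize", END)
chatbot = graph.compile()

print(chatbot.invoke({"question": "Top complaints about Excel upload in the last 30 days"})["answer"])
```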

Governance, costs and operational tradeoffs

Automation isn’t plug‑and‑play; it shifts responsibilities. The architecture includes guardrails and observability by design, but leadership must plan for privacy, cost and maintenance.

Data protection and privacy

Customer feedback often contains personally identifiable information (PII). Best practices to deploy from day one include automated PII redaction before model ingestion, encryption at rest and in transit, strict role‑based access controls, and documented retention and deletion policies aligned with GDPR.
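The source does not say which redaction mechanism Hapag‑Lloyd uses; as one possible approach, the sketch below masks PII spans with Amazon Comprehend before any text reaches a foundation model.

```python
import boto3

comprehend = boto3.client("comprehend", region_name="eu-central-1")

def redact_pii(text: str, language: str = "en") -> str:
    """Replace detected PII spans with their entity type before the text is sent to any model."""
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode=language)["Entities"]
    # Work backwards through the string so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: ent["BeginOffset"]] + f"[{ent['Type']}]" + text[ent["EndOffset"] :]
    return text

print(redact_pii("Please contact jane.doe@example.com about my booking."))
# e.g. "Please contact [EMAIL] about my booking."
```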

Costs and capacity

Major cost drivers are Bedrock inference, OpenSearch hosting, and cross‑region inference (CRIS). Managed inference reduces operational overhead but can be more expensive per call than self‑hosted models—so start with a pilot, measure cost per item, and optimize batching, caching and embeddings reuse to control spend.
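One simple lever, sketched below, is to reuse embeddings for text the pipeline has already seen instead of paying for a fresh inference call. The in‑memory dictionary stands in for whatever persistent store (DynamoDB, S3) a production pipeline would use, and the embedding model ID is illustrative.

```python
import hashlib
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")
_cache: dict[str, list[float]] = {}  # stand-in for a persistent cache such as DynamoDB or S3

def cached_embedding(text: str) -> list[float]:
    """Return a cached vector for previously seen text; only call Bedrock for new content."""
    key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    if key not in _cache:
        resp = bedrock.invoke_model(
            modelId="amazon.titan-embed-text-v2:0",  # illustrative embedding model
            body=json.dumps({"inputText": text}),
        )
        _cache[key] = json.loads(resp["body"].read())["embedding"]
    return _cache[key]
```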

Model lifecycle and human‑in‑the‑loop

Models drift as language and product context evolve. Practical controls include labeled validation sets, ongoing accuracy monitoring, an escalation path for edge cases, and a retraining/versioning cadence. Human review remains essential for low‑confidence classifications and to feed corrected labels back into the pipeline.
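A minimal sketch of that escalation path follows. It assumes each classification carries a confidence score (the source does not specify how confidence is derived) and uses a plain list as a stand‑in for the real review queue.

```python
CONFIDENCE_THRESHOLD = 0.8      # illustrative cut-off, tuned against the labeled validation set
review_queue: list[dict] = []   # stand-in for a real ticketing or annotation tool

def route_classification(item: dict) -> str:
    """Auto-accept confident labels; queue low-confidence ones for human review."""
    if item.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        review_queue.append(item)
        return "needs_review"
    return "auto_accepted"

def record_correction(item: dict, human_label: str) -> dict:
    """A reviewer's corrected label becomes a new training/validation example."""
    return {"text": item["text"], "label": human_label, "source": "human_review"}

print(route_classification({"text": "Upload failed twice today", "sentiment": "negative", "confidence": 0.55}))
```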

Practical checklist for leaders

  • Start small: Pilot on the highest‑volume feedback channel and define success metrics (time‑to‑insight, accuracy, PM hours saved).
  • Protect data first: Implement PII redaction, RBAC, encryption, and retention rules before pushing data to models.
  • Build guardrails as infra: Use IaC (CloudFormation) to deploy consistent Bedrock Guardrails and input validation.
  • Measure cost drivers: Track inference calls, OpenSearch storage, and CRIS usage; optimize with caching and embeddings reuse.
  • Keep humans in the loop: Define when a human must review and how corrections get back into training data.
  • Plan for scale: Use cross‑region inference for resilience and design reports and dashboards for non‑technical stakeholders.

Implementation timeline: from pilot to scale

  • 0–60 days: Identify top feedback source, build simple ingestion, run Bedrock classification on a sampled set, measure baseline accuracy and time‑to‑insight.
  • 60–120 days: Add embeddings and OpenSearch indexing, build dashboards and a chatbot prototype, test guardrails and PII redaction.
  • 120–180 days: Roll out automated reports, set up CRIS for resilience, formalize retraining cadence, and scale to other feedback channels.

Common pitfalls and counterpoints

Don’t assume a managed service solves governance or cost issues automatically. Managed models simplify operations but create new monitoring needs: you’ll still need to log invocations, version prompts, and track drift. Embeddings and vector search improve semantic retrieval, but they aren’t a silver bullet for explainability—teams should surface the evidence behind recommendations (the original snippets and confidence scores).
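A lightweight way to cover the logging and evidence points above is to emit one structured record per model call. The field names below are illustrative; in a Lambda function, the print statement lands in CloudWatch Logs.

```python
import json
import time
import uuid

PROMPT_VERSION = "sentiment-v3"  # illustrative prompt version label

def log_invocation(feedback_id: str, model_id: str, prediction: dict, evidence: list[str]) -> None:
    """Emit a structured record so every prediction stays auditable and drift can be measured later."""
    record = {
        "invocation_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "feedback_id": feedback_id,
        "model_id": model_id,
        "prompt_version": PROMPT_VERSION,
        "prediction": prediction,          # e.g. {"sentiment": "negative", "confidence": 0.62}
        "evidence_snippets": evidence,     # the original text behind the recommendation
    }
    print(json.dumps(record))  # in Lambda, stdout is captured by CloudWatch Logs
```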

Also consider vendor lock‑in risk. Using a managed service like Amazon Bedrock accelerates time to value, but organizations with strict control or cost objectives might prefer self‑hosted models in the long run. A hybrid strategy—start managed, migrate specialized workloads later—is a pragmatic option.

Key questions leaders ask

How is sensitive customer data protected?
PII handling and redaction are essential: mask or remove identifiers before model calls, enforce encryption and RBAC, and implement retention/deletion policies aligned with regulation.

Can this pipeline scale with global operations?
Yes—CRIS across regions and OpenSearch provide scalable building blocks, but expect to budget for cross‑region inference and storage growth as volume increases.

Will the models degrade over time?
Model drift is real. Monitor accuracy and input distributions, maintain labeled validation sets, and use human review for low‑confidence cases. Plan for periodic retraining or prompt/version updates.

Can the approach be generalized beyond product feedback?
Absolutely. Sales, support, compliance and other teams can reuse the same pattern—Bedrock for language tasks, OpenSearch for semantic retrieval, orchestrated by LangChain—with domain‑specific prompts and guardrails.

Takeaways

Hapag‑Lloyd’s approach is a repeatable blueprint for turning unstructured feedback into prioritized product action: use generative AI (Amazon Bedrock) for language tasks, store embeddings in a vector‑enabled search (OpenSearch), orchestrate reliably with LangChain/LangGraph, and bake governance and observability into infrastructure from day one. Expect faster decisions, clearer prioritization and a shift in people’s work from repetitive analysis to strategic product improvements—while planning for privacy, cost and model maintenance.

Quick wins: start with one feedback channel, measure time‑to‑insight, enforce PII redaction, and push the first automated report into the hands of a product owner within 60 days.