Serverless AI Gateway with AppSync Events & Amazon Bedrock — Per-User Streaming and Token Metering

Build a serverless AI Gateway with AppSync Events and Amazon Bedrock

TL;DR

  • Problem: streaming LLM responses, per-user authorization, and token-level cost control are the plumbing that moves LLM prototypes into production.
  • Pattern: use AppSync Events (AWS WebSocket APIs) as the real-time backbone, Amazon Cognito for identity, Amazon Bedrock for model streaming, Lambda for handlers, and DynamoDB + Firehose → S3 → Glue → Athena for metering and analytics.
  • When to use: you need low-ops scaling, per-user streams, and real-time token metering for cost governance or billing attribution.

Real-world hook

Picture a customer support app that must stream a personalized LLM answer to a single user while billing per token and alerting ops when costs spike. Calling the model directly from every client gives up that control: it becomes hard to enforce per-user privacy, meter consumption, or aggregate behavioral analytics. A serverless AI gateway solves this by inserting a controllable middleware layer between clients and foundation models.

Architecture overview

At a high level, the gateway sits between client applications and foundation models. It enforces identity, routes per-user streams over WebSockets, meters token usage in real time, provides structured logs for observability, and optionally caches repeat responses. All of this runs on serverless building blocks, so teams avoid operating their own infrastructure fleet.

Core components at a glance

  • AppSync Events — AWS WebSocket APIs for real-time messaging and per-subscriber delivery.
  • Amazon Cognito — identity provider; use the immutable sub (user ID) to namespace channels.
  • AWS Lambda — subscribe/publish handlers and orchestration logic.
  • Amazon Bedrock (Converse/ConverseStream) — foundation models that stream responses and emit token usage metadata.
  • Amazon DynamoDB — atomic token counters + optional prepared-response cache (with TTLs).
  • CloudWatch & Lambda Powertools — structured logging, correlation IDs, and metrics.
  • Kinesis Data Firehose → S3 (Parquet) → Glue → Athena — serverless analytics pipeline.

Sequence (simplified)

  1. Client authenticates with Cognito and opens a WebSocket connection to AppSync Events.
  2. Client publishes a message to an inbound channel namespaced by their Cognito sub (e.g., Inbound-Messages/{sub}).
  3. ChatHandler Lambda validates the channel, forwards the request to Bedrock ConverseStream, and consumes the resulting event stream.
  4. Bedrock streams tokens and emits a metadata event with inputTokens/outputTokens/latency.
  5. Gateway writes token counts to DynamoDB (atomic ADD), relays streamed chunks back to the user’s outbound channel, and emits structured logs to Firehose.
  6. Firehose delivers Parquet files to S3; Athena queries provide business analytics and billing attribution.
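
To make steps 3–5 concrete, here is a minimal Python sketch of the ConverseStream loop. The model ID is illustrative, and publish_chunk stands in for whatever helper relays deltas to the user's Outbound-Messages/{sub} channel:

import boto3

bedrock = boto3.client("bedrock-runtime")

def stream_reply(prompt, publish_chunk, model_id="anthropic.claude-3-5-sonnet-20240620-v1:0"):
    response = bedrock.converse_stream(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    usage, metrics = {}, {}
    for event in response["stream"]:
        if "contentBlockDelta" in event:
            # Relay each text delta to the caller's outbound channel as it arrives
            publish_chunk(event["contentBlockDelta"]["delta"].get("text", ""))
        elif "metadata" in event:
            # Final event carries inputTokens/outputTokens/totalTokens and latencyMs
            usage = event["metadata"].get("usage", {})
            metrics = event["metadata"].get("metrics", {})
    return usage, metrics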

AppSync Events provides a secure, scalable WebSocket layer that enables low-latency propagation of model events to individual users; combined with Cognito’s immutable sub, you get straightforward per-user channels and authorization.

Channel design & authentication

Use two channels per user: Inbound-Messages/{sub} (client → gateway) and Outbound-Messages/{sub} (gateway → client). The immutable Cognito sub is ideal because it never changes and can be validated in Lambda handlers. SubscribeHandler and ChatHandler should reject any request where the channel namespace doesn’t match the caller’s sub.

Authorization pattern (high level): validate the JWT from Cognito on every connect and message, extract the sub, and ensure the path’s first segment equals that sub. This enforces strict one-to-one streams and prevents cross-user eavesdropping.
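
As a minimal sketch, assuming the handler has already verified the Cognito JWT and extracted its claims (variable and field names here are illustrative), the check reduces to a string comparison on the channel path:

def authorize_channel(channel_path: str, claims: dict) -> None:
    """Reject any channel whose namespace segment is not the caller's own sub.

    channel_path: e.g. "Inbound-Messages/<sub>" or "Outbound-Messages/<sub>"
    claims: verified Cognito JWT claims containing the immutable "sub"
    """
    segments = channel_path.strip("/").split("/")
    if len(segments) != 2 or segments[1] != claims.get("sub"):
        raise PermissionError(f"Caller may not use channel {channel_path}")

Apply the same check in both SubscribeHandler and ChatHandler so publishes and subscriptions are gated identically.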

Real-time token metering (DynamoDB)

Amazon Bedrock’s ConverseStream emits per-stream metadata like inputTokens, outputTokens, and latencyMs. Capture those metadata events and update atomic counters in DynamoDB so product, finance, and security have per-user, per-period visibility.

Table shape (example)

  • Partition key: user_id (Cognito sub)
  • Sort key: period_id (e.g., 10min:2026-01-26:15:30 or monthly:2026-01)
  • Attributes: input_tokens, output_tokens, total_tokens, ttl

Atomic update on metadata event (Python/boto3 sketch; the table name is illustrative)

import time
from datetime import datetime, timezone

import boto3

table = boto3.resource("dynamodb").Table("TokenUsage")

def on_bedrock_metadata_event(user_id, input_tokens, output_tokens, total_tokens):
    now = datetime.now(timezone.utc)
    # Floor the minute to a multiple of 10 so keys match e.g. 10min:2026-01-26:15:30
    period_10 = now.strftime("10min:%Y-%m-%d:%H:") + f"{now.minute // 10 * 10:02d}"
    period_month = now.strftime("monthly:%Y-%m")
    # Keep 10-minute rows for an hour (rolling windows); keep monthly rows ~90 days
    retention = {period_10: 3600, period_month: 90 * 24 * 3600}
    for period_id, keep_seconds in retention.items():
        table.update_item(
            Key={"user_id": user_id, "period_id": period_id},
            # ADD creates the counters on first write and increments atomically after that
            UpdateExpression="ADD input_tokens :i, output_tokens :o, total_tokens :t "
                             "SET #ttl = if_not_exists(#ttl, :exp)",
            ExpressionAttributeNames={"#ttl": "ttl"},
            ExpressionAttributeValues={":i": input_tokens, ":o": output_tokens,
                                       ":t": total_tokens,
                                       ":exp": int(time.time()) + keep_seconds},
        )

Use TTL on short-lived 10-minute rows to roll data off automatically. For rolling windows (e.g., last 10 minutes), query the recent sorted period keys and sum them. For monthly quotas, check the monthly:YYYY-MM row.
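
As a sketch of the rolling-window read, assuming the table shape above (table name again illustrative): because the zero-padded 10-minute sort keys order chronologically, a single BETWEEN query covers the window.

from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("TokenUsage")

def tokens_in_last_n_minutes(user_id: str, minutes: int = 10) -> int:
    def bucket(ts):
        # Same 10-minute bucket format used when writing the counters
        return ts.strftime("10min:%Y-%m-%d:%H:") + f"{ts.minute // 10 * 10:02d}"
    now = datetime.now(timezone.utc)
    resp = table.query(
        KeyConditionExpression=Key("user_id").eq(user_id)
        & Key("period_id").between(bucket(now - timedelta(minutes=minutes)), bucket(now))
    )
    return sum(int(item.get("total_tokens", 0)) for item in resp["Items"])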

Caching prepared responses — patterns and pitfalls

Caching can drastically reduce model spend for deterministic workloads (FAQ answers, documentation lookups, static tool outputs). Key rules:

  • Scope cache keys to user and context when responses can be personalized (e.g., {user_id}:{prompt_hash}).
  • Set conservative TTLs and add cache versioning so you can invalidate stale content quickly.
  • Never cache PII-bearing outputs that might be queried by other users.

When identical queries are common and safe to share, use a global cache key and keep TTLs longer; otherwise, prefer per-user caches. DynamoDB is a handy, low-latency option for small prepared-response caches.
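
A minimal per-user cache sketch in DynamoDB might look like the following; the table name, key schema, and default TTL are assumptions to adapt to your own setup:

import hashlib
import time

import boto3

cache = boto3.resource("dynamodb").Table("PreparedResponses")

def _key(user_id: str, prompt: str) -> str:
    # Scope the key to the user so personalized answers never leak across users
    return f"{user_id}:{hashlib.sha256(prompt.encode()).hexdigest()}"

def get_cached(user_id: str, prompt: str):
    item = cache.get_item(Key={"cache_key": _key(user_id, prompt)}).get("Item")
    # DynamoDB TTL deletion is eventual, so re-check expiry on read
    if item and int(item["expires_at"]) > time.time():
        return item["response"]
    return None

def put_cached(user_id: str, prompt: str, response: str, ttl_seconds: int = 3600):
    cache.put_item(Item={"cache_key": _key(user_id, prompt), "response": response,
                         "expires_at": int(time.time()) + ttl_seconds})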

Caching prepared responses can save significant token costs for repeat, deterministic queries — but always scope cache keys to the user and apply strict TTLs to avoid data leakage.

Serverless analytics: Firehose → S3 (Parquet) → Glue → Athena

Emit structured JSON logs from your Lambdas (Lambda Powertools helps here) and send them to Kinesis Data Firehose, which converts the records to Parquet and delivers partitioned files to S3. The Glue Data Catalog then makes the dataset queryable from Athena with very low ops overhead.
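
As a sketch of the kind of record this pipeline expects, here is one structured log entry emitted with Lambda Powertools' Logger; the field names are illustrative and should match your Glue table schema:

from aws_lambda_powertools import Logger

logger = Logger(service="ai-gateway")  # service name is illustrative

def log_token_usage(user_id, model_id, usage, latency_ms):
    # One JSON log line per conversation; Firehose picks these up for Parquet conversion
    logger.info(
        "token_usage",
        extra={
            "user_id": user_id,
            "model_id": model_id,
            "input_tokens": usage.get("inputTokens", 0),
            "output_tokens": usage.get("outputTokens", 0),
            "total_tokens": usage.get("totalTokens", 0),
            "latency_ms": latency_ms,
        },
    )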

Sample Athena query — tokens per model, last 30 days

SELECT
  model_id,
  user_id,
  date_trunc('day', event_time) AS day,
  SUM(total_tokens) AS tokens
FROM ai_gateway_logs
WHERE event_time >= current_date - interval '30' day
GROUP BY model_id, user_id, date_trunc('day', event_time)
ORDER BY day DESC
LIMIT 100;

Cost drivers and a worked example

Major cost levers:

  • Model token usage (typically dominant)
  • AppSync Events API operations (connects, publishes)
  • DynamoDB writes for metering and cache
  • Lambda invocations and runtime duration
  • Firehose delivery and storage in S3

Worked example

Bedrock sample rates (example model): $3 per 1M input tokens and $15 per 1M output tokens. If a conversation emits 1,062 input tokens and 512 output tokens (total 1,574):

  • Input cost = 1,062 * $3 / 1,000,000 = $0.00319
  • Output cost = 512 * $15 / 1,000,000 = $0.00768
  • Total per conversation ≈ $0.01087 (~1.1 cents)
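
The same arithmetic as a tiny helper, with the example rates above hard-coded (substitute your model's current pricing):

INPUT_RATE = 3.00 / 1_000_000    # example $ per input token, not live Bedrock pricing
OUTPUT_RATE = 15.00 / 1_000_000  # example $ per output token

def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

print(round(conversation_cost(1_062, 512), 5))  # 0.01087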

Sample development footprint (very light usage) can be in the $35–55/month range, but production costs scale with token consumption and concurrency.

Operational concerns: scale, SLOs, testing, and governance

Key practical points to plan for:

  • Connections & concurrency: AppSync and regional limits impose constraints. Plan for connection sharding across regions or accounts if you expect millions of simultaneous WebSocket clients.
  • Cold starts & latency: Use provisioned concurrency for critical Lambdas or lightweight warmers for chat handlers that need consistent latency.
  • Long-running streams: Design a reconnection and checkpointing strategy for model streams that are interrupted mid-response.
  • Observability & SLOs: track metrics such as median model latency, connection success rate, tokens/minute, and cost per conversation. Alert on abnormal token spikes and sustained model latency regressions.
  • Testing: unit-test handlers, run integration tests against a mocked Bedrock or sandbox, and perform end-to-end load tests that simulate real conversation patterns.
  • Governance: redact PII in logs, encrypt data at rest, and enforce retention policies in S3 and DynamoDB.

Security checklist

  • Validate Cognito JWTs on every connection and message; verify sub matches the channel namespace.
  • Use TLS for client connections; enable encryption at rest for DynamoDB and S3.
  • Restrict Lambda roles and use least privilege for Bedrock access.
  • Redact or mask user messages in structured logs; only store previews if permitted by policy.
  • Regularly scan cached responses to ensure no sensitive data has been stored inadvertently.

Tradeoffs and evolution path

The serverless approach accelerates time to market and reduces ops, but it creates some vendor coupling. Mitigate lock-in by isolating the Bedrock adapter behind a thin interface so you can swap providers later. As agent orchestration grows in complexity (multi-model, multi-tool workflows), the gateway should evolve to capture step-level observability, policy enforcement, and cost attribution across agent steps.

When this pattern is overkill

  • Small experiments with a handful of users where direct model calls are simpler and cost is immaterial.
  • Ultra-low-latency on-prem workloads where a cloud-managed WebSocket layer is unacceptable.

Checklist & next steps

  1. Deploy the sample repo into a sandbox (see Resources) and validate WebSocket connect/auth flows.
  2. Implement basic token alerting: create CloudWatch alarms for tokens/minute and monthly quota thresholds (a minimal alarm sketch follows this list).
  3. Run integration tests with a Bedrock sandbox or mock and measure end-to-end latency.
  4. Decide caching rules and TTLs with privacy owners; implement safe cache scoping.
  5. Define SLOs/KPIs and add dashboards: connection health, median model latency, tokens per user, and cost per feature.
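
For step 2, a minimal boto3 sketch of a tokens-per-minute spike alarm; the namespace, metric name, threshold, and SNS topic are placeholders to align with whatever metrics your handlers actually emit:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="ai-gateway-token-spike",
    Namespace="AIGateway",                 # custom metric published by the handlers
    MetricName="TotalTokens",
    Statistic="Sum",
    Period=60,                             # one-minute buckets -> tokens per minute
    EvaluationPeriods=5,                   # sustained spike, not a single burst
    Threshold=50_000,                      # tune to your expected load
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ai-gateway-alerts"],  # placeholder topic
)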

Key questions — short answers

How do I stream LLM responses securely to individual users?

Use AppSync Events with per-user channels namespaced by Cognito’s immutable sub; validate the JWT and ensure Lambda handlers only allow subscriptions and publishes that match the user’s sub.

How can I meter token usage and implement rate limits?

Consume Bedrock’s stream metadata events, update atomic counters in DynamoDB keyed by user and period, and enforce static (monthly) and rolling (10-minute) windows for quotas and burst protection.

How should I build analytics on conversation data?

Emit structured logs from Lambdas to Firehose, store partitioned Parquet in S3, catalog with Glue, and query with Athena for low-ops analytics and billing reports.

Is caching safe and cost-effective?

Yes for deterministic, non-sensitive queries. Always scope cache keys to the user when content can be personalized, apply strict TTLs, and enforce governance to avoid leakage.

Resources

This serverless AI gateway pattern gives product, security, and finance teams the controls they need—per-user streams, token-level billing visibility, and serverless analytics—while keeping operational overhead low. The next step is to run the sample in a sandbox, define your caching and governance rules, and instrument token alerts so cost surprises don’t interrupt your roadmap.