
Building a Production LLM Inference Platform on AWS EKS

The Problem

When a company starts using LLMs in production, the same problems appear regardless of which model they pick:

  • Provider lock-in: code calls openai.chat.completions.create() directly, so switching to Bedrock or Anthropic means touching every service
  • No cost visibility: finance can't tell which team or feature is burning the AWS Bedrock bill
  • No rate limiting per consumer: one runaway notebook hammers the API and breaks production for everyone else
  • No fallback story: Bedrock has a regional outage, the whole product is down
  • No PII boundary: user inputs flow straight to a third-party model with no scanning
  • No cache: the same five prompts generate Claude calls all day

I built a platform layer that solves all of these at once. Applications speak the OpenAI Chat Completions API to the gateway and get multi-provider routing, automatic failover, per-tenant quotas, cost attribution, PII guardrails, and response caching for free.

Architecture

The system runs on AWS EKS with a FastAPI gateway, Redis for rate limiting and caching, RDS Postgres for usage history, and an ArgoCD GitOps delivery model. Upstream providers are AWS Bedrock (Claude 3.5/4) and OpenAI (GPT-4o family).

  • Gateway: FastAPI (Python 3.13). Handles auth, quotas, guardrails, caching, routing, usage recording, and telemetry
  • Provider abstraction: Every upstream (Bedrock, OpenAI) implements a four-method ABC. Adding a new provider is a single PR with no router changes
  • State: Redis for sliding-window rate limits and semantic cache; RDS Postgres for structured usage history
  • GitOps: ArgoCD app-of-apps pattern — every workload (gateway, Redis, observability, security) syncs from Git
  • Infrastructure: VPC, EKS 1.34, Karpenter autoscaling, IRSA for Bedrock access, Secrets Manager via External Secrets Operator

Request Lifecycle

Every call to POST /v1/chat/completions goes through eight steps:

  1. Auth: Authorization: Bearer sk-<tenant>-... is SHA-256 hashed and looked up against the tenant keystore
  2. Quotas: Sliding-window RPM check first (rejects with 429 + Retry-After), then token budget pre-check
  3. Model allowlist: If the tenant isn't authorized for the requested model, 403 before contacting any provider
  4. Guardrails: Messages are regex-scanned for PII (SSN, credit card, email, US phone). Matches are replaced with [REDACTED-{type}] and counted as a Prometheus event
  5. Cache: Deterministic key over (model, canonicalized messages, sampling params), sketched just after this list. Hits return immediately with cache="hit" in the response
  6. Routing: Concrete model → providers that serve it. Aliases (auto, cheapest, fastest) → the price-ranked default chain. Fallback walks the chain on retryable errors only
  7. Charge + persist: Token usage is debited against the tenant's TPM budget and a UsageRecord is written to Postgres
  8. Telemetry: Prometheus counters/histograms and OpenTelemetry gen_ai.* span attributes emitted on the way out
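
Step 5's key derivation is worth making concrete. A minimal sketch, assuming SHA-256 over canonical JSON (the gateway's exact canonicalization may differ):

    import hashlib
    import json

    def cache_key(model: str, messages: list[dict], params: dict) -> str:
        # Canonical JSON (sorted keys, fixed separators) makes the key
        # deterministic: semantically identical requests hash identically.
        payload = json.dumps(
            {"model": model, "messages": messages, "params": params},
            sort_keys=True,
            separators=(",", ":"),
            ensure_ascii=False,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    # Two requests differing only in dict key order share a key:
    k1 = cache_key("gpt-4o", [{"role": "user", "content": "hi"}], {"temperature": 0})
    k2 = cache_key("gpt-4o", [{"content": "hi", "role": "user"}], {"temperature": 0})
    assert k1 == k2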

Key Design Decisions

OpenAI-compatible API

Any team already using the OpenAI SDK can migrate by changing one environment variable — the base URL. This is the most realistic internal adoption path. Requiring teams to rewrite client code would make the gateway a hard sell.
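
With the official OpenAI Python SDK, the migration looks like this (the gateway URL is a placeholder; the SDK also reads OPENAI_BASE_URL from the environment, which is the one-variable path):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://llm-gateway.internal.example/v1",  # placeholder gateway URL
        api_key="sk-myteam-...",                             # gateway tenant key
    )

    # Application code is unchanged from here on.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)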

Provider abstraction as an ABC

I designed the provider interface so that adding Anthropic direct, Azure OpenAI, or Google Vertex is a focused PR — implement four methods (supports_model, list_models, chat, health). No router changes. No schema changes. This is the key extensibility point.
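
A sketch of what that interface can look like; the four method names come from the design above, while the signatures and the ChatRequest/ChatResponse types are illustrative stand-ins:

    from abc import ABC, abstractmethod

    class Provider(ABC):
        """One upstream (Bedrock, OpenAI, etc.). The router only talks to
        this interface, so new providers require no router changes."""

        @abstractmethod
        def supports_model(self, model: str) -> bool:
            """Whether this provider serves the given concrete model ID."""

        @abstractmethod
        def list_models(self) -> list[str]:
            """Concrete model IDs this provider can serve."""

        @abstractmethod
        async def chat(self, request: "ChatRequest") -> "ChatResponse":
            """Execute one chat completion against the upstream."""

        @abstractmethod
        async def health(self) -> bool:
            """Lightweight liveness check, consumed by the fallback chain."""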

Separate RPM and TPM limits

These are independent failure modes. A small number of huge prompts can exhaust the token budget without ever triggering the request-rate limit. Tracking both separately means you can give a team a high RPM limit for low-latency tooling while still capping their token spend on expensive models.
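
In configuration terms that is two independent knobs per tenant; the names and numbers below are illustrative:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TenantLimits:
        rpm: int  # requests per minute: protects availability and latency
        tpm: int  # tokens per minute: caps spend on expensive models

    # Latency-sensitive tooling: many small calls, tight token cap.
    tooling = TenantLimits(rpm=600, tpm=100_000)
    # Batch summarization: few calls, huge prompts.
    batch = TenantLimits(rpm=30, tpm=2_000_000)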

Redis sliding window via sorted sets

Fixed-window rate limiting has edge-case burst problems (double the rate at window boundaries). I implemented sliding windows via Redis sorted sets — ZADD + ZREMRANGEBYSCORE + ZCARD as a Lua script. O(log N), atomic, no clock-drift issues, scales to multiple gateway pods behind a load balancer.
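
A minimal redis-py sketch of that limiter; the key naming and script details are my reconstruction, not the production code:

    import uuid

    import redis

    r = redis.Redis()

    # Trim entries older than the window, count what remains, and record the
    # new request only if it fits, all inside one atomic script. Using the
    # Redis server clock (TIME) keeps multiple gateway pods consistent;
    # replicate_commands() permits that non-deterministic call before writes
    # (a no-op on modern Redis, which replicates script effects by default).
    SLIDING_WINDOW = r.register_script("""
    redis.replicate_commands()
    local t      = redis.call('TIME')
    local now_ms = t[1] * 1000 + math.floor(t[2] / 1000)
    local window = tonumber(ARGV[1])
    local limit  = tonumber(ARGV[2])
    redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now_ms - window)
    if redis.call('ZCARD', KEYS[1]) < limit then
        redis.call('ZADD', KEYS[1], now_ms, ARGV[3])  -- unique member per request
        redis.call('PEXPIRE', KEYS[1], window)
        return 1
    end
    return 0
    """)

    def allow_request(tenant: str, limit_rpm: int) -> bool:
        return bool(SLIDING_WINDOW(
            keys=[f"ratelimit:{tenant}"],
            args=[60_000, limit_rpm, uuid.uuid4().hex],
        ))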

URL-encoding credentials in code, not YAML

This one cost me hours. The RDS-managed password contained #, ?, and () characters. When I tried to compose the database URL in Kubernetes env vars using $(DB_PASSWORD), the URL parser treated # as a fragment delimiter and silently truncated the password. The fix: compose the URL in Python using urllib.parse.quote_plus on each component. Kubernetes $(VAR) expansion is plain text substitution — never use it to assemble URLs containing credentials or user-supplied data.
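
The resulting helper is tiny; the psycopg driver prefix is illustrative:

    from urllib.parse import quote_plus

    def database_url(user: str, password: str, host: str, port: int, name: str) -> str:
        # Quote every component so '#', '?', and '()' survive URL parsing
        # instead of being read as fragment or query delimiters.
        return (
            f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{quote_plus(name)}"
        )

    print(database_url("app", "p#ss?(word)", "db.internal", 5432, "gateway"))
    # postgresql+psycopg://app:p%23ss%3F%28word%29@db.internal:5432/gateway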

Aliases vs. concrete models

If a caller asks for gpt-4o, they get GPT-4o or a 503. They never silently get Claude. Quietly substituting models breaks evaluations, reproducibility, and cost forecasts. Aliases (auto, cheapest, fastest) are explicitly opt-in.
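
The resolution rule reduces to a few lines; the alias chains and model IDs here are illustrative:

    from fastapi import HTTPException

    # Explicit opt-in aliases; order encodes preference (price-ranked here).
    ALIAS_CHAINS: dict[str, list[str]] = {
        "cheapest": ["claude-3-5-haiku", "gpt-4o-mini", "gpt-4o"],
        "fastest": ["gpt-4o-mini", "claude-3-5-haiku"],
    }

    def candidate_models(requested: str, available: set[str]) -> list[str]:
        if requested in ALIAS_CHAINS:
            # The caller opted in to substitution: walk the chain.
            return [m for m in ALIAS_CHAINS[requested] if m in available]
        if requested in available:
            # Concrete model: exactly this model or nothing.
            return [requested]
        raise HTTPException(status_code=503, detail=f"{requested} is unavailable")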

Observability Stack

The full observability stack deploys alongside the application via ArgoCD: Prometheus + Grafana + Loki + Tempo + OpenTelemetry Collector in the monitoring namespace.

LLM-specific metrics I expose (definitions sketched after the list):

  • Request rate and P95 latency by provider and upstream model
  • Token volumes (prompt vs. completion, by tenant)
  • USD cost per request, per tenant, per model — computed in the gateway from a maintained price table
  • Cache hit rate — what fraction of requests avoided a provider call
  • Guardrail events — PII type counts per tenant, for compliance audit
  • Circuit breaker state — per provider, open/closed/half-open transitions
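
With prometheus_client the definitions look roughly like this; the metric and label names are illustrative, not the gateway's exact ones:

    from prometheus_client import Counter, Histogram

    LLM_REQUESTS = Counter(
        "llm_requests_total", "Chat completions processed",
        ["tenant", "provider", "model", "cache"],   # cache: hit | miss
    )
    LLM_LATENCY = Histogram(
        "llm_request_duration_seconds", "End-to-end request latency",
        ["provider", "model"],
    )
    LLM_TOKENS = Counter(
        "llm_tokens_total", "Token volume",
        ["tenant", "model", "kind"],                # kind: prompt | completion
    )
    LLM_COST = Counter(
        "llm_cost_usd_total", "Estimated USD cost from the price table",
        ["tenant", "model"],
    )
    GUARDRAIL_EVENTS = Counter(
        "llm_guardrail_redactions_total", "PII redactions by type",
        ["tenant", "pii_type"],
    )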

OpenTelemetry traces use the GenAI semantic conventions (gen_ai.system, gen_ai.response.model, gen_ai.usage.input_tokens) so spans drop straight into Tempo or any vendor LLM observability view without extra configuration.
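
Emitting them is plain attribute-setting on the span; the literal values below stand in for what the gateway pulls from the provider response:

    from opentelemetry import trace

    tracer = trace.get_tracer("llm-gateway")

    # Span name follows the GenAI convention "{operation} {model}".
    with tracer.start_as_current_span("chat gpt-4o") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4o")
        span.set_attribute("gen_ai.response.model", "gpt-4o-2024-08-06")
        span.set_attribute("gen_ai.usage.input_tokens", 42)
        span.set_attribute("gen_ai.usage.output_tokens", 128)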

Platform Lessons Learned

RDS ManageMasterUserPassword is a different code path

When manage_master_user_password = true, RDS generates the password and stores it in an RDS-owned Secrets Manager entry. Terraform's random_password resource is completely bypassed. I spent two hours chasing an InvalidPasswordError before realising the password I was reading from our app secret and the password RDS was actually using were different values. Lesson: when using RDS-managed passwords, the source of truth is the RDS-owned secret, not your Terraform state.
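
In code, the practical consequence is to ask RDS where its secret lives rather than assuming you already know; the instance identifier below is a placeholder:

    import json

    import boto3

    rds = boto3.client("rds")
    secrets = boto3.client("secretsmanager")

    # With manage_master_user_password = true the instance carries a pointer
    # to the RDS-owned secret; that secret is the source of truth.
    instance = rds.describe_db_instances(
        DBInstanceIdentifier="llm-gateway-db",  # placeholder
    )["DBInstances"][0]
    secret_arn = instance["MasterUserSecret"]["SecretArn"]

    creds = json.loads(
        secrets.get_secret_value(SecretId=secret_arn)["SecretString"]
    )
    # creds["username"] / creds["password"] are what RDS actually uses.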

Bitnami container images moved in April 2025

Broadcom deprecated public access to docker.io/bitnami/* tags. Old tags moved to docker.io/bitnamilegacy/*, and the charts guard against the substituted registry unless you set global.security.allowInsecureImages: true in Helm values. Any project pinning Bitnami charts without locking image registries will break silently weeks after deploy, the first time a rescheduled pod has to re-pull its image.

Kyverno restricted policies must carve out system namespaces

Cluster-wide restricted policies (no root, seccomp required, no hostPath) block Velero, kubecost, and kube-system workloads that legitimately need those capabilities. The fix: scope policyExclude to [kube-system, velero, kyverno, monitoring, cert-manager] and start with failurePolicy: Ignore so a misconfiguration doesn't block all workloads.

ExternalSecrets ownership is single-owner

Two ExternalSecrets targeting the same Kubernetes Secret — even with creationPolicy: Merge — race on the reconcile.external-secrets.io/managed-by label. Only one can own the Secret. Decide which chart owns each Secret and have the others only mount it.

What I'd Do Differently

  • Streaming responses: Caching a complete JSON response body is easy; caching SSE streams cleanly requires a different design. I'd add streaming cache as a follow-up after the non-streaming path is validated
  • Semantic cache: The current cache uses exact key matching (SHA-256 over the prompt). The embeddings-based semantic cache path is implemented but not production-enabled — it needs a vector store and similarity threshold tuning
  • Model lifecycle: The gateway handles inference. Add MLflow model registry integration so provider model IDs are managed as versioned artifacts rather than config strings
  • Workflow orchestration: Multi-step AI tasks (summarize → classify → route) should be Temporal workflows, not nested HTTP calls

Results

The gateway adds under 800ms of overhead per completion (excluding model latency), enforces per-tenant quotas across multiple concurrent pods, and records structured cost and token data for every request. The full platform — EKS cluster, Redis, RDS, observability stack, security tooling — deploys from scratch with a single python RUNME.py all command.

More than anything, this project demonstrates what platform engineering for AI actually looks like: not model research, but the infrastructure layer that makes LLMs safe, cost-controlled, and observable when running at company scale.