The Problem
When a company starts using LLMs in production, the same problems appear regardless of which model they pick:
- Provider lock-in: code calls openai.chat.completions.create() directly, so switching to Bedrock or Anthropic means touching every service
- No cost visibility: finance can't tell which team or feature is burning the AWS Bedrock bill
- No rate limiting per consumer: one runaway notebook hammers the API and breaks production for everyone else
- No fallback story: Bedrock has a regional outage, the whole product is down
- No PII boundary: user inputs flow straight to a third-party model with no scanning
- No cache: the same five prompts generate Claude calls all day
I built a platform layer that solves all of these at once. Applications speak the OpenAI Chat Completions API to the gateway and get multi-provider routing, automatic failover, per-tenant quotas, cost attribution, PII guardrails, and semantic caching — for free.
Architecture
The system runs on AWS EKS with a FastAPI gateway, Redis for rate limiting and caching, RDS Postgres for usage history, and an ArgoCD GitOps delivery model. Upstream providers are AWS Bedrock (Claude 3.5/4) and OpenAI (GPT-4o family).
- Gateway: FastAPI (Python 3.13). Handles auth, quotas, guardrails, caching, routing, usage recording, and telemetry
- Provider abstraction: Every upstream (Bedrock, OpenAI) implements a four-method ABC. Adding a new provider is a single PR with no router changes
- State: Redis for sliding-window rate limits and semantic cache; RDS Postgres for structured usage history
- GitOps: ArgoCD app-of-apps pattern — every workload (gateway, Redis, observability, security) syncs from Git
- Infrastructure: VPC, EKS 1.34, Karpenter autoscaling, IRSA for Bedrock access, Secrets Manager via External Secrets Operator
Request Lifecycle
Every call to POST /v1/chat/completions goes through eight steps:
- Auth: Authorization: Bearer sk-&lt;tenant&gt;-... is SHA-256 hashed and looked up against the tenant keystore
- Quotas: Sliding-window RPM check first (rejects with 429 + Retry-After), then token budget pre-check
- Model allowlist: If the tenant isn't authorized for the requested model, 403 before contacting any provider
- Guardrails: Messages are regex-scanned for PII (SSN, credit card, email, US phone). Matches are replaced with [REDACTED-{type}] and counted as a Prometheus event
- Cache: Deterministic key over (model, canonicalized messages, sampling params). Hits return immediately with cache="hit" in the response
- Routing: Concrete model → providers that serve it. Aliases (auto, cheapest, fastest) → the price-ranked default chain. Fallback walks the chain on retryable errors only
- Charge + persist: Token usage is debited against the tenant's TPM budget and a UsageRecord is written to Postgres
- Telemetry: Prometheus counters/histograms and OpenTelemetry gen_ai.* span attributes emitted on the way out
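As a sketch, the guardrail step can be modeled like this. The patterns below are simplified placeholders for illustration, not the gateway's production regexes:

```python
import re

# Illustrative PII patterns -- the real gateway's regexes are more thorough
# (credit cards, international formats, etc.).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each PII match with [REDACTED-{type}] and count hits per type
    (the counts feed the per-tenant Prometheus guardrail metric)."""
    counts: dict[str, int] = {}
    for pii_type, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[REDACTED-{pii_type}]", text)
        if n:
            counts[pii_type] = n
    return text, counts
```

Redaction runs before the cache lookup, so redacted and unredacted variants of the same prompt never collide on a cache key.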
Key Design Decisions
OpenAI-compatible API
Any team already using the OpenAI SDK can migrate by changing one environment variable — the base URL. This is the most realistic internal adoption path. Requiring teams to rewrite client code would make the gateway a hard sell.
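To illustrate the compatibility claim with only the standard library (the gateway hostname and tenant key below are made up), here is the wire shape the OpenAI SDK emits -- the only thing that changes is the base URL:

```python
import json
import urllib.request

def chat_request(base_url: str, api_key: str, model: str,
                 messages: list[dict]) -> urllib.request.Request:
    """Build the same Chat Completions request the OpenAI SDK sends.
    Only base_url decides whether it reaches OpenAI or the gateway."""
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

# Hypothetical internal endpoint and tenant-scoped key:
req = chat_request("https://llm-gateway.internal/v1", "sk-acme-demo",
                   "gpt-4o", [{"role": "user", "content": "ping"}])
```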
Provider abstraction as an ABC
I designed the provider interface so that adding Anthropic direct, Azure OpenAI, or Google Vertex is a focused PR —
implement four methods (supports_model, list_models, chat, health).
No router changes. No schema changes. This is the key extensibility point.
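A minimal sketch of what that contract could look like -- the signatures and the toy provider are illustrative, not the production code:

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    """The four-method contract every upstream must satisfy.
    Method names follow the article; signatures are assumptions."""

    @abstractmethod
    def supports_model(self, model: str) -> bool: ...

    @abstractmethod
    def list_models(self) -> list[str]: ...

    @abstractmethod
    def chat(self, model: str, messages: list[dict], **params) -> dict: ...

    @abstractmethod
    def health(self) -> bool: ...

class EchoProvider(Provider):
    """Toy provider showing that the router only ever sees the ABC."""
    def supports_model(self, model: str) -> bool:
        return model == "echo-1"
    def list_models(self) -> list[str]:
        return ["echo-1"]
    def chat(self, model: str, messages: list[dict], **params) -> dict:
        return {"model": model, "choices": [{"message": messages[-1]}]}
    def health(self) -> bool:
        return True
```

Because the router depends only on this interface, a new upstream is a new subclass plus a registry entry.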
Separate RPM and TPM limits
These are independent failure modes. A small number of huge prompts can exhaust the monthly token budget without ever triggering the request-rate limit. Tracking both separately means you can give a team a high RPM limit for low-latency tooling while still capping their token spend on expensive models.
Redis sliding window via sorted sets
Fixed-window rate limiting has edge-case burst problems (double the rate at window boundaries).
I implemented sliding windows via Redis sorted sets — ZADD + ZREMRANGEBYSCORE + ZCARD
combined in a single Lua script. It's O(log N), atomic, has no clock-drift issues, and scales to multiple gateway pods behind a load balancer.
URL-encoding credentials in code, not YAML
This one cost me hours. The RDS-managed password contained #, ?, and () characters.
When I tried to compose the database URL in Kubernetes env vars using $(DB_PASSWORD), the URL parser
treated # as a fragment delimiter and silently truncated the password.
The fix: compose the URL in Python using urllib.parse.quote_plus on each component.
Kubernetes $(VAR) expansion is plain text substitution — never use it to assemble URLs containing
credentials or user-supplied data.
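A sketch of the fix -- the connection details here are placeholders:

```python
from urllib.parse import quote_plus

def build_db_url(user: str, password: str, host: str, db: str) -> str:
    """Percent-encode each component so characters like '#', '?', '('
    and ')' survive URL parsing instead of truncating the credential."""
    return (f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:5432/{db}")

# A password like the one RDS generated: '#' would otherwise be
# parsed as a fragment delimiter and everything after it dropped.
url = build_db_url("app", "p#ss?w(rd)", "db.internal", "usage")
```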
Aliases vs. concrete models
If a caller asks for gpt-4o, they get GPT-4o or a 503. They never silently get Claude.
Quietly substituting models breaks evaluations, reproducibility, and cost forecasts.
Aliases (auto, cheapest, fastest) are explicitly opt-in.
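A sketch of the resolution rule -- the model names and default chain below are placeholders:

```python
# Illustrative registry; real entries live in gateway config.
MODEL_PROVIDERS = {
    "gpt-4o": ["openai"],
    "claude-sonnet-4": ["bedrock"],
}
DEFAULT_CHAIN = ["claude-sonnet-4", "gpt-4o"]  # price-ranked
ALIASES = {"auto", "cheapest", "fastest"}

def resolve(model: str) -> list[str]:
    """Aliases expand to the price-ranked fallback chain; a concrete
    name resolves to exactly that model and nothing else."""
    if model in ALIASES:
        return DEFAULT_CHAIN
    if model in MODEL_PROVIDERS:
        return [model]  # this model or nothing -- no silent substitution
    raise LookupError(f"no provider serves {model}")  # surfaces as 503
```

The asymmetry is the point: fallback across models is only legal when the caller asked for an alias.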
Observability Stack
The full observability stack deploys alongside the application via ArgoCD:
Prometheus + Grafana + Loki + Tempo + OpenTelemetry Collector in the monitoring namespace.
LLM-specific metrics I expose:
- Request rate and P95 latency by provider and upstream model
- Token volumes (prompt vs. completion, by tenant)
- USD cost per request, per tenant, per model — computed in the gateway from a maintained price table
- Cache hit rate — what fraction of requests avoided a provider call
- Guardrail events — PII type counts per tenant, for compliance audit
- Circuit breaker state — per provider, open/closed/half-open transitions
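The per-request cost metric is plain arithmetic over the maintained price table; a sketch with made-up prices (real per-token prices change frequently and live in config):

```python
# Illustrative prices in USD per 1M tokens -- NOT current list prices.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}

def request_cost_usd(model: str, prompt_tokens: int,
                     completion_tokens: int) -> float:
    """Cost of one request, computed gateway-side from the price table
    so finance gets attribution without waiting for the provider bill."""
    p = PRICES[model]
    return (prompt_tokens * p["input"]
            + completion_tokens * p["output"]) / 1_000_000
```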
OpenTelemetry traces use the GenAI semantic conventions (gen_ai.system,
gen_ai.response.model, gen_ai.usage.input_tokens) so spans drop
straight into Tempo or any vendor LLM observability view without extra configuration.
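A sketch of the attribute mapping -- the values are illustrative; the keys follow the OpenTelemetry GenAI semantic conventions named above:

```python
def genai_span_attributes(provider: str, response_model: str,
                          prompt_tokens: int,
                          completion_tokens: int) -> dict:
    """Attributes set on each completion span. Standard keys mean any
    GenAI-aware backend can render these spans without custom mapping."""
    return {
        "gen_ai.system": provider,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": prompt_tokens,
        "gen_ai.usage.output_tokens": completion_tokens,
    }
```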
Platform Lessons Learned
RDS ManageMasterUserPassword is a different code path
When manage_master_user_password = true, RDS generates the password and stores it in
an RDS-owned Secrets Manager entry. Terraform's random_password resource is completely bypassed.
I spent two hours chasing an InvalidPasswordError before realising the password I was reading
from our app secret and the password RDS was actually using were different values.
Lesson: when using RDS-managed passwords, the source of truth is the RDS-owned secret, not your terraform state.
Bitnami container images moved in April 2025
Broadcom deprecated public access to docker.io/bitnami/* tags. Old tags moved to
docker.io/bitnamilegacy/* with a guard that rejects the rename unless you set
global.security.allowInsecureImages: true in Helm values.
Any project pinning Bitnami charts without locking image registries will hit this silently weeks after deploy.
Kyverno restricted policies must carve out system namespaces
Cluster-wide restricted policies (no root, seccomp required, no hostPath) block Velero, kubecost,
and kube-system workloads that legitimately need those capabilities. The fix:
scope policyExclude to [kube-system, velero, kyverno, monitoring, cert-manager]
and start with failurePolicy: Ignore so a misconfiguration doesn't block all workloads.
ExternalSecrets ownership is single-owner
Two ExternalSecrets targeting the same Kubernetes Secret — even with creationPolicy: Merge
— race on the reconcile.external-secrets.io/managed-by label. Only one can own the Secret.
Decide which chart owns each Secret and have the others only mount it.
What I'd Do Differently
- Streaming responses: Caching complete responses is easy; caching SSE streams cleanly requires a different design. I'd add streaming cache as a follow-up after the non-streaming path is validated
- Semantic cache: The current cache uses exact key matching (SHA-256 over the prompt). The embeddings-based semantic cache path is implemented but not production-enabled — it needs a vector store and similarity threshold tuning
- Model lifecycle: The gateway handles inference. Add MLflow model registry integration so provider model IDs are managed as versioned artifacts rather than config strings
- Workflow orchestration: Multi-step AI tasks (summarize → classify → route) should be Temporal workflows, not nested HTTP calls
Results
The gateway adds under 800ms of overhead per completion (excluding model latency),
enforces per-tenant quotas across multiple concurrent pods, and records structured cost and token
data for every request. The full platform — EKS cluster, Redis, RDS, observability stack,
security tooling — deploys from scratch with a single python RUNME.py all command.
More than anything, this project demonstrates what platform engineering for AI actually looks like: not model research, but the infrastructure layer that makes LLMs safe, cost-controlled, and observable when running at company scale.