
Building a Production MLOps Pipeline on Kubernetes

The Problem

Most ML portfolio projects stop at model.fit(). In production, that's where the work begins. Enterprise AI teams need someone who owns the full platform lifecycle: experiment tracking, model versioning, automated retraining, quality gates, serving infrastructure, and monitoring. This project is that platform, built end to end.

What It Does

A complete ML lifecycle pipeline covering every stage from training to production monitoring. It fine-tunes TinyLlama (1.1B parameters) on security Q&A data, tracks experiments in MLflow, orchestrates the pipeline with Argo Workflows, serves models through vLLM, and monitors everything with Prometheus and Grafana.

Architecture

The pipeline follows a 6-stage DAG where each step runs as an isolated container with S3-backed artifact passing:

  1. Ingest Data: Pull training data from source, validate schema, store in MinIO (S3-compatible)
  2. Preprocess: Tokenize, format for instruction tuning, split train/eval sets
  3. Train: Fine-tune TinyLlama with PyTorch + Transformers, log metrics to MLflow
  4. Evaluate: Run evaluation suite, compare against baseline thresholds (quality gate)
  5. Register: If evaluation passes, register model version in MLflow Model Registry
  6. Notify: Alert on completion with pass/fail status and key metrics

The evaluation step is the critical quality gate. Models that don't meet baseline thresholds are automatically rejected. No manual approval needed, no bad models reaching production.
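A minimal sketch of what that gate can look like inside the evaluate container. The eval_metrics.json artifact, metric names, and thresholds below are assumptions for illustration, not the project's actual values; the only contract that matters is the exit code, because a non-zero exit fails the Argo step and the downstream register step never runs:

    # evaluate_gate.py -- illustrative quality gate; metric names, thresholds,
    # and the eval_metrics.json artifact are assumptions, not project values
    import json
    import sys

    # Baseline thresholds the candidate model must meet. In the real pipeline
    # these would come from versioned config, not constants in the script.
    THRESHOLDS = {"eval_loss": 2.0, "exact_match": 0.55}

    def main() -> None:
        # The evaluation step writes its metrics to a JSON artifact for the gate to read.
        with open("eval_metrics.json") as f:
            metrics = json.load(f)

        failures = []
        if metrics["eval_loss"] > THRESHOLDS["eval_loss"]:
            failures.append(f"eval_loss {metrics['eval_loss']:.3f} > {THRESHOLDS['eval_loss']}")
        if metrics["exact_match"] < THRESHOLDS["exact_match"]:
            failures.append(f"exact_match {metrics['exact_match']:.3f} < {THRESHOLDS['exact_match']}")

        if failures:
            print("Quality gate FAILED:", "; ".join(failures))
            sys.exit(1)  # non-zero exit fails the Argo step; register/deploy never run
        print("Quality gate passed")

    if __name__ == "__main__":
        main()

Because the gate is just an exit code, the evaluation suite behind it can grow without changing the pipeline contract.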

Tech Stack

Layer | Tool | Why
Training | PyTorch + Transformers | Industry standard for LLM fine-tuning
Experiment Tracking | MLflow | Open-source, multi-cloud, S3-compatible backend
Artifact Storage | MinIO | S3 API locally, swaps to real S3 in production
Metadata | PostgreSQL | MLflow backend store for experiment metadata
Orchestration | Argo Workflows | K8s-native DAGs, each step is a pod
Serving | vLLM | PagedAttention, continuous batching, tensor parallelism
Monitoring | Prometheus + Grafana | Request latency, throughput, GPU utilization
GitOps | ArgoCD | Declarative deployments, automatic sync

Key Design Decisions

Why MLflow Over SageMaker Model Registry?

MLflow is open-source and works with any S3-compatible artifact store (MinIO, Cloudflare R2), along with other backends such as GCS. It avoids AWS lock-in and gives teams full control over the tracking server. For a platform engineer, showing that you can build and operate the ML infrastructure yourself is more valuable than clicking through a managed console.
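As a sketch of what that looks like from the training container (the service URLs, credentials, experiment name, and hyperparameters below are placeholders, not this project's actual config), pointing MLflow at MinIO is a tracking URI plus one S3 endpoint override:

    # train_logging.py -- illustrative MLflow-on-MinIO setup; URIs, credentials,
    # experiment name, and hyperparameters are placeholder assumptions
    import os

    import mlflow

    # MinIO speaks the S3 API, so the artifact store only needs an endpoint override.
    os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio.mlops.svc:9000"
    os.environ["AWS_ACCESS_KEY_ID"] = "minio"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "minio-secret"

    mlflow.set_tracking_uri("http://mlflow.mlops.svc:5000")
    mlflow.set_experiment("tinyllama-security-qa")

    with mlflow.start_run() as run:
        mlflow.log_params({"base_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "lr": 2e-5})
        mlflow.log_metric("eval_loss", 1.42)      # logged per step/epoch in the real job
        mlflow.log_artifact("eval_metrics.json")  # artifact lands in the MinIO bucket
        # If the quality gate passes and a model was logged under "model", the same
        # run can be promoted into the Model Registry:
        # mlflow.register_model(f"runs:/{run.info.run_id}/model", "tinyllama-security-qa")

Swapping MinIO for real S3 (or R2) later means changing the endpoint and credentials, not the training code.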

Why Argo Workflows Over Airflow?

Argo Workflows is Kubernetes-native. Each step in the pipeline is a pod, artifacts are stored in S3, and the DAG definition is a CRD. There's no separate scheduler, webserver, or database to manage. For teams already running K8s, Argo fits naturally into the existing operational model.

Why vLLM Over KServe?

vLLM is purpose-built for LLM inference. PagedAttention all but eliminates KV-cache fragmentation (the vLLM authors estimate prior serving systems waste 60-80% of that memory), continuous batching keeps the GPU saturated, and tensor parallelism enables multi-GPU serving. KServe is a great general-purpose serving framework, but for LLM-specific workloads, vLLM delivers better throughput with a simpler operational surface.
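Because vLLM exposes an OpenAI-compatible HTTP API, the client side stays boring. A hedged sketch, where the service hostname, port, and served model name are assumptions about this deployment rather than its real values:

    # query_vllm.py -- illustrative request to vLLM's OpenAI-compatible endpoint;
    # the service hostname and served model name are assumptions
    import requests

    VLLM_URL = "http://vllm.mlops.svc:8000/v1/chat/completions"

    payload = {
        "model": "tinyllama-security-qa",
        "messages": [{"role": "user", "content": "What does a Content-Security-Policy header do?"}],
        "max_tokens": 256,
        "temperature": 0.2,
    }

    resp = requests.post(VLLM_URL, json=payload, timeout=60)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])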

Automated Quality Gates

The evaluation step runs a suite of metrics against configurable thresholds. If a model doesn't meet the bar, the pipeline stops. No human needs to review every training run. This is how production ML teams prevent model regression: automated gates, not manual checks.

Two Deployment Paths

The project supports both local development and production Kubernetes deployment:

  • Local: Docker Compose stack with MLflow, PostgreSQL, MinIO, Prometheus, and Grafana. Run training with local Python, track experiments in the browser
  • Production: Full K8s manifests across five directories: mlflow/, vllm/, argo-workflows/, monitoring/, argocd/. HPA autoscaling on the vLLM deployment, ServiceMonitors for Prometheus, Grafana dashboard ConfigMaps

Monitoring

The Grafana dashboard tracks three categories of metrics:

  • Serving metrics: Request latency (p50/p95/p99), throughput (requests/sec), error rates (see the PromQL sketch after this list)
  • Resource metrics: GPU cache utilization, memory pressure, pod autoscaling events
  • Pipeline metrics: Training duration, evaluation scores over time, model promotion history
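These panels reduce to ordinary PromQL. As one illustrative example (the Prometheus URL is a placeholder, and the vllm:* histogram name should be verified against the metrics your vLLM version actually exports), the p95 serving-latency panel is roughly this query:

    # latency_check.py -- illustrative PromQL query for p95 request latency;
    # the Prometheus URL is a placeholder and the vllm:* histogram name should
    # be verified against the metrics the deployed vLLM version exports
    import requests

    PROM_URL = "http://prometheus.monitoring.svc:9090/api/v1/query"
    QUERY = (
        "histogram_quantile(0.95, "
        "sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))"
    )

    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if result:
        print(f"p95 request latency: {float(result[0]['value'][1]):.3f}s")
    else:
        print("no request samples in the last 5 minutes")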

Automated Retraining

A CronWorkflow schedules pipeline runs on a configurable cadence. When new training data arrives, the pipeline runs automatically: ingest, preprocess, train, evaluate, and (if the quality gate passes) register and deploy. RBAC configuration ensures the workflow has minimal permissions to run training jobs and update the model registry.
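For illustration only, the ingest step's "new data arrived" check can be as small as comparing object timestamps in the MinIO bucket against the last successful run; the endpoint, credentials, bucket, prefix, and marker convention below are assumptions, not the project's actual layout:

    # check_new_data.py -- illustrative freshness check for the ingest step;
    # endpoint, credentials, bucket, and prefix are placeholder assumptions
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio.mlops.svc:9000",  # MinIO speaks the S3 API
        aws_access_key_id="minio",
        aws_secret_access_key="minio-secret",
    )

    # Timestamp of the last successful pipeline run; in practice this would be
    # read from a marker object or passed in as a workflow parameter.
    last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)

    objects = s3.list_objects_v2(Bucket="training-data", Prefix="security-qa/")
    new_objects = [
        obj for obj in objects.get("Contents", [])
        if obj["LastModified"] > last_run
    ]
    print(f"{len(new_objects)} new training files since {last_run.isoformat()}")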

What I'd Do Differently

  • Feature store: Add Feast for feature management and serving, ensuring training/serving feature parity
  • A/B testing: Canary deployments for new model versions with traffic splitting and automated rollback on metric degradation
  • Data versioning: Integrate DVC for dataset versioning alongside model versioning
  • Cost tracking: Add per-experiment cost attribution (GPU hours, storage, inference costs) to MLflow as custom metrics

Takeaway

MLOps isn't about any single tool. It's about building a platform where data scientists can train, evaluate, and deploy models without filing tickets. The pipeline handles versioning, quality gates, serving, and monitoring. The platform team (that's me) handles the infrastructure, and the ML team ships models. That's the division of labor that scales.