## The Problem

Most ML portfolio projects stop at `model.fit()`. In production, that's where
the work begins. Enterprise AI teams need someone who owns the full platform lifecycle:
experiment tracking, model versioning, automated retraining, quality gates, serving
infrastructure, and monitoring. This project builds and operates that platform end to end,
the way an AI platform team would.
## What It Does
A complete ML lifecycle pipeline covering every stage from training to production monitoring. It fine-tunes TinyLlama (1.1B parameters) on security Q&A data, tracks experiments in MLflow, orchestrates the pipeline with Argo Workflows, serves models through vLLM, and monitors everything with Prometheus and Grafana.
## Architecture
The pipeline follows a 6-stage DAG where each step runs as an isolated container with S3-backed artifact passing:
- Ingest Data: Pull training data from source, validate schema, store in MinIO (S3-compatible)
- Preprocess: Tokenize, format for instruction tuning, split train/eval sets
- Train: Fine-tune TinyLlama with PyTorch + Transformers, log metrics to MLflow
- Evaluate: Run evaluation suite, compare against baseline thresholds (quality gate)
- Register: If evaluation passes, register model version in MLflow Model Registry
- Notify: Alert on completion with pass/fail status and key metrics
The evaluation step is the critical quality gate. Models that don't meet baseline thresholds are automatically rejected. No manual approval needed, no bad models reaching production.
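The gate itself is simple logic: compare evaluation metrics against configured minimums and fail the step on any miss. A minimal sketch (metric names and threshold values here are illustrative, not taken from the actual pipeline config):

```python
def passes_quality_gate(metrics: dict[str, float],
                        thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Compare eval metrics against baseline thresholds.

    Treats every metric as higher-is-better; a real gate would also
    support lower-is-better metrics like loss or latency.
    A metric missing from the eval report counts as a failure.
    """
    failures = [
        f"{name}: {metrics.get(name, float('-inf')):.3f} < {minimum:.3f}"
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    ]
    return (not failures, failures)


# Illustrative thresholds -- the pipeline loads these from config.
THRESHOLDS = {"eval_accuracy": 0.80, "rouge_l": 0.45}

ok, failures = passes_quality_gate(
    {"eval_accuracy": 0.83, "rouge_l": 0.41}, THRESHOLDS
)
if not ok:
    # In the pipeline, a non-zero exit here fails the Argo step,
    # so the Register stage never runs.
    print("Gate failed:", "; ".join(failures))
```

Exiting non-zero on failure is all the orchestrator needs: the DAG halts before registration, which is what makes the gate fully automatic.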
## Tech Stack
| Layer | Tool | Why |
|---|---|---|
| Training | PyTorch + Transformers | Industry standard for LLM fine-tuning |
| Experiment Tracking | MLflow | Open-source, multi-cloud, S3-compatible backend |
| Artifact Storage | MinIO | S3 API locally, swaps to real S3 in production |
| Metadata | PostgreSQL | MLflow backend store for experiment metadata |
| Orchestration | Argo Workflows | K8s-native DAGs, each step is a pod |
| Serving | vLLM | PagedAttention, continuous batching, tensor parallelism |
| Monitoring | Prometheus + Grafana | Request latency, throughput, GPU utilization |
| GitOps | ArgoCD | Declarative deployments, automatic sync |
## Key Design Decisions
### Why MLflow Over SageMaker Model Registry?
MLflow is open-source and works with any S3-compatible backend (MinIO, GCS, R2). It avoids AWS lock-in and gives teams full control over the tracking server. For a platform engineer, showing that you can build and operate the ML infrastructure yourself is more valuable than clicking through a managed console.
### Why Argo Workflows Over Airflow?
Argo Workflows is Kubernetes-native. Each step in the pipeline is a pod, artifacts are stored in S3, and the DAG definition is a CRD. There's no separate scheduler, webserver, or database to manage. For teams already running K8s, Argo fits naturally into the existing operational model.
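To make "the DAG definition is a CRD" concrete, here is a trimmed sketch of what such a Workflow resource looks like for the 6-stage pipeline. Template bodies are omitted and all names are illustrative, not copied from the repo's manifests:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: ingest
            template: ingest-data
          - name: preprocess
            template: preprocess
            dependencies: [ingest]
          - name: train
            template: train
            dependencies: [preprocess]
          - name: evaluate
            template: evaluate
            dependencies: [train]
          - name: register
            template: register      # only reached if evaluate exits 0
            dependencies: [evaluate]
          - name: notify
            template: notify
            dependencies: [register]
    # ...plus one container template per step, with S3 artifact
    # inputs/outputs wired through the configured artifact repository
```

Because a failed `evaluate` task stops its dependents, the quality gate falls out of the DAG structure for free.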
### Why vLLM Over KServe?
vLLM is purpose-built for LLM inference. PagedAttention nearly eliminates KV-cache memory waste (naive serving fragments away 60-80% of that memory), continuous batching keeps the GPU saturated, and tensor parallelism enables multi-GPU serving. KServe is a great general-purpose serving framework, but for LLM-specific workloads, vLLM delivers better throughput with a simpler operational surface.
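vLLM also ships an OpenAI-compatible HTTP server, so clients are plain REST calls. A hedged sketch of building such a request (the Service address, endpoint path defaults, and model name below are assumptions for illustration, not taken from the project):

```python
import json

# Assumed in-cluster Service address for the vLLM deployment.
VLLM_URL = "http://vllm.default.svc:8000/v1/completions"


def build_completion_request(prompt: str,
                             model: str = "tinyllama-security-qa",
                             max_tokens: int = 256,
                             temperature: float = 0.2) -> dict:
    """Build an OpenAI-style completion payload for a vLLM server.

    The server batches concurrent requests continuously, so clients
    just send independent requests; no client-side batching needed.
    """
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


payload = build_completion_request("What is a buffer overflow?")
print(json.dumps(payload, indent=2))
# Send with e.g.: requests.post(VLLM_URL, json=payload, timeout=60)
```

Keeping the payload builder pure makes it trivial to unit-test separately from the network call.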
## Automated Quality Gates
The evaluation step runs a suite of metrics against configurable thresholds. If a model doesn't meet the bar, the pipeline stops. No human needs to review every training run. This is how production ML teams prevent model regression: automated gates, not manual checks.
## Two Deployment Paths
The project supports both local development and production Kubernetes deployment:
- Local: Docker Compose stack with MLflow, PostgreSQL, MinIO, Prometheus, and Grafana. Run training with local Python, track experiments in the browser
- Production: Full K8s manifests across five directories: mlflow/, vllm/, argo-workflows/, monitoring/, argocd/. HPA autoscaling on the vLLM deployment, ServiceMonitors for Prometheus, Grafana dashboard ConfigMaps
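For a sense of the production manifests, here is a hedged sketch of what the HPA on the vLLM deployment might look like (resource names, replica bounds, and the target metric are illustrative, not copied from the repo):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

CPU utilization is a weak signal for a GPU-bound LLM server; in practice a custom metric such as request queue depth, exposed to the HPA via the Prometheus adapter, is the more common scaling target.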
## Monitoring
The Grafana dashboard tracks three categories of metrics:
- Serving metrics: Request latency (p50/p95/p99), throughput (requests/sec), error rates
- Resource metrics: GPU cache utilization, memory pressure, pod autoscaling events
- Pipeline metrics: Training duration, evaluation scores over time, model promotion history
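The serving panels reduce to a few PromQL queries. A sketch, assuming vLLM's exported Prometheus metric names, which vary across versions; verify against `/metrics` on your deployment before wiring dashboards:

```promql
# p95 end-to-end request latency over the last 5 minutes
histogram_quantile(0.95,
  sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))

# throughput: successfully completed requests per second
sum(rate(vllm:request_success_total[5m]))

# GPU KV-cache utilization (0-1)
vllm:gpu_cache_usage_perc
```

The same `histogram_quantile` pattern with `0.5` and `0.99` gives the p50/p99 panels.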
## Automated Retraining
A CronWorkflow schedules pipeline runs on a configurable cadence. When new training data arrives, the pipeline runs automatically: ingest, preprocess, train, evaluate, and (if the quality gate passes) register and deploy. RBAC configuration ensures the workflow has minimal permissions to run training jobs and update the model registry.
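A trimmed sketch of such a CronWorkflow (the schedule, resource names, and service account are illustrative); referencing a WorkflowTemplate keeps the cron trigger separate from the DAG definition:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: ml-pipeline-nightly
spec:
  schedule: "0 2 * * *"        # 02:00 daily; the cadence is configurable
  concurrencyPolicy: Forbid    # don't start a run while one is still active
  workflowSpec:
    workflowTemplateRef:
      name: ml-pipeline        # the 6-stage DAG, defined separately
    serviceAccountName: ml-pipeline-runner  # RBAC-scoped service account
```

`concurrencyPolicy: Forbid` matters here: a training run that outlives its schedule interval should delay the next run, not race it for the same artifacts.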
## What I'd Do Differently
- Feature store: Add Feast for feature management and serving, ensuring training/serving feature parity
- A/B testing: Canary deployments for new model versions with traffic splitting and automated rollback on metric degradation
- Data versioning: Integrate DVC for dataset versioning alongside model versioning
- Cost tracking: Add per-experiment cost attribution (GPU hours, storage, inference costs) to MLflow as custom metrics
## Takeaway
MLOps isn't about any single tool. It's about building a platform where data scientists can train, evaluate, and deploy models without filing tickets. The pipeline handles versioning, quality gates, serving, and monitoring. The platform team (that's me) handles the infrastructure, and the ML team ships models. That's the division of labor that scales.