Infrastructure DEC 15, 2024 5 min read

Zero-Downtime Kubeflow Upgrades: A Survival Guide

The 3am Call Nobody Wants

It was 2:47am when PagerDuty lit up. Our ML training pipelines had gone dark—not gradually degraded, but gone. The culprit? A rushed Kubeflow upgrade from 1.6 to 1.9 that we’d scheduled during the “safe window” (spoiler: there’s no such thing).

After a painful 40-minute rollback and two cups of coffee I didn’t need, I decided this wouldn’t happen again. This is what I learned upgrading Kubeflow cleanly in production.

The Pre-Game: Understanding Your Risk Surface

Before touching anything in production, map your dependencies. For us, this meant:

  • Namespace isolation: Kubeflow core components in kubeflow namespace, user workloads in ml-workloads
  • Storage layer: Persistent volumes for training artifacts, model registries, and Kubeflow metadata
  • API consumers: Jenkins jobs, our internal ML CLI, and data science notebooks all hitting Kubeflow APIs

The key insight: Kubeflow upgrades touch a lot of moving pieces simultaneously. You need a clear picture of what breaks if each piece fails.
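
To make that map concrete, it's worth scripting the inventory rather than eyeballing it. Here's a minimal sketch using the official kubernetes Python client; the namespace names match the setup above, the rest is illustrative:

# inventory.py -- rough inventory of what a Kubeflow upgrade will touch
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run in-cluster
core = client.CoreV1Api()

for ns in ("kubeflow", "ml-workloads"):
    pods = core.list_namespaced_pod(ns).items
    svcs = core.list_namespaced_service(ns).items
    pvcs = core.list_namespaced_persistent_volume_claim(ns).items
    print(f"{ns}: {len(pods)} pods, {len(svcs)} services, {len(pvcs)} PVCs")
    # The PVCs are the part you can't afford to lose: training artifacts,
    # model registries, and Kubeflow metadata all live here
    for pvc in pvcs:
        print(f"  pvc/{pvc.metadata.name} ({pvc.spec.storage_class_name})")

Every PVC this prints is something you need a backup-and-restore answer for before the upgrade starts.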

Step 1: The Cluster Dry Run

Never test upgrades in prod. I cannot stress this enough.

We created a staging cluster with:

# Mirror the production cluster config (same zones, instance types, Kubernetes version)
kops create cluster \
  --state=s3://kops-state \
  --name=kubeflow-staging.k8s.local \
  --zones=us-east-1a,us-east-1b \
  --master-zones=us-east-1a \
  --node-count=4 \
  --node-size=t3.xlarge \
  --kubernetes-version=1.26

We deployed production-like workloads (anonymized datasets, a subset of jobs) and ran the full 1.6→1.9 upgrade sequence. This caught three issues we’d never have found otherwise.
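
The checks between hops don’t need to be clever; submitting a tiny pipeline end to end catches most of the breakage. A sketch of the kind of smoke run that works here, assuming the kfp 2.x SDK and a kubectl port-forward to the pipelines UI (the component is a stand-in, not our real training step):

# staging_smoke.py -- end-to-end check after each upgrade hop on staging
import kfp
from kfp import dsl

@dsl.component
def touch_artifacts() -> str:
    # Stand-in for a real training step: enough to exercise compilation,
    # scheduling, and artifact storage end to end
    return "ok"

@dsl.pipeline(name="upgrade-smoke")
def smoke_pipeline():
    touch_artifacts()

# e.g. kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
client = kfp.Client(host="http://localhost:8080")
run = client.create_run_from_pipeline_func(smoke_pipeline, arguments={})
print("submitted run:", run.run_id)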

Step 2: Blue-Green Namespaces

Instead of upgrading in-place, we deployed Kubeflow 1.9 to a parallel namespace:

# Deploy 1.9 to kubeflow-new namespace
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-new
  labels:
    version: "1.9"
---
# Deploy the 1.9 manifests here (kustomize build ... | kubectl apply -f -)
# Then run validation tests before touching the live namespace

This let us:

  • Keep the old system running (1.6 in kubeflow namespace)
  • Validate 1.9 independently
  • Smoke test API compatibility (see the probe sketch below)
  • Switch traffic only after validation passed
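
For the API compatibility piece, the probe can be as dumb as hitting both deployments with the same read-only calls and diffing the results. A rough sketch; the port and paths are assumptions based on the standard ml-pipeline service, and it needs to run from inside the cluster so service DNS resolves:

# compat_probe.py -- poke the same endpoints on 1.6 and 1.9 and compare
import requests

HOSTS = {
    "1.6": "http://ml-pipeline.kubeflow.svc.cluster.local:8888",
    "1.9": "http://ml-pipeline.kubeflow-new.svc.cluster.local:8888",
}

for version, host in HOSTS.items():
    for path in ("/apis/v1beta1/healthz", "/apis/v1beta1/pipelines"):
        resp = requests.get(f"{host}{path}", timeout=5)
        print(f"[{version}] GET {path} -> {resp.status_code}")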

Step 3: The Graceful Transition

The real trick is handling in-flight workloads. We implemented a two-phase cutover:

Phase 1: New jobs go to 1.9

Modified our job submission layer to route all new pipelines to the 1.9 APIs:

# In our ML job submission service
import os

def get_kubeflow_host():
    # Route by namespace: the 1.9 deployment lives in kubeflow-new,
    # the 1.6 deployment stays in kubeflow
    if os.getenv('KUBEFLOW_VERSION') == '1.9':
        return 'ml-pipeline.kubeflow-new.svc.cluster.local'
    return 'ml-pipeline.kubeflow.svc.cluster.local'

def submit_job(job_spec):
    host = get_kubeflow_host()
    # Point the pipeline client/API call at `host` and submit job_spec

This meant new experiments started on 1.9 immediately, while existing runs completed on 1.6.
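
The flag flip itself was nothing more exotic than a patch on the submission service’s Deployment. A sketch of that patch; the deployment and container names here are hypothetical stand-ins for our internal service:

# flip_routing.py -- point the submission service at 1.9 by updating its env
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "job-submitter",  # hypothetical container name
        "env": [{"name": "KUBEFLOW_VERSION", "value": "1.9"}],
    }]}}}
}

# Strategic-merge patch: other env vars on the container are preserved,
# and the deployment rolls its pods with the new value
apps.patch_namespaced_deployment(
    name="ml-job-submitter",  # hypothetical deployment name
    namespace="ml-workloads",
    body=patch,
)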

Phase 2: Migrate running jobs

For long-running training jobs, we:

  1. Paused checkpoint uploads to S3 on 1.6
  2. Exported job metadata and configurations
  3. Re-imported to 1.9 instances
  4. Resumed from the same checkpoint

# Export metadata for active jobs (Katib Experiment objects)
kubectl get experiments -n kubeflow -o yaml > experiments-backup.yaml

# Wait for in-flight jobs to checkpoint gracefully (we waited 30 mins).
# The raw dump still carries the old namespace plus server-managed fields
# (uid, resourceVersion, status), so those have to be stripped before the
# objects can be re-created in kubeflow-new; we scripted that step (sketch below).
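
Here’s a sketch of that cleanup-and-reimport step using the kubernetes Python client; the group/version/plural match the Katib Experiment CRD, the rest is specific to our namespaces:

# migrate_experiments.py -- copy Experiment objects into the 1.9 namespace,
# dropping the fields the old API server owned
from kubernetes import client, config

config.load_kube_config()
crd = client.CustomObjectsApi()

GROUP, VERSION, PLURAL = "kubeflow.org", "v1beta1", "experiments"

exported = crd.list_namespaced_custom_object(GROUP, VERSION, "kubeflow", PLURAL)
for item in exported["items"]:
    item.pop("status", None)
    meta = item["metadata"]
    for field in ("namespace", "uid", "resourceVersion",
                  "creationTimestamp", "managedFields"):
        meta.pop(field, None)
    crd.create_namespaced_custom_object(GROUP, VERSION, "kubeflow-new", PLURAL, item)
    print("migrated", meta["name"])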

Step 4: The DNS Cutover (The Scary Part)

Once everything was validated, we updated the service DNS:

# Repoint the stable service name at the 1.9 deployment. A selector can only
# match pods in the Service's own namespace, so instead of patching selectors
# we replaced the old Service with an ExternalName alias into kubeflow-new.
apiVersion: v1
kind: Service
metadata:
  name: ml-pipeline
  namespace: kubeflow
spec:
  type: ExternalName
  externalName: ml-pipeline.kubeflow-new.svc.cluster.local

The magic: because our job submission code used Kubernetes service DNS, it immediately routed to 1.9 without any client changes.
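
Before trusting the cutover, verify it from inside the cluster: the stable name should resolve through the new alias, and the API behind it should still answer. A small sketch; the healthz path and port are assumptions based on the standard ml-pipeline API service:

# verify_cutover.py -- confirm the stable service name now lands on 1.9
import socket
import requests

STABLE = "ml-pipeline.kubeflow.svc.cluster.local"

# The ExternalName alias shows up as a CNAME in cluster DNS, so the stable
# name should now resolve to the ClusterIP of the service in kubeflow-new
print("resolves to:", socket.gethostbyname(STABLE))

# And the API behind the old name should still answer
resp = requests.get(f"http://{STABLE}:8888/apis/v1beta1/healthz", timeout=5)
print("healthz:", resp.status_code)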

Step 5: Validation Gates

We didn’t just hope things worked. For 2 hours, we monitored:

# Watch job success rates
watch kubectl get experiments -A

# Check Kubeflow API latency on the 1.9 pods
kubectl logs -f -l app=ml-pipeline -n kubeflow-new | grep "request_duration"

# Verify checkpoint integrity
aws s3 ls s3://ml-artifacts/checkpoints/ | wc -l
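
If you’d rather script the soak than sit on `watch` for two hours, a polling loop over the Experiment objects is enough. A sketch; the five-minute interval is arbitrary and the condition parsing assumes Katib’s status conventions:

# watch_gates.py -- poll the success gate for the two-hour soak window
import time
from kubernetes import client, config

config.load_kube_config()
crd = client.CustomObjectsApi()

DEADLINE = time.time() + 2 * 60 * 60

while time.time() < DEADLINE:
    exps = crd.list_namespaced_custom_object(
        "kubeflow.org", "v1beta1", "kubeflow-new", "experiments")["items"]
    failed = [e["metadata"]["name"] for e in exps
              if any(c.get("type") == "Failed" and c.get("status") == "True"
                     for c in e.get("status", {}).get("conditions") or [])]
    print(f"{len(exps)} experiments, {len(failed)} failed")
    if failed:
        raise SystemExit(f"gate tripped by {failed}: keep 1.6 around and dig in")
    time.sleep(300)  # arbitrary 5-minute poll interval

print("soak passed: safe to decommission the old namespace")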

Only after the success metrics stayed green did we decommission the old namespace.

The Lessons

1. Blue-green isn’t just for apps—use it for platform upgrades too. Running parallel versions eliminated the binary success/failure scenario.

2. Workload routing is your best friend. Because we had a service layer, not point-to-point connections, the DNS cutover was invisible.

3. Kubernetes namespaces are cheap. Use them for staging. The parallel infrastructure cost us maybe $50/day in staging, which was trivial compared to the risk.

4. Export your state. Always. If something goes sideways, you want a clear recovery path.

We’ve now upgraded to 1.10 and 1.11 using this same pattern with zero incidents. The overhead is real (extra infrastructure, validation time), but the peace of mind is worth every penny.

If you’re staring at a Kubeflow upgrade and sweating, reach out. I’ve got the playbooks.


What’s your production platform upgrade horror story? Drop it in the comments—I’m collecting them for my next post.