Blue-Green and Canary Deployments: Shipping Without Fear

The scariest deploy I ever shipped took eleven minutes to roll back. Eleven minutes of a payment service throwing 500s on roughly 4% of checkout requests while I waited for a kubectl rollout undo to drain old pods, pull the previous image, and pass health checks. We lost real money in those eleven minutes. The bug was trivial — a null check I'd missed in a refactor. The damage wasn't the bug. The damage was that rollback was slow.

That's the whole game. A deploy you can't reverse in seconds is a deploy you're afraid to make, and fear makes you batch changes, which makes the next deploy bigger and scarier. The way out isn't more careful engineers. It's a deployment strategy where reverting is cheaper than thinking.

The four strategies, ranked by blast radius

Every deployment strategy is a different answer to one question: how many users see the new version before you know it's good?

Strategy	Downtime	Rollback speed	Extra cost	Blast radius if broken
Recreate	Full (seconds–minutes)	Slow (redeploy old)	None	100% during window
Rolling	Zero	Slow (roll back pods)	Surge pods (maxSurge, 25% by default)	Grows as pods cycle
Blue-green	Zero	Instant (swap router)	2x environment	100% — but reversible in 1 step
Canary	Zero	Instant (shift weight)	~1 extra replica set	1–5% until promoted

Recreate kills the old version, then starts the new one. There's a gap where nothing serves traffic. It's the right call for exactly one situation: a stateful app that can't run two versions against the same data at once (some single-writer migrations, certain desktop-style backends). Otherwise, never.

Rolling is the Kubernetes default and what most teams ship by accident. It replaces pods a few at a time. Zero downtime, no extra environment — but two things bite you. First, for a stretch you're serving both versions simultaneously, so your old and new code must be mutually compatible. Second, rollback means rolling backward pod by pod, which is exactly the slow path that cost me eleven minutes.

Blue-green and canary both fix the rollback problem by separating "deploy the new version" from "send it traffic." That separation is the entire point, and it's worth understanding why each does it differently.

Blue-green: two full environments, one router

Run two complete production environments. Blue is live. You deploy the new version to green, idle, serving zero real traffic. You smoke-test green against production dependencies. When green looks good, you flip the router — load balancer target group, Service selector, ingress weight — and 100% of traffic moves to green in one atomic change. Blue stays warm.

Something breaks? Flip back. Rollback is a single routing change, sub-second, no image pulls, no pod scheduling. That's the property that removes the fear.

In Kubernetes the cheapest version is a label swap on a Service selector:

# Two Deployments: app-blue (version 1.4.2) and app-green (version 1.5.0)
# The Service decides who is live by matching one label.
apiVersion: v1
kind: Service
metadata:
  name: checkout
spec:
  selector:
    app: checkout
    slot: green   # flip to "blue" to roll back — one field, instant
  ports:
    - port: 80
      targetPort: 8080

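kubectl patch svc checkout -p '{"spec":{"selector":{"slot":"blue"}}}' is your rollback. New connections switch the moment kube-proxy reprograms iptables/IPVS, typically under a second. Connections that are already established, such as HTTP keep-alive sessions, stay pinned to the pods they were opened against until they close, so the last few requests can trail the flip.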
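If you want the flip in a script rather than in shell history, the patch body is easy to generate. A minimal sketch: the helper names (buildSlotPatch, otherSlot, rollbackCommand) are hypothetical, and only the slot label and the checkout Service come from the manifest above.

```typescript
// Hypothetical helpers for the selector flip. Only the "slot" label and the
// "checkout" Service name come from the manifest above; the rest is illustrative.
type Slot = "blue" | "green";

function otherSlot(live: Slot): Slot {
  return live === "blue" ? "green" : "blue";
}

function buildSlotPatch(target: Slot): string {
  // The patch only names the field that changes; the "app: checkout"
  // half of the selector is left untouched by the merge.
  return JSON.stringify({ spec: { selector: { slot: target } } });
}

function rollbackCommand(service: string, live: Slot): string {
  return `kubectl patch svc ${service} -p '${buildSlotPatch(otherSlot(live))}'`;
}

console.log(rollbackCommand("checkout", "green"));
// prints: kubectl patch svc checkout -p '{"spec":{"selector":{"slot":"blue"}}}'
```

Deriving the target from the currently live slot means the same command works in both directions, so the rollback path gets exercised on every deploy.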

The cost is real: you pay for two production-sized environments. For a service running 20 pods, that's 40 during a deploy. Tools like Argo Rollouts mitigate this by only scaling the preview environment up around the cutover rather than running both hot 24/7.

The genuinely hard part of blue-green isn't the routing. It's the database.

The database is shared, and that changes everything

Your router can be blue-green. Your Postgres cannot — both environments hit the same database. So if green ships a migration that drops a column, and you have to roll back to blue, blue's code now queries a column that's gone. Your "instant rollback" just became an outage with extra steps.

The rule that makes blue-green and canary safe: every migration must be backwards-compatible with the currently-live version. The old code has to keep working against the new schema. This forces the expand-contract pattern, and there's no way around it.

Say you're renaming users.email to users.email_address. You cannot do it in one migration. You do it across deploys:

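-- Deploy 1 (expand): add the new column, install a dual-write trigger, then backfill.
-- Old code reads/writes email. New code reads/writes email_address. Both fine.
ALTER TABLE users ADD COLUMN email_address text;

CREATE OR REPLACE FUNCTION sync_email() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    -- Whichever column the writer filled in, copy it to the other.
    NEW.email_address := COALESCE(NEW.email_address, NEW.email);
    NEW.email         := COALESCE(NEW.email, NEW.email_address);
  ELSIF NEW.email_address IS DISTINCT FROM OLD.email_address THEN
    -- New code changed email_address: mirror it into email.
    NEW.email := NEW.email_address;
  ELSIF NEW.email IS DISTINCT FROM OLD.email THEN
    -- Old code changed email: mirror it into email_address.
    -- (A plain COALESCE would keep the stale email_address here.)
    NEW.email_address := NEW.email;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_email_trg
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION sync_email();

-- Backfill after the trigger exists, so rows written in between are not missed.
UPDATE users SET email_address = email WHERE email_address IS NULL;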

Deploy 2 ships code that only uses email_address. Once that's stable and you're certain no live version reads email, deploy 3 (contract) drops the trigger and the old column. Three deploys to rename a column feels absurd the first time. It's also the only version where a mid-rollout rollback doesn't take down production. I've written a whole piece on zero-downtime migrations; the short version is that schema changes and code changes ship on different deploys: additive changes land before the code that needs them, destructive changes land only after no live code depends on what they remove, and at every point the schema works for both the old and the new version.
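The compatibility rule can be written down as a check. What follows is a rough sketch, not a real migration tool: AppVersion, Migration, and isSafeToApply are names invented for illustration, and the schema is reduced to a set of column names.

```typescript
// Sketch: is a schema change safe while these app versions are live?
// All types and function names here are illustrative, not from a real tool.
interface AppVersion {
  name: string;
  columnsUsed: string[]; // columns this version reads or writes
}

interface Migration {
  adds: string[];
  drops: string[];
}

function applyMigration(schema: Set<string>, m: Migration): Set<string> {
  const next = new Set(schema);
  m.adds.forEach((c) => next.add(c));
  m.drops.forEach((c) => next.delete(c));
  return next;
}

// Safe means: every version that may still receive traffic, including the one
// you would roll back to, still finds all of its columns after the migration.
function isSafeToApply(schema: Set<string>, m: Migration, live: AppVersion[]): boolean {
  const next = applyMigration(schema, m);
  return live.every((v) => v.columnsUsed.every((c) => next.has(c)));
}

const usersTable = new Set<string>(["id", "email"]);
const oldCode: AppVersion = { name: "old", columnsUsed: ["id", "email"] };
const oneStepRename: Migration = { adds: ["email_address"], drops: ["email"] };

console.log(isSafeToApply(usersTable, oneStepRename, [oldCode])); // prints: false
```

The one-step rename fails the check because the old code, which is still your rollback target, loses a column it queries. Splitting it into an add-only expand and a later drop-only contract passes at each step.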

Canary: don't bet 100%, bet 5%

Blue-green flips everyone at once. Brave, but you still find out about the bug after all your users do. Canary fixes that: route a small slice of traffic — 5% — to the new version, watch its metrics, and only widen if it behaves. The blast radius of a bad deploy drops from 100% to 5%, and the decision to promote becomes data, not vibes.

The mistake teams make is doing canary by hand: shift to 10%, eyeball Grafana, shift to 25%, get distracted, forget. Automate the analysis. Here's an Argo Rollouts canary that promotes itself on a schedule but runs an automated analysis against Prometheus at each step and aborts if the success rate drops:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
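spec:
  replicas: 20
  # selector and pod template omitted for brevity
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        nginx:
          stableIngress: checkout
      steps:
        - setWeight: 5
        - pause: { duration: 5m }      # soak at 5%
        - analysis:                     # gate: bail if metrics are bad
            templates:
              - templateName: success-rate
        - setWeight: 25
        - pause: { duration: 5m }
        - analysis:                     # same gate before widening to 50%
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: { duration: 10m }
        - analysis:                     # final gate before full promotion
            templates:
              - templateName: success-rate
        - setWeight: 100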

And the analysis template that does the actual judging — this is where automated rollback lives:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.99   # SLO: 99% of requests non-5xx
      failureLimit: 1                         # tolerate 1 bad read; the 2nd -> abort + rollback
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="checkout-canary",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="checkout-canary"}[2m]))

When success-rate drops below 0.99 twice, Argo Rollouts marks the analysis Failed, aborts the rollout, and shifts 100% of traffic back to the stable ReplicaSet — automatically, no human, no pager. That's the system that lets you deploy on a Friday afternoon. You can add a second metric on p99 latency the same way, so a deploy that's correct but 3x slower also rolls itself back.
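To see how a gate like this reaches its verdict, here is a simplified model of the loop. It is not Argo Rollouts' implementation: judge, Gate, and abortAfterFailures are hypothetical names, and the measurements are made-up success rates standing in for Prometheus reads.

```typescript
// Simplified model of a metric gate. NOT Argo Rollouts' actual code.
type Verdict = "Successful" | "Failed";

interface Gate {
  threshold: number;          // minimum acceptable success rate, e.g. 0.99
  abortAfterFailures: number; // hypothetical knob: bad reads that trigger the abort
}

function judge(measurements: number[], gate: Gate): { verdict: Verdict; readsTaken: number } {
  let failures = 0;
  let readsTaken = 0;
  for (const value of measurements) {
    readsTaken++;
    if (value < gate.threshold) failures++;
    if (failures >= gate.abortAfterFailures) {
      // Stop sampling as soon as the failure budget is spent.
      return { verdict: "Failed", readsTaken };
    }
  }
  return { verdict: "Successful", readsTaken };
}

// One dip is tolerated; the second bad read ends the analysis early.
console.log(judge([0.999, 0.98, 0.995, 0.97, 0.999], { threshold: 0.99, abortAfterFailures: 2 }));
```

Tolerating a single bad read keeps one noisy scrape from killing a healthy rollout, while the early return means a genuinely broken canary is pulled before the remaining samples are taken.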

Two things people get wrong here. First, soak time matters more than the number of weight steps. A 5% canary that runs for 5 minutes catches errors a 5% canary that runs for 20 seconds never sees, because low-frequency code paths need traffic volume to surface. Second, your metric query must scope to the canary service, not the aggregate. At 5% weight, a canary failing 10% of its requests drags an otherwise perfect global success rate down to only 99.5%, which still clears a 99% gate, so the check never fires.
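The dilution is easy to verify with arithmetic. A minimal sketch with illustrative numbers (5% canary weight, fully healthy stable traffic, a canary failing 10% of its requests); the function name is made up for this example.

```typescript
// Success rate reported by a query that mixes stable and canary traffic.
// Weights and rates are fractions in [0, 1].
function aggregateSuccessRate(canaryWeight: number, canaryRate: number, stableRate: number): number {
  return canaryWeight * canaryRate + (1 - canaryWeight) * stableRate;
}

const gateThreshold = 0.99;
const canaryOnly = 0.9; // what a canary-scoped query would report
const blended = aggregateSuccessRate(0.05, canaryOnly, 1.0);

console.log(`canary-scoped: ${canaryOnly.toFixed(3)}, passes gate: ${canaryOnly >= gateThreshold}`);
console.log(`aggregate:     ${blended.toFixed(3)}, passes gate: ${blended >= gateThreshold}`);
```

The canary-scoped number fails the gate; the blended one passes it. Same deploy, same bug, and the only difference is the label selector in the query.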

Feature flags: the rollback that needs no deploy

Here's the thing senior engineers internalize and juniors fight: the fastest deployment strategy is often not deploying.

A canary still ships a binary and shifts pods. A feature flag ships the new code to 100% of servers dark — present but disabled — and then you turn it on for 5% of users at runtime. Rollback is flipping a boolean in a dashboard. No rebuild, no rollout, no pod churn. Sub-second, and it can target by user, region, or plan tier in ways traffic-weight canaries can't.

// Decouple "is this code deployed" from "is this code on".
// LaunchDarkly / OpenFeature-style gate, evaluated per request.
import { ldClient } from "./flags";
 
export async function getCheckoutQuote(user: User, cart: Cart) {
  const useNewPricingEngine = await ldClient.variation(
    "checkout.new-pricing-engine",
    { kind: "user", key: user.id, country: user.country, plan: user.plan },
    false, // default: old engine, if the flag service is unreachable
  );
 
  if (useNewPricingEngine) {
    return computeQuoteV2(user, cart); // new path, ramped 1% -> 100%
  }
  return computeQuoteV1(user, cart);   // battle-tested path, instant fallback
}

The default value being false is not an accident — if your flag provider has an outage, every request falls back to the proven path. Use OpenFeature (the CNCF vendor-neutral SDK spec) so you're not married to one provider, and back it with LaunchDarkly, Flagsmith, Unleash, or a homegrown Redis-backed flag table.
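If you go the homegrown route, the core of a percentage ramp is deterministic bucketing: hash the user into a fixed bucket and compare it to the rollout percentage. The sketch below is one way to do it, not how any particular provider does it; fnv1a, bucketFor, and isEnabled are illustrative names, and the hash is an FNV-1a-style hash over UTF-16 code units.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units. Deterministic across
// processes, so a given user lands in the same bucket on every server.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Maps (flag, user) to a bucket in [0, 100). Salting with the flag key keeps
// the same users from always being the first 5% of every flag.
function bucketFor(flagKey: string, userId: string): number {
  return fnv1a(`${flagKey}:${userId}`) % 100;
}

function isEnabled(flagKey: string, userId: string, rolloutPercent: number): boolean {
  return bucketFor(flagKey, userId) < rolloutPercent;
}

console.log(isEnabled("checkout.new-pricing-engine", "user-42", 5));
```

Because the bucket is fixed per user, raising the percentage only ever adds users. Nobody who saw the new pricing engine at 5% gets flipped back to the old one at 25%.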

The tradeoff is honest: flags are code debt. Every flag is a branch you have to test, and a stale flag left on for a year is a landmine. The discipline is that every flag has a lifespan: ship it, ramp it, clean it up. I delete flags in the same sprint they hit 100%. Teams that skip the cleanup end up with thousands of dead conditionals and no idea which combinations are even reachable.
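Cleanup is easier to enforce when something nags you. Here is a minimal sketch of a stale-flag report, assuming you can export each flag's rollout percentage and the date it reached 100%. The FlagRecord shape and the 14-day cutoff are assumptions, not any provider's API.

```typescript
// Hypothetical export format for flag state; not any provider's real schema.
interface FlagRecord {
  key: string;
  rolloutPercent: number;
  reachedFullRolloutAt?: Date; // unset until the flag hits 100%
}

// Returns the keys of flags that have been fully rolled out for longer than
// maxAgeDays and are therefore candidates for deletion from the code.
function staleFlags(flags: FlagRecord[], now: Date, maxAgeDays: number): string[] {
  const maxAgeMs = maxAgeDays * 24 * 60 * 60 * 1000;
  return flags
    .filter((f) => {
      if (f.rolloutPercent < 100 || !f.reachedFullRolloutAt) return false;
      return now.getTime() - f.reachedFullRolloutAt.getTime() > maxAgeMs;
    })
    .map((f) => f.key);
}

const report = staleFlags(
  [
    { key: "checkout.new-pricing-engine", rolloutPercent: 100, reachedFullRolloutAt: new Date("2024-01-01T00:00:00Z") },
    { key: "checkout.redesigned-flow", rolloutPercent: 25 },
  ],
  new Date("2024-06-01T00:00:00Z"),
  14, // assumed sprint length in days
);
console.log(report); // prints: [ 'checkout.new-pricing-engine' ]
```

Run something like this in CI and fail the build, or just open a ticket, when the list is non-empty.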

Use canary for infrastructure-level changes (new runtime, dependency bumps, the whole binary). Use flags for product-level changes (new pricing logic, a redesigned flow). They compose: ship behind a flag, canary the deploy, ramp the flag.

Health checks are the floor, not the ceiling

None of this works if Kubernetes thinks a broken pod is healthy. A readinessProbe that hits /healthz and returns 200 whenever the process is alive tells you almost nothing: the process can be up and still throw 500s on every real request. Probe something that exercises a real dependency:

readinessProbe:
  httpGet:
    path: /healthz/ready   # checks DB connectivity + downstream reachability
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz/live    # process-internal only; deadlock detection
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

Keep liveness and readiness distinct. Liveness restarts a wedged process. Readiness pulls a pod out of rotation without killing it. Conflating them gives you crash loops during a transient DB blip. But understand the limit: health checks catch infrastructure failure. They will not catch a pod that's perfectly healthy and computing wrong prices. That's what the canary's metric analysis is for. Health checks are the floor. SLO-based automated rollback is the ceiling.
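On the application side, the split looks roughly like this. It is a sketch of the decision logic only, with no HTTP framework attached; the DependencyChecks shape and both function names are assumptions, and how you actually probe the database or detect a wedged process is up to your service.

```typescript
// Results of whatever dependency probes the service runs (assumed shape).
interface DependencyChecks {
  database: boolean;
  downstream: boolean;
}

// Readiness answers "should I receive traffic right now?"
// Any failed dependency takes the pod out of rotation without restarting it.
function readinessStatus(deps: DependencyChecks): number {
  return deps.database && deps.downstream ? 200 : 503;
}

// Liveness answers "is this process wedged?" It deliberately ignores
// dependencies, so a database blip drains traffic instead of causing restarts.
function livenessStatus(processResponsive: boolean): number {
  return processResponsive ? 200 : 503;
}

// During a transient database outage the pod is alive but not ready.
const duringBlip: DependencyChecks = { database: false, downstream: true };
console.log(readinessStatus(duringBlip), livenessStatus(true)); // prints: 503 200
```

The last line is the case that matters: the pod leaves the load balancer, keeps its warm caches and connections, and rejoins when the database comes back.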

The decision framework

When someone asks which strategy to use, I run down this list:

Can two versions run against your data at once? No → you're stuck with recreate and a maintenance window. Fix that first; it's a design smell.
Is rollback speed your top concern, cost secondary? → Blue-green. One routing flip, sub-second revert, at 2x environment cost during the cutover.
Do you want to limit blast radius and let metrics gate promotion? → Canary with automated analysis. This is my default for stateless services.
Is the risky thing a product behavior, not infrastructure? → Feature flag. Ship dark, ramp by cohort, kill instantly.
Always, regardless of strategy: migrations are expand-contract and backwards-compatible. Schema lands before code. The schema works against both versions.
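The same checklist as a function, if you like your runbooks executable. This is a sketch: the Situation fields are my paraphrase of the questions above, and the precedence between the flag question and the blue-green question is an assumption, since the list doesn't say which wins when both apply.

```typescript
type Strategy = "recreate" | "blue-green" | "canary" | "feature-flag";

// Paraphrase of the checklist questions; field names are illustrative.
interface Situation {
  twoVersionsCanShareData: boolean;
  riskIsProductBehavior: boolean;
  rollbackSpeedOverCost: boolean;
}

function chooseStrategy(s: Situation): Strategy {
  // If two versions cannot coexist on the same data, nothing else is possible.
  if (!s.twoVersionsCanShareData) return "recreate";
  // Assumed precedence: a product-level risk is handled with a flag first.
  if (s.riskIsProductBehavior) return "feature-flag";
  if (s.rollbackSpeedOverCost) return "blue-green";
  // Default for stateless services: canary with automated analysis.
  return "canary";
}

console.log(
  chooseStrategy({ twoVersionsCanShareData: true, riskIsProductBehavior: false, rollbackSpeedOverCost: false }),
); // prints: canary
```

The expand-contract rule isn't a branch here because it isn't a choice. It applies on every path except recreate, and even there it's what gets you off recreate.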

The mid-level instinct is to obsess over the rollout mechanism. The senior instinct is to obsess over the rollback. Make reverting boring — a label swap, a traffic weight, a boolean — and you stop batching, you ship smaller, and the deploys get less scary precisely because each one matters less. Eleven minutes of downtime taught me that the goal was never a flawless deploy. It was a reversible one.