GitOps with Argo CD: Declarative Deployments Done Right

The worst production incident I ever cleaned up was caused by a kubectl edit. Someone bumped a replica count by hand at 2am to ride out a traffic spike, it worked, and then everyone forgot. Three weeks later a routine deploy reset the replicas back to what the manifest in git said, the service got cut in half during peak, and we spent forty minutes staring at dashboards trying to understand why a deploy that "changed nothing" took down half the fleet.

The cluster and git disagreed, and nobody knew. That gap — between what's running and what's committed — is the entire problem GitOps exists to kill.

What GitOps actually means

GitOps is a simple claim with sharp consequences: git is the single source of truth for what runs in your cluster, and a controller continuously reconciles the live state to match git. Not "git is where we keep the YAML." Git is the desired state, full stop. If it isn't in a commit, it isn't real, and the controller will undo it.

That continuous reconciliation loop is the part people skip. A normal CI/CD pipeline runs kubectl apply once, at deploy time, and then walks away. GitOps never walks away. The controller wakes up every few minutes, diffs the cluster against git, and acts on the difference. My 2am replica edit would have been reverted within minutes, with an alert, instead of silently lurking for three weeks.
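
To make the loop concrete, here is a toy sketch in shell. It is not how Argo CD is implemented: two files stand in for git (desired) and the cluster (live), and the reconcile function and file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy reconciliation pass: files stand in for git (desired) and the cluster (live).
set -euo pipefail

# reconcile DESIRED LIVE: make LIVE match DESIRED and report what happened.
reconcile() {
  local desired="$1" live="$2"
  if cmp -s "$desired" "$live"; then
    echo "in-sync"
  else
    cp "$desired" "$live"   # the "apply" step: desired state always wins
    echo "drift-corrected"
  fi
}

workdir="$(mktemp -d)"
echo "replicas: 3" > "$workdir/desired.yaml"   # what git says
echo "replicas: 3" > "$workdir/live.yaml"      # what the cluster runs

reconcile "$workdir/desired.yaml" "$workdir/live.yaml"   # nothing to do
echo "replicas: 6" > "$workdir/live.yaml"                # the 2am manual edit
reconcile "$workdir/desired.yaml" "$workdir/live.yaml"   # edit is reverted
reconcile "$workdir/desired.yaml" "$workdir/live.yaml"   # nothing to do again
```

A real controller runs a pass like this on a timer and on events, and compares structured objects rather than bytes, but the shape is the same: diff, then act on the difference.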

Push versus pull, and why pull wins

The traditional model is push. Your CI runner — GitHub Actions, GitLab CI, Jenkins — holds cluster credentials, builds your image, and pushes manifests into the cluster with kubectl apply or helm upgrade.

# The push model: CI reaches into the cluster
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Apply to prod
        env:
          KUBECONFIG_DATA: ${{ secrets.PROD_KUBECONFIG }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > kubeconfig
          KUBECONFIG=./kubeconfig kubectl apply -f k8s/

Look at what that requires. Your CI system holds admin credentials for production. Every PR author with the ability to edit a workflow file is one curl away from exfiltrating that kubeconfig. Your cluster's API server is reachable from your CI network. And the deploy is a fire-and-forget event — there's no record of what's actually running after the job exits, only what you told it to do once.

The pull model inverts this. A controller inside the cluster watches a git repo and pulls changes in. CI never touches the cluster. Its only job is to build an image and write a new tag into a git commit.

Concern	Push (CI applies)	Pull (GitOps)
Cluster credentials	Live in CI	Never leave the cluster
API server exposure	Reachable from CI	Can be fully private
Deploy record	Job logs, ephemeral	Git history, permanent
Drift detection	None	Continuous
Self-healing	No	Yes
Rollback	Re-run old pipeline	`git revert`

The credential point alone sells it for me. In the pull model the cluster's API server can be private — no public endpoint, no inbound firewall rule for CI. The controller reaches out to git and your container registry, both of which are designed to be exposed. Your blast radius shrinks dramatically.

Argo CD's model: the Application resource

Argo CD (a CNCF graduated project — read argo-cd.readthedocs.io for the canonical docs) is the controller I reach for. Flux is the other strong choice; the concepts transfer, Argo CD just ships a UI that makes drift and health legible to people who don't live in kubectl.

The core abstraction is the Application: a custom resource that says "take the manifests at this path in this repo and make this namespace look like them."

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-api
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: payments
  source:
    repoURL: https://github.com/acme/deploy-config.git
    targetRevision: main
    path: apps/checkout-api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: checkout
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # revert manual cluster changes
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

Two status fields drive everything Argo CD does:

Sync status: Synced or OutOfSync. Does the live state match the manifests at targetRevision? This is the drift signal.
Health status: Healthy, Progressing, Degraded, Suspended, Missing, or Unknown. Are the resources actually working? Argo CD knows how to read a Deployment's rollout and whether a LoadBalancer Service or an Ingress has been assigned an address, and it ships custom health checks for common CRDs.
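
One way to read the two fields together is sketched below. The verdict names (settled, drifted, rolling-out, broken) are invented for this example and are not Argo CD terms. The kubectl command in the comment assumes the checkout-api Application above lives in the argocd namespace.

```shell
#!/usr/bin/env bash
set -euo pipefail

# app_state SYNC HEALTH: collapse the two statuses into one verdict.
app_state() {
  local sync="$1" health="$2"
  if [ "$sync" != "Synced" ]; then
    echo "drifted"        # live state differs from git, whatever the health
  elif [ "$health" = "Healthy" ]; then
    echo "settled"        # matches git and is working
  elif [ "$health" = "Progressing" ]; then
    echo "rolling-out"    # matches git, still converging
  else
    echo "broken"         # matches git but is not working
  fi
}

# Against a real cluster the inputs could come from the Application's status:
#   kubectl -n argocd get application checkout-api \
#     -o jsonpath='{.status.sync.status} {.status.health.status}'
app_state Synced Healthy
app_state OutOfSync Healthy
app_state Synced Degraded
```

The last case is the one worth remembering: Synced and Degraded together mean the cluster faithfully matches git and git describes something that doesn't work, so the fix is a commit, not a re-sync.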

Three settings in syncPolicy (two under automated, one under syncOptions) are where GitOps stops being a diff tool and starts being a control loop:

prune: true means a resource you delete from git gets deleted from the cluster. Without it, removed manifests leak — orphaned ConfigMaps and Services pile up until someone audits.
selfHeal: true is my 2am incident's antidote. A manual kubectl edit flips the app to OutOfSync, and Argo CD immediately re-applies git. You physically cannot drift for long.
ApplyOutOfSyncOnly=true keeps each sync cheap by only touching resources that actually differ, which matters a lot once an app owns a few hundred objects.

Turn selfHeal on deliberately. It is exactly the behavior you want in production and exactly the behavior that will fight you during a hands-on incident. More on that below.

Repository structure: config, not rendered manifests

The repo layout decision that bites teams later is mixing application source with deployment config. Keep them in separate repos. Your app repo holds code and a Dockerfile. A deploy-config repo holds Kustomize bases and overlays. CI builds the image and commits a new tag into the config repo; Argo CD takes it from there.

I use Kustomize overlays for environments because they keep the diff between staging and production to the handful of things that genuinely differ — replica counts, resource limits, the image tag.

# apps/checkout-api/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: checkout
 
resources:
  - ../../base
 
replicas:
  - name: checkout-api
    count: 6
 
images:
  - name: ghcr.io/acme/checkout-api
    newTag: "2026.3.18-a1b9f04"   # CI writes this line, nothing else
 
patches:
  - target:
      kind: Deployment
      name: checkout-api
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 1Gi

The CI job for the app does exactly one write to this file — it updates newTag. That single-line commit is the deploy. The git history of your config repo becomes a complete, timestamped, attributed log of every production change, with a diff you can read in a PR. That audit trail is worth the setup on its own; I've answered "what changed at 14:32 UTC" with git log instead of grepping CI logs more times than I can count.
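
A minimal sketch of that one write, assuming the overlay has exactly one newTag line and that CI has push access to the config repo. bump_tag is a hypothetical helper, not part of any tool; kustomize edit set image is the built-in alternative if the kustomize binary is available in CI.

```shell
#!/usr/bin/env bash
set -euo pipefail

# bump_tag FILE NEW_TAG: rewrite the single newTag line and leave the rest alone.
# Note: any trailing comment on the newTag line is dropped.
bump_tag() {
  local file="$1" tag="$2"
  # Refuse tags with characters that could break the sed expression or the YAML.
  [[ "$tag" =~ ^[A-Za-z0-9._-]+$ ]] || { echo "invalid tag: $tag" >&2; return 1; }
  # Refuse files where the "exactly one newTag line" assumption does not hold.
  [ "$(grep -c '^[[:space:]]*newTag:' "$file")" -eq 1 ] \
    || { echo "expected exactly one newTag line in $file" >&2; return 1; }
  sed -i.bak -E "s|^([[:space:]]*newTag:).*|\1 \"$tag\"|" "$file"
  rm -f "$file.bak"
}

# Demo on a scratch file; the second tag is a made-up "next build".
demo="$(mktemp)"
printf 'images:\n  - name: ghcr.io/acme/checkout-api\n    newTag: "2026.3.18-a1b9f04"\n' > "$demo"
bump_tag "$demo" "2026.3.19-5e0c7aa"
grep newTag "$demo"

# In CI this would be followed by something like:
#   git commit -am "bump checkout-api to $IMAGE_TAG" && git push origin main
```

Validating the tag before touching the file keeps a malformed build variable from turning into a broken commit in the config repo.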

Secrets: never the plaintext value, always the encrypted one

The obvious objection: if everything lives in git, what about secrets? You do not put plaintext secrets in git. You put encrypted secrets in git, and decrypt them inside the cluster. Two mature options:

Sealed Secrets (Bitnami). A controller in the cluster holds a private key. You encrypt with kubeseal against the public key and commit the resulting SealedSecret, which is useless to anyone without the cluster's key.
SOPS (originally from Mozilla, now a CNCF project) with age or a cloud KMS. You encrypt values in YAML, commit the encrypted file, and an Argo CD config management plugin such as the Argo CD Vault Plugin decrypts at sync time.

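# Sealed Secrets: encrypt locally, commit the result, never the plaintext.
# The namespace is set explicitly because a SealedSecret is, by default,
# bound to the namespace and name it was sealed for.
kubectl create secret generic stripe-key \
  --namespace checkout \
  --from-literal=api-key="$STRIPE_LIVE_KEY" \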
  --dry-run=client -o yaml \
  | kubeseal --controller-namespace kube-system --format yaml \
  > apps/checkout-api/overlays/production/sealed-stripe-key.yaml
 
git add apps/checkout-api/overlays/production/sealed-stripe-key.yaml
git commit -m "rotate stripe live key"

This is the GitOps-shaped version of the secrets-management discipline I've written about before: the secret's lifecycle lives in git, but the plaintext only ever exists inside the cluster where the decryption key lives. For high-churn secrets, I lean toward a Vault/External Secrets setup where git holds only a reference and the live value is pulled at runtime — fewer commits, and rotation doesn't touch the deploy repo at all.

Rollback is a git revert

Here's the part that makes GitOps feel like cheating. A bad deploy is a bad commit, so a rollback is git revert.

# Production is on fire after the last deploy. Find it and undo it.
git -C deploy-config log --oneline -5
# a1b9f04 bump checkout-api to 2026.3.18-a1b9f04   <-- the bad one
# 7c3e221 raise checkout-api memory limit to 1Gi
# ...
 
git -C deploy-config revert --no-edit a1b9f04
git -C deploy-config push origin main
# Argo CD detects OutOfSync within ~3 min (or instantly with a webhook)
# and rolls the cluster back to the previous image tag.

No special rollback tooling, no "re-run the deploy job with an older SHA," no remembering which Helm release revision was good. You revert the commit, push, and the controller converges the cluster back. And critically, the revert is itself a commit — the rollback is in the audit trail too, attributed and timestamped, instead of being an out-of-band manual action nobody recorded.
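
The sequence can be rehearsed without a cluster. The sketch below builds a throwaway repo, commits a good and a bad tag, and reverts the bad one. The 2026.3.17-0c4d2e1 tag and the commit_tag helper are made up for the demo, and it only covers the git half: in a real setup Argo CD would pick up the revert and roll the cluster back.

```shell
#!/usr/bin/env bash
set -euo pipefail

repo="$(mktemp -d)"
git -C "$repo" init -q
# Local identity so commits work on a machine with no global git config.
git -C "$repo" config user.email "demo@example.com"
git -C "$repo" config user.name "Demo"

# commit_tag TAG: record TAG as the deployed image tag in one commit.
commit_tag() {
  echo "newTag: \"$1\"" > "$repo/kustomization.yaml"
  git -C "$repo" add kustomization.yaml
  git -C "$repo" commit -q -m "bump checkout-api to $1"
}

commit_tag "2026.3.17-0c4d2e1"   # hypothetical last good deploy
commit_tag "2026.3.18-a1b9f04"   # the bad one

# Roll back by reverting the most recent commit.
git -C "$repo" revert --no-edit HEAD >/dev/null

cat "$repo/kustomization.yaml"          # file is back on the good tag
git -C "$repo" rev-list --count HEAD    # the revert is itself a commit
```

Running this once on a laptop is a cheap way to see that the revert adds history rather than rewriting it, which is what keeps the audit trail intact.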

If you want it instant rather than waiting on the reconcile interval, wire a git webhook to Argo CD so a push triggers an immediate refresh. I run a ~3 minute poll as the safety net and a webhook for speed; the poll is what picks up a commit when a webhook delivery gets lost.

When GitOps is worth it, and when it's overkill

I don't reach for Argo CD on every project. The setup has real cost: a controller to run and upgrade, a second repo to maintain, a Kustomize or Helm structure your team has to actually understand, and a selfHeal loop that will block you mid-incident if you forget to pause the app before hot-patching. (When you genuinely need to hand-edit during an outage, run argocd app set <app> --sync-policy none first, make the fix, then commit the same fix to git and turn automated sync back on.)
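
A sketch of that pause-and-resume sequence, wrapped in a dry-run helper so it prints commands by default instead of touching anything. The run, pause_sync, and resume_sync names are hypothetical. The flags shown should match current argocd releases, but verify them with argocd app set --help against your version.

```shell
#!/usr/bin/env bash
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of running them

# run CMD...: echo the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Stop automated sync so selfHeal does not revert a manual hot-patch.
pause_sync() { run argocd app set "$1" --sync-policy none; }

# Restore automated sync with the same prune and selfHeal behavior as before.
resume_sync() { run argocd app set "$1" --sync-policy automated --self-heal --auto-prune; }

pause_sync checkout-api
# ... hot-patch by hand, then commit the same change to the config repo ...
resume_sync checkout-api
```

Set DRY_RUN=0 to execute for real. If the Application manifest is itself synced from git by a parent app, the pause may be reverted, so in that setup the policy change probably has to go through git as well.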

Use this to decide:

Reach for GitOps when:

You run more than one environment, or more than a couple of services on Kubernetes.
Multiple people deploy, and you need an audit trail of who changed what, when.
You want a private API server with no cluster credentials in CI.
Drift is a real risk because people have kubectl access to prod.

Skip it (for now) when:

You have a single app, a single environment, and one or two trusted operators. A plain kubectl apply in CI is honest and cheaper.
You're not on Kubernetes. GitOps tooling assumes a reconcilable declarative API; for raw VMs, Terraform with a remote backend and drift detection gets you most of the benefit.
You're still pre-product-market-fit and changing your deployment shape weekly. Add GitOps when the shape stabilizes, not before.

The heuristic I use: the moment a human can change production in a way that isn't recorded in git, you've outgrown push deploys. Until then, GitOps is overhead you're paying for discipline you don't yet need.

Checklist to get to your first synced Application

Create a deploy-config repo, separate from app code. Bases in base/, environments in overlays/.
Install Argo CD into an argocd namespace; lock down RBAC before exposing the UI.
Define one Application per service-environment pair, starting with selfHeal: false until you trust the manifests.
Move secrets to Sealed Secrets or SOPS before committing anything sensitive — never a plaintext Secret, not even once, since git remembers.
Make CI's only deploy step a one-line newTag bump committed to the config repo.
Flip selfHeal: true and prune: true once a few deploys have gone clean.
Document the rollback as git revert <sha> && git push. Put it in the runbook. Practice it once on staging so nobody learns it during an incident.

Get those in place and your cluster stops being a thing you mutate and starts being a thing that converges. The difference shows up the first time someone runs kubectl edit at 2am and Argo CD quietly puts it back — and tells you.