Reliable Deployments for a Large Kubernetes Fleet

Fernando Crespo Gravalos

Principal Software Engineer, Cloud & Containers

Lessons from Our Kubernetes CD Journey

If you’ve run Kubernetes past a few clusters, you’ve probably hit the same scaling issues we did: once you’re dealing with dozens of clusters and components, the operational load starts to pile up. Every release affects multiple environments, and even simple changes often require significantly more engineering time than expected.

Fastly’s internal Application Delivery Platform includes roughly 45–50 in-house and open-source components. As the platform expanded globally, the sheer coordination required to keep a large Kubernetes system healthy and up to date became a full-time job.

GitOps helped us maintain consistency early on, but at scale, it exposed gaps we couldn’t ignore. What we really lacked were two things Kubernetes doesn’t provide out of the box: an orchestration layer to manage multi-cluster rollouts, and automated validation to reduce manual checks. Without those, even simple updates required human intervention.

We ultimately built a lightweight orchestration layer on top of ArgoCD to automate what had become an unsustainable workflow. Here is how we did it.


Where We Started

Kubernetes is the foundational layer of Fastly’s internal Application Delivery Platform, which offers a wide range of capabilities: automatic TLS management (via Certainly), secrets management, service mesh, observability, Layer-7 HTTP routing, and more.

As our platform expanded, what used to be straightforward to manage started requiring far more manual work than our team could sustain. A typical workflow looked like:

for env in all_environments:
  open_pull_request()
  merge_if_diff_looks_good() 
  check_dashboards_logs_metrics_alerts()

Sometimes, if we were feeling brave enough, we would batch changes:

for env in all_non_prod_environments:
  open_pull_request()
  merge_if_diff_looks_good() 
  check_dashboards_logs_metrics_alerts()

for env in all_prod_environments:
  open_pull_request()
  merge_if_diff_looks_good() 
  check_dashboards_logs_metrics_alerts()

However, this workflow quickly became a bottleneck for platform growth and a reliability risk that drained team capacity. What we really needed was a standardized, automated flow to deploy applications safely and quickly:

open_pull_request()
merge_if_diff_ok_and_tests_pass() 
for env in all_environments: 
  deploy() 
  validate()


What “Good” Continuous Delivery Required

At our scale, releasing software meant coordinating dozens of components across multiple clusters, regions, and environments. We needed a reliable process that could handle progressive rollouts and validation gates.

These were the key features for the new setup:

  • Multi-Cluster Aware: This is where we clashed head-on with pure GitOps. We can’t simply deploy to all clusters all the time, nor babysit releases from one cluster to another. We wanted an end-to-end pipeline where releases are promoted to the next stage based on health signals.

  • Health Validation: It should allow us to run health checks that signal the pipeline to either continue or abort.

  • Automation Flow: Change once, propagate to all environments.

  • Flexible: No two applications are identical, and no two changes are identical, so the system should let us adapt the pipeline steps so engineers feel comfortable with each rollout.

ArgoCD, a modern GitOps-based Continuous Delivery platform for Kubernetes, met these requirements well enough to get started.


How We Used ArgoCD

ArgoCD renders manifests (using Helm, Kustomize, or plain YAML) and caches them in Redis before applying them — or “syncing” them. ArgoCD doesn’t actually manage Helm releases; it simply renders the manifests with helm template and applies them directly to the cluster.

ArgoCD automatically generates multiple Application objects from an ApplicationSet template, whose parameters are defined by static or dynamic generators.

We rely heavily on the cluster generator, which produces these parameters based on cluster labels. This has been a blessing for two main reasons:

  1. Dynamic metadata injection. We can pass cluster metadata as Helm values, keeping all cluster information centralized. Register the cluster in ArgoCD, and every app automatically knows its environment.

  2. Single definition of truth. We only need to define each app once — the generator handles the rest.

Another interesting feature of the ApplicationSet controller is Progressive Syncs, which syncs OutOfSync Applications gradually across clusters. It fits perfectly with large-scale multi-cluster rollouts, but there’s one catch: it’s still an alpha feature.

Our architecture looked fairly simple: we defined ApplicationSets, ArgoCD generated one Application per cluster, and each was synced in order using the rollingSync (Progressive Syncs) strategy.
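To make this concrete, here is a sketch of what such an ApplicationSet can look like, combining the cluster generator with the rollingSync strategy. All names, labels, and URLs below are illustrative, not our actual configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: dummy
spec:
  generators:
    # Cluster generator: one Application per registered cluster
    # matching these labels; cluster metadata becomes parameters.
    - clusters:
        selector:
          matchLabels:
            appName: dummy
  strategy:
    # Progressive Syncs: sync test clusters first, then production.
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: env
              operator: In
              values: [test]
        - matchExpressions:
            - key: env
              operator: In
              values: [prd]
  template:
    metadata:
      name: 'dummy-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://example.com/platform.git
        path: charts/dummy
        helm:
          values: |
            environment: '{{metadata.labels.env}}'
      destination:
        server: '{{server}}'
```

Note how the cluster labels flow into Helm values via `{{metadata.labels.env}}` — this is the dynamic metadata injection described above.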


Validating the Changes

Since running a full end-to-end pipeline from test to production in a shared platform can be risky, we built and adopted several tools and frameworks to validate our changes at different stages of the delivery lifecycle.

During the Continuous Integration phase, we use Terratest to test the app's infrastructure and ensure that the Helm chart renders correctly, validating specific configurations relevant to our platform. For critical apps, we even spin up a local kind cluster and run full end-to-end tests to verify that the Helm chart installs and the workload behaves as expected.

To support a wide range of test scenarios, we adopted Behave, a Behavior-Driven Development framework that uses Gherkin syntax to describe test scenarios. These scenarios are backed by Python step implementations. We built our own step library, so in most cases we just need to write Gherkin — which is basically plain English.
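A scenario written against such a step library might look like this (the step phrasing and hostname are purely illustrative, not our actual steps):

```gherkin
Feature: Ingress smoke test

  Scenario: HTTP routing works after a sync
    Given the "dummy" application is deployed in "test"
    When I send a GET request to "https://dummy.test.example.com/healthz"
    Then the response status code is 200
```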

On top of that, when a Pull Request is opened, we run a custom tool to post ArgoCD diffs as comments in the Pull Request. While there are a few off-the-shelf options for this, none of them matched our needs (some were unmaintained, others made strong assumptions about app structure). That’s why we built it in-house. This is how it works:

  1. Fetch the changed ArgoCD Applications and their live manifests using the ArgoCD API.

  2. Retrieve the Helm values and parameters used to render those manifests.

  3. Run helm template locally, exactly as ArgoCD would.

  4. Compute the diff and post it to GitHub (or to the terminal, if running locally).
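The final step is ordinary text diffing. A minimal sketch of that piece, assuming the live and rendered manifests have already been fetched as strings (the `manifest_diff` helper is hypothetical, not our actual tool):

```python
import difflib


def manifest_diff(live: str, rendered: str, app: str) -> str:
    """Return a unified diff between an app's live manifests and the
    manifests rendered locally with helm template."""
    return "\n".join(
        difflib.unified_diff(
            live.splitlines(),
            rendered.splitlines(),
            fromfile=f"{app} (live)",
            tofile=f"{app} (rendered)",
            lineterm="",
        )
    )


# Example: a replica count change shows up as a one-line diff.
live = "replicas: 2\nimage: app:v1"
rendered = "replicas: 3\nimage: app:v1"
print(manifest_diff(live, rendered, "dummy"))
```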

At the Deployment phase, each app rolls out gradually — starting in our test clusters and progressing through the various environments until production. To control this progression, we rely on Post Sync Hooks that trigger:

  • Behave end-to-end tests.

  • Prometheus metric analysis, via a lightweight Python tool we built. We couldn’t use existing canary analysis frameworks since we can’t perform canaries on the platform components themselves — things like ingress controllers, service mesh, or distributed systems where blue/green strategies are impractical.

  • For some services, we run Locust to simulate production traffic.
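The metric analysis boils down to a simple gate: query recent samples and decide whether the rollout may proceed. A toy version of that decision (our real tool queries Prometheus over HTTP; this helper is hypothetical):

```python
def metric_gate(samples: list[float], threshold: float) -> bool:
    """Pass the gate only if every recent sample (e.g. a 5xx error
    ratio over the soak window) stays below the threshold."""
    return bool(samples) and max(samples) < threshold


# Healthy rollout: error ratios stay well below 1%.
assert metric_gate([0.001, 0.002, 0.0015], threshold=0.01)
# A spike aborts the pipeline.
assert not metric_gate([0.001, 0.05], threshold=0.01)
```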

Everything was great… until it wasn’t 😅


Where the Approach Broke Down

At first, Progressive Syncs felt like the missing piece. Gradual orchestration of ArgoCD Applications, with support for post-sync hooks we could use for health validation — exactly what we wanted!

But one day you wake up, grab your coffee, and get ready for your Ingress Controller upgrade… Validations are in place, tests are solid, and you have a pipeline that goes all the way from test to prod. You merge your Pull Request, and suddenly realize that the production Ingress app started syncing at the same time as the test one.

We reported this issue, but from the maintainer’s point of view, it doesn’t have an easy fix. The root cause is architectural: there’s currently no way for the ApplicationSet controller to wait until all applications in its group have been reconciled before proceeding. That means even with Progressive Syncs enabled, the synchronization order can’t be guaranteed once ArgoCD’s reconciliation loop takes over. The ApplicationSet controller is responsible for both generating and managing rollout progression. In environments with numerous applications, this approach unfortunately induces race conditions.

We contributed Pull Requests, joined community meetings, and opened Slack threads in the official channels. Over time, we realized the feature wasn’t actively maintained. Meanwhile, the ArgoCD core team seemed to be moving toward Source Hydrator, combined with Git branch promotion, as the way to manage multi-cluster rollouts. However, that feature is still alpha — and honestly, we don’t feel comfortable building pipelines that depend on branch-based promotions.

That is why we decided to build our own orchestration layer.


Rolling Our Own Sync Orchestrator

We asked ourselves a simple question: Can we build a small Python CLI that does what Progressive Syncs should have done? We wanted something simple: just a wrapper around ArgoCD that we could plug into any workflow engine and trigger from a pipeline.

These were the requirements:

  • Generic by design: It should know nothing about our environment; the user provides an application selector, a project, and a revision. That way, it works just as well in new environments, locally, and within an orchestrator. If our pipeline fails, we can still run the tool by hand whenever we need to quickly sync a set of apps.

  • Asynchronous by default: apps are synced concurrently.

  • Dry Run mode and tests: a dry run shows what would change before actually syncing any apps.


How Our Sync Orchestrator Works

Given a group of applications that match a filter and a desired revision, the tool follows a simple workflow:

  1. List all the apps in that group. 

  2. For each app (concurrently):

    • Wait until the app’s targetRevision == revision

    • Check status:

      1. If the app was already synced and healthy, skip this app

      2. If there’s an in-progress sync, just wait for its health, do not sync

      3. If none of the above apply, then sync and wait for healthiness
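The per-app state machine above can be sketched with asyncio. This is a simplified model, not the actual CLI — the real tool talks to the ArgoCD API, whereas here the sync and health-wait operations are injected as callables:

```python
import asyncio


class App:
    """Minimal stand-in for an ArgoCD Application's status."""

    def __init__(self, name, target_revision, sync_status, health):
        self.name = name
        self.target_revision = target_revision
        self.sync_status = sync_status  # "Synced" | "OutOfSync" | "Syncing"
        self.health = health            # "Healthy" | "Progressing" | ...


async def handle_app(app: App, revision: str, sync, wait_healthy) -> str:
    # 1. Wait until the app targets the desired revision.
    while app.target_revision != revision:
        await asyncio.sleep(0.01)
    # 2a. Already synced and healthy: skip.
    if app.sync_status == "Synced" and app.health == "Healthy":
        return "skipped"
    # 2b. A sync is already in flight: just wait for health.
    if app.sync_status == "Syncing":
        await wait_healthy(app)
        return "waited"
    # 2c. Otherwise, sync and wait for healthiness.
    await sync(app)
    await wait_healthy(app)
    return "synced"


async def sync_group(apps, revision, sync, wait_healthy):
    # All apps in the group are handled concurrently.
    return await asyncio.gather(
        *(handle_app(a, revision, sync, wait_healthy) for a in apps)
    )
```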

A typical execution looks like this:

uv run main.py --project=<PROJECT> --revision=0123456789abcedefghi --selector='appName=dummy,env=test'

Once we validated this simple tool worked as expected, we moved on to the next piece of the puzzle — orchestration.


Adding a Workflow Engine

With a reliable sync primitive in place, we still needed a way to sequence promotions across environments, introduce human checkpoints when required, and adapt the rollout behaviour depending on the change.

That orchestration layer needed to meet a few core requirements:

  • Support for pause, resume, and termination. Since we didn’t have the rigidity of Progressive Syncs, we wanted the ability for a human to intervene — to decide whether an application is ready to start the promotion process, to promote to a specific environment, or to let a sync soak for a given amount of time.

  • Run containers in Kubernetes. Stay close to ArgoCD.

It turns out that our team was already evaluating Argo Workflows as an automation engine for some very specific data pipelines. It met every requirement — and we even discovered other teams in the community were using it alongside ArgoCD.

Since Argo Workflows exposes CRDs, all we had to do was package our WorkflowTemplates in a Helm chart and deploy them.

Each application promotion is defined by a simple WorkflowTemplate. The configuration is minimal and expressive:

promotions:
  - selector: appName=dummy,env=test
    manualJudgement: true
    soak: 1h
  - selector: appName=dummy,env=stg
  - selector: appName=dummy,env=prd
    manualJudgement: true

Revision and projects are global values per application, passed as parameters when triggering the workflow in the CD pipeline.
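Conceptually, the workflow engine walks the promotion list in order, inserting gates where the configuration asks for them. A toy rendering of that traversal (not our actual WorkflowTemplate logic; the step names are made up for illustration):

```python
# Mirrors the promotions config shown above, as plain Python data.
promotions = [
    {"selector": "appName=dummy,env=test", "manualJudgement": True, "soak": "1h"},
    {"selector": "appName=dummy,env=stg"},
    {"selector": "appName=dummy,env=prd", "manualJudgement": True},
]


def plan(promotions):
    """Expand the promotion config into an ordered list of pipeline steps:
    an optional manual gate, the sync itself, then an optional soak."""
    steps = []
    for p in promotions:
        if p.get("manualJudgement"):
            steps.append(("judge", p["selector"]))
        steps.append(("sync", p["selector"]))
        if "soak" in p:
            steps.append(("soak", p["soak"]))
    return steps


for step in plan(promotions):
    print(step)
```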


How Our CD Pipeline Looks

Here’s what our pipeline now does end-to-end:

  1. Determine the project based on where the application is defined in the repo.

  2. Get the revision from the last commit that modified the app’s Helm chart.

  3. Update the ApplicationSet with the new revision.

  4. Update the app’s WorkflowTemplate with the latest promotion configuration.

  5. Submit the workflow, using Argo Workflows’ CLI:

argo submit --from "workflowtemplate/promote-${project}-${app}" -p revision="${revision}"

We haven’t encountered any major problems with this new pipeline. Apps are synced in the expected order. The pipeline also gave us more flexibility: not all the changes require the same level of caution. A major component upgrade may require gradual rollouts, manual judgement, or soak periods, while a simple label change can often be safely synced across all environments in one step.


Conclusion

This setup gave us what we were missing from Progressive Syncs — control, visibility, and reliability — even if it meant stepping outside the pure GitOps playbook.

In the end, we realized that what matters most isn’t sticking to a doctrine, but delivering software safely and predictably at scale.


You can explore Fastly’s platform with a free developer account and experiment with delivery workflows and edge services at your own pace.