How Fastly used Kubernetes to scale our platform engineering practice
About five years ago, Fastly had a problem with scale. No, not our network. Fastly’s network continues to scale effortlessly, including recently breezing past a 353 Tbps* (terabits per second) capacity threshold we’ve been tracking internally. No, our problem was scaling how our dev teams worked together and the shared resources they used. That’s a common problem for any company with a vital and growing engineering function like Fastly, but for us, it came with a unique twist — because Fastly is one of the few companies on which the entire internet relies, and because our whole thing is instant digital experiences, our solution to internal scale had to not only be reliable and resilient but also very, very fast.
Enter Fastly’s Cloud and Container Services team. In 2020, the Platform Engineering team—now Foundation Engineering—was exploring ways to make Fastly’s engineering teams more effective and efficient. Around that time, a new engineering paradigm was gaining steam. Platform engineering is the practice of “designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era.” One of the key tools used in a platform-engineering-focused organization is an Internal Development Platform (IDP). IDPs greatly benefit individual engineers and the organizations they work for because they centralize control for cloud resources, security policies, user management, and more. In other words, they keep engineers focused on productivity and make it easy for organizations to allocate resources, onboard new hires, and more.
Today, the IDP our Foundation Engineering team built is called Elevation. To understand how it works, I chatted with Danny Kulchinsky, one of the original members of Fastly’s Cloud and Container Services team.
Here’s how Fastly’s Elevation platform works
A platform like Elevation aims to provide a standardized interface and user experience for all of Fastly’s developers. Specifically, its current role in Fastly’s architecture is to provide common, centrally owned infrastructure for the development teams building applications that control crucial aspects of our network, like AutoPilot, which automagically load balances traffic between our Points of Presence (POPs) to improve performance, or Neptune, which runs Fastly’s TLS features. Previously, Fastly ran these kinds of applications with custom Chef cookbooks per application, which meant a lot of maintenance for each engineering team: not only writing cookbooks but also figuring out how to deploy the application, patching the servers, fixing downtime when it happens (which isn’t always), and the list goes on.
At its core, Elevation is Kubernetes (and many other tools from the Kubernetes ecosystem). Rather than individually managing their infrastructure, teams produce container images with standardized deployment patterns, enabling them to simply define where and how they want to deploy their application. From there, our Foundation Engineering team uses controllers to perform all the necessary initialization, secrets management, and auto-scaling processes. What’s more, Elevation uses custom controllers to ensure that our workloads stay in policy over the long term too.
“So what we've done is built a controller that sits on each of the Elevation clusters. Once it detects that a new namespace is created, it automatically talks to Vault—an open source secrets storage system—and creates the secret namespace, the relevant policies, roles, and all the necessary machinery for the users to get started. If we need to change the policy over time, that gets rolled out automatically by the controller, too,” said Danny.
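The pattern Danny describes can be sketched as an idempotent reconcile step. This is a minimal illustration in Python, not Fastly's controller: `FakeVault`, `reconcile`, and the integer policy version are hypothetical names invented for the sketch, and the in-memory class stands in for a real secrets store rather than reproducing the Vault API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeVault:
    """In-memory stand-in for a secrets store (hypothetical, not the Vault API)."""
    namespaces: set = field(default_factory=set)
    policies: dict = field(default_factory=dict)  # namespace -> policy version

    def ensure_namespace(self, name: str) -> bool:
        """Create the secret namespace if missing; return True if it was created."""
        if name in self.namespaces:
            return False
        self.namespaces.add(name)
        return True

    def ensure_policy(self, name: str, version: int) -> bool:
        """Write the policy if absent or outdated; return True if it changed."""
        if self.policies.get(name) == version:
            return False
        self.policies[name] = version
        return True


def reconcile(cluster_namespaces, vault, policy_version):
    """Bring the secrets store in line with the cluster's namespaces.

    Returns the list of actions taken. Calling it again with nothing
    changed returns an empty list, which is what makes it safe to run
    on every namespace event and on every policy rollout.
    """
    actions = []
    for ns in sorted(cluster_namespaces):
        if vault.ensure_namespace(ns):
            actions.append(f"created secret namespace {ns}")
        if vault.ensure_policy(ns, policy_version):
            actions.append(f"applied policy v{policy_version} to {ns}")
    return actions
```

In this sketch, a new namespace and a changed policy are handled by the same loop: the controller compares desired state with actual state and only acts on the difference, so a policy change rolls out to existing namespaces on the next pass.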
Driving adoption and the future of Elevation
Elevation’s success is largely due to the Cloud and Container Services team’s thoughtful planning, execution, and internal advocacy. The success and positive reviews from migrated teams haven’t hurt either, as Elevation has grown to serve 200+ services and 40+ teams and projects across Fastly.
“First and foremost, we knew Elevation needed to be very reliable and resilient but also simple. Because it is an adjustment for the engineering teams, and if it’s too hard or the benefits aren’t clear, they won’t adopt it. And it took quite a while to get the confidence of the various engineering teams because, at the beginning, nobody wanted to use Kubernetes. It was very new, there were a lot of jokes around it, and it took quite a bit of effort on our part to prove and demonstrate that this is a reliable and worthwhile platform to use. But since we started, not a single team that has made the switch so far has regretted it. They've all felt that they were better off than they were before.”
When asked how the team wants to grow the platform next, Danny said their main focus is always ensuring that our development teams continue to have a good experience using Elevation, even as its user base and complexity grow. He has to say “no” or “next year” to more proposed features than he did in Elevation’s early days, but the team’s aim remains the same: giving our users the freedom to operate independently while ensuring they don’t break someone else’s service by mistake.
Beyond Kubernetes itself, Elevation relies on Prometheus and Thanos for monitoring, with Fluentd for automated metrics and log collection. But perhaps the most versatile tool from the Kubernetes ecosystem for Fastly is Kyverno, the policy engine. Its ability to mutate, validate, and generate resources when they’re created or updated makes it especially powerful for Fastly. For example, if a developer tries to do something with Fastly’s infrastructure that is insecure or out of policy, such as running an application as root, Kyverno checks the deployment manifest against our validation policies and blocks the app from running.
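To make the run-as-root example concrete, here is a rough sketch of the kind of check a validation rule performs. Real Kyverno policies are declared as Kubernetes resources in YAML, not written in Python, and Fastly's actual policies are not shown in this post. The function names below are hypothetical, and the manifest is modeled as a plain dictionary using the standard Kubernetes `securityContext` fields `runAsNonRoot` and `runAsUser`.

```python
def may_run_as_root(container: dict, pod_security: dict) -> bool:
    """Return True if the container is not guaranteed to run as non-root."""
    ctx = container.get("securityContext", {})
    # Container-level settings take precedence over pod-level ones.
    non_root = ctx.get("runAsNonRoot", pod_security.get("runAsNonRoot", False))
    user = ctx.get("runAsUser", pod_security.get("runAsUser"))
    # Violation if non-root is not enforced, or UID 0 is requested explicitly.
    return (not non_root) or user == 0


def validate_pod(manifest: dict) -> list:
    """Return a list of violation messages; an empty list means 'admit'."""
    spec = manifest.get("spec", {})
    pod_security = spec.get("securityContext", {})
    violations = []
    for container in spec.get("containers", []):
        if may_run_as_root(container, pod_security):
            name = container.get("name", "?")
            violations.append(f"container {name} may run as root")
    return violations
```

A manifest with no `securityContext` at all fails this check, because nothing guarantees the container runs as a non-root user. Denying by default is the design choice that matters here: the developer has to opt in to the safe setting explicitly rather than the platform assuming it.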
Breaking glass
Abstracting infrastructure streamlines the software development lifecycle, but what about emergency scenarios, like an incident, when dev teams may need extended permissions to fix an issue? The Platform Engineering team thought of that. The Fastly development teams using Elevation have access to a unique automation called Break Glass, built using Kyverno, which extends their permissions on our production clusters.
“So essentially, service owners have a specific set of permissions for what they can do in a production environment. Generally, we don’t allow certain actions because they’re considered risky from a security perspective or we have some compliance requirements where we cannot just apply changes directly, we need to go through an approval process. But if there's an emergency, if there's an incident and the engineer has to go in and do something to fix it, they can Break Glass. Once they do so, two things happen. One is they get elevated permissions that are time-scoped, starting at two hours, but they can extend if needed. The second is that comprehensive audit tracing initiates at the same time. We know who did the Break Glass and what they did, so we can go back post-event and understand what happened. This feature has helped us reduce lag time in responding to incidents since it’s completely self-serve while ensuring that we are always compliant,” said Danny.
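The two properties Danny highlights, time-scoped elevation and an audit trail that starts at the same moment, can be modeled in a few lines. This is only an illustrative sketch: `BreakGlassSession` and `break_glass` are hypothetical names, the two-hour window comes from the quote above, and the real feature is built on Kyverno and Kubernetes permissions rather than on an in-process object like this one.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

DEFAULT_WINDOW = timedelta(hours=2)  # starting window mentioned in the quote


@dataclass
class BreakGlassSession:
    user: str
    started: datetime
    expires: datetime
    audit_log: list = field(default_factory=list)  # (time, user, action) tuples

    def is_active(self, now: datetime) -> bool:
        return self.started <= now < self.expires

    def perform(self, now: datetime, action: str) -> None:
        """Record an elevated action; refuse it once the window has closed."""
        if not self.is_active(now):
            raise PermissionError("session expired; break glass again")
        self.audit_log.append((now, self.user, action))

    def extend(self, now: datetime, extra: timedelta = DEFAULT_WINDOW) -> None:
        """Push the expiry out, but only while the session is still active."""
        if not self.is_active(now):
            raise PermissionError("session expired; break glass again")
        self.expires += extra
        self.audit_log.append((now, self.user, "extend"))


def break_glass(user: str, now: datetime) -> BreakGlassSession:
    """Open a session; the audit trail begins with the break-glass event itself."""
    session = BreakGlassSession(user, now, now + DEFAULT_WINDOW)
    session.audit_log.append((now, user, "break-glass"))
    return session
```

Because every elevated action passes through `perform`, the audit log and the permission check cannot drift apart in this model: anything that was allowed was also recorded, which is what makes the post-event review possible.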
Learn how Fastly can help you scale your internal development platform to success. Sign up for a free account or join the conversation in our forum.
* 353 Tbps of connected global capacity as of June 30, 2024