Deploy More

March 21, 2024

Here’s an excerpt from Stripe’s 2023 annual letter:

Teams at Stripe work tirelessly to deliver this industry-leading reliability, and we decided we should share a little about how they accomplish it. Since many outages at internet companies are in some way triggered by a change gone wrong, we can do so by walking through how new code gets deployed at Stripe— something that happens to one of our core API services approximately 400 times in a typical day.

Once a change is code-complete, it is evaluated by a battery of around 1.4 million tests. Stripe uses half a million CPU cores to execute more than 6 billion test runs each day. Tests ramp up in scope: from simple style checks, to unit tests that verify each component in isolation, to integration tests that verify that end-to-end systems work as expected. (These tests are often designed to exercise edge cases. For example, we ensure that all code works correctly at unusual times—on a leap day, or during a leap second.) Needless to say, if the change fails any tests, further deployment halts.

Once we’ve shown that a change works in theory, it’s time to ensure that it works in practice. Changes are rolled out carefully and incrementally, like one of those progressive allergy tests that involves first rubbing the peanut on your skin, then touching it on the edge of your lip, and then just nibbling the peanut to see if you break out in hives at any point along the way.

Changes first go to pre-production, a mock production environment with synthetic API traffic designed to mimic realistic integration patterns. Here we check that the change can not only be safely rolled out to production, but also that it can be safely undone if required. Following this, the change then rolls out to a single production machine with a small sliver of traffic, before gradually advancing to 0.5%, then 1%, then 5%, then 20%, and so on, of actual production traffic—with pauses along the way to observe the effects (no swollen tongue!). We roll all changes out to test-mode API traffic before hitting live-mode traffic.

Each progressive rollout is inspected against 55,000 different metrics. If at any moment the system detects anomalous telemetry, the machines running the new code are automatically withdrawn from the pool and traffic is redirected to machines running an older, known-good version.
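
The staged rollout with automatic rollback described in the excerpt can be sketched in a few lines of Python. This is only an illustration, not Stripe's implementation: `progressive_rollout`, `set_traffic`, and `is_healthy` are hypothetical names, the final 100% stage is an assumption standing in for "and so on", and a real system would pause and watch metrics at each stage rather than run a single check.

```python
from typing import Callable, Sequence

# Traffic fractions from the excerpt (0.5%, 1%, 5%, 20%).
# The final 1.0 (100%) stage is an assumption standing in for "and so on".
STAGES: Sequence[float] = (0.005, 0.01, 0.05, 0.20, 1.0)


def progressive_rollout(
    set_traffic: Callable[[float], None],
    is_healthy: Callable[[], bool],
    stages: Sequence[float] = STAGES,
) -> bool:
    """Advance through the stages, rolling back on the first failed check.

    Returns True if the change reached the final stage, False if it was
    rolled back.
    """
    for fraction in stages:
        # Hypothetical hook: route this fraction of traffic to the new code.
        set_traffic(fraction)
        # Hypothetical hook: one health check stands in for the pause and
        # metric inspection that a real rollout would perform here.
        if not is_healthy():
            # Withdraw the new code so the known-good version serves everything.
            set_traffic(0.0)
            return False
    return True


# Usage: record each traffic setting and simulate a failure at the 5% stage.
history: list[float] = []
checks = iter([True, True, False])
reached_full = progressive_rollout(history.append, lambda: next(checks))
# reached_full is False; history is [0.005, 0.01, 0.05, 0.0]
```

Because each step is small and the rollback is automatic, a bad change in this sketch only ever reaches a small fraction of traffic before it is withdrawn.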

If you want to move faster, you need to be able to deploy many small changes quickly and safely.
