You probably don't need zero downtime

Today, Knock published an article on how they upgraded their RDS instances with zero downtime.

They use an Elixir tech stack, which has a special place in my heart.

One thing I want to comment on: before doing upgrades and migrations like this, consider whether you actually need them to be zero downtime.

Companies often spend a lot of engineering time making upgrades and migrations zero downtime without considering the return on investment.

There’s a high chance that whatever you’re working on doesn’t need to be up 100% of the time. We’re not all working on NASA Mars Rovers.
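To put a number on "not 100%", here is a minimal sketch that converts an availability target into a downtime budget. The function name `downtime_budget_minutes` and the targets in the loop are my own illustrative choices, not figures from Knock's article or from any particular SLA.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime allowed per period at a given availability target.

    Hypothetical helper for illustration only; the 365-day default is an assumption.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Example targets chosen for illustration, not taken from a real SLA.
    for target in (99.0, 99.9, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

A 99.9% target works out to roughly 8.8 hours of allowed downtime per year, which leaves room for a short planned maintenance window.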

If you think about it, incidents happen all the time, and they already cause unplanned downtime. When I worked at Venmo, we had occasional outages.

You can drastically reduce your engineering burden by setting clear expectations with your customers early.

Valve Software does maintenance on Steam every Tuesday night. Gamers generally know about it and accept the brief outages. Valve is one of the most profitable companies per employee.

When planning infrastructure changes, don’t assume they must be zero downtime.

Most of the time it’s not worth the investment.

