When you are building things meant to be run in the public cloud, you should expect random, unexpected outages as part of your design. It's not a bug; that's just how it is.

Today, I dealt with an AWS EKS cluster randomly losing its connection to one of its nodes.

Well, I can't even be sure if it's an AWS issue or a Kubernetes issue. Our services were redundant enough, so there was no impact. 🤷‍♂️

@njoseph_1 Crucially, I find this to be true for on-premises (or actually any other) types of deployment too.

It's just that folks sometimes equate public cloud with 100% uptime, while in reality it just means that once things go down, fixing the root cause is out of your hands. (Which can be either a good or a bad thing.)

The only real remedy is redundancy, so congrats on sailing through it without impact.

@antolius Thanks.

It's one thing reading about it in books and quite another to experience it first-hand.

I agree that this could happen in any deployment and the system should be resilient enough to handle failures like this.

