The Chaos Monkey

Netflix has a server architecture that currently serves a pretty high percentage of all of the internet’s traffic, due to their streaming video service.

One of the most interesting things about their server architecture is that they routinely attack their own systems. They have a tool called Chaos Monkey that randomly disables their own production instances to make sure they can survive that common type of failure without any customer impact.

Because there are several ways in which servers can fail, they’ve also employed a fleet of monkeys that attack all manner of servers — some that are too slow, some that aren’t connected up to the proper server groups, some that just look weird, etc. And finally, there’s a Chaos Gorilla that doesn’t just turn off individual servers, but occasionally wipes out an entire availability zone, as if Godzilla had destroyed an entire portion of the country.

The philosophy is simple: by building a server architecture that expects failure, the system as a whole can learn how to withstand bigger and tougher obstacles even if they don’t know exactly when or how they will occur in real life.



A thoroughly interesting approach, and one that is consistent with systems theory. Ostensibly stable systems are often more fragile than systems that are confronted with disturbances more frequently.