
Chaos Engineering: Preparing for the Worst

Written by Patrick Walsh | Oct 23, 2020 4:06:00 PM


Was it all just a dream? You woke up to the sounds of incessant notifications, automated alerts, customer complaints, and support team emails flooding your inbox. Everything is down… the whole system is either at a crawl or completely stopped. It is time to scramble. How many customers lost? How many SLAs breached? What is this going to cost?

In the digital world, we serve customers through technology. If services are down, we lose money, plain and simple. Nonfunctional requirements are just as critical as functional requirements. Resiliency, scalability, fault tolerance, availability, and 99.999% uptime aren’t just buzzwords; they can make or break the success of software solutions and, in turn, businesses. And just like any functional requirement, they need test plans, cases, and scenarios to validate that they have been met.
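To put a number on that uptime target, here is a minimal sketch of the arithmetic. It assumes a 365-day year, and the function name is ours, chosen for illustration. At 99.999% availability, the yearly downtime budget works out to roughly 5.26 minutes.

```python
# Downtime budget per year for a given availability target.
# Assumption: a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why a single slow recovery can consume the whole year's allowance.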

This is where chaos engineering comes in. The term was coined by engineers at Netflix, and it refers to the technique of validating the resiliency of networked applications by intentionally and randomly terminating servers at runtime and observing the effects. In non-production environments with dev-prod parity, engineers can comfortably run chaotic scenarios to ensure that a hardened system goes live. We can shut down virtual machines and databases to test the not-so-happy path and make sure that services can handle unexpected events with retry logic, timeouts, error messaging, and the like.
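As a rough illustration of testing the not-so-happy path, the sketch below injects failures into a stand-in dependency and checks that a caller with retry logic copes. Everything here is hypothetical: `FlakyService`, `call_with_retry`, and the failure rate are names and values we made up for the example, and a real experiment would target actual infrastructure rather than an in-memory object.

```python
import random
import time


class FlakyService:
    """Hypothetical stand-in for a dependency that fails with a given probability."""

    def __init__(self, failure_rate: float, rng: random.Random):
        self.failure_rate = failure_rate
        self.rng = rng
        self.calls = 0  # how many times the service was invoked

    def call(self) -> str:
        self.calls += 1
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return "ok"


def call_with_retry(service: FlakyService, attempts: int = 5, base_delay: float = 0.01) -> str:
    """Call the service, retrying with exponential backoff on ConnectionError.

    Re-raises the last ConnectionError once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return service.call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # wait longer after each failure
    raise AssertionError("loop always returns or raises")


if __name__ == "__main__":
    # Inject a 30% failure rate and see how many requests still succeed.
    service = FlakyService(failure_rate=0.3, rng=random.Random(42))
    succeeded = 0
    for _ in range(100):
        try:
            call_with_retry(service, base_delay=0.001)
            succeeded += 1
        except ConnectionError:
            pass
    print(f"{succeeded}/100 requests succeeded after {service.calls} calls")
```

The point of the exercise is the observation step: if too many requests still fail under injected faults, the retry and timeout settings need another look before go-live.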

Netflix is a pioneer in the microservice space and has been on the bleeding edge of distributed systems in cloud-native environments. Drawing on the best practices it has learned while developing, monitoring, and scaling microservices, Netflix consistently releases open-source software to support a variety of use cases.

For chaos engineering, Netflix developed the Simian Army tools, most notably Chaos Monkey and Chaos Kong. Chaos Monkey has only one job: to terminate virtual machines and containers. As explained by Netflix: “exposing engineers to failures more frequently incentivizes them to build resilient services.” Chaos Kong takes this principle further by shutting down entire AWS regions, preparing engineers for the worst. In fact, Netflix has Chaos Monkey running in production and during business hours to keep engineers prepared. Engineers need to be in the mindset that outages should be expected.
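To make the idea concrete, here is a toy simulation of that "one job." This is not Netflix's implementation and does not call any cloud API: the fleet is a plain dictionary, the instance IDs are invented, and the 9-to-15 weekday window is our assumption about what "business hours" might mean.

```python
import random
from datetime import datetime
from typing import Dict, List, Optional


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 15) -> bool:
    """True on weekdays from start_hour (inclusive) to end_hour (exclusive).

    The window itself is an assumption made for this sketch.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances: List[str], rng: random.Random) -> Optional[str]:
    """Pick one instance at random, or None if the group is empty."""
    return rng.choice(instances) if instances else None


def run_chaos_round(fleet: Dict[str, List[str]], now: datetime, rng: random.Random) -> List[str]:
    """Remove at most one random instance per group and return the removed IDs.

    Does nothing outside business hours, so engineers are around to respond.
    """
    if not in_business_hours(now):
        return []
    terminated = []
    for instances in fleet.values():
        victim = pick_victim(instances, rng)
        if victim is not None:
            instances.remove(victim)  # simulated termination only
            terminated.append(victim)
    return terminated


if __name__ == "__main__":
    fleet = {"api": ["i-api-1", "i-api-2", "i-api-3"], "search": ["i-search-1", "i-search-2"]}
    killed = run_chaos_round(fleet, datetime(2020, 10, 21, 11, 0), random.Random())
    print("terminated:", killed)
    print("remaining:", fleet)
```

Restricting the rounds to working hours reflects the idea above: failures should happen when engineers are at their desks and can learn from them, not at 3 a.m.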

In Q1 of 2020, with stay-at-home orders issued en masse, it was only natural for certain companies to see a spike in new subscribers and users. Netflix added 15.77 million paid subscribers, who joined the platform to stream their way through the COVID-19 pandemic. Without the rapid scalability of the cloud, few businesses could survive that sort of surge in user base. Netflix can survive an entire AWS region blackout with minimal overall performance degradation, so it’s no surprise that it can handle a spike about the size of a major city.

Chaos engineering goes hand in hand with great tech talent, and SkillStorm is here to help you find the perfect fit for your IT team. If you’re looking to learn more about how SkillStorm can help your organization find tech talent trained in chaos engineering, shoot us an email at hire@skillstorm.com.